System Virtualization and OS Virtual Machines
=============================================
-:Date: 2013-10-29
+:Date: 2013-12-19
:Authors: Ivan Boule, Olivier Matz
Plan
Contents
--------
-.. contents::
- :depth: 2
- :backlinks: none
+- History of Virtualization
+- Virtualization Usage and Taxonomy
+- Process Level Virtualization
-History
-=======
+ - ABI Emulation
+ - Virtual Servers
+
+- System Level Virtualization
+
+ - Transparent Hardware Emulation
+ - Transparent Hardware Virtualization
+ - Paravirtualization
+ - Hardware-Assisted Virtualization
+
+- Conclusion
+
+Who am I?
+---------
+
+- Olivier MATZ ``<olivier.matz@6wind.com>``
+- Software engineer at 6WIND for 10 years
+- 6WIND is a software company designing high performance network
+ software
+
+ - http://www.6wind.com
+
+- I'm mainly developing low-level code: Linux kernel, drivers and
+ applications close to the operating system
History of Virtual Machines
----------------------------
+===========================
-- VM introduced in the sixties on IBM/370 series
+Sixties and seventies: IBM System/360 and System/370
+----------------------------------------------------
-- Co-Designed VM: IBM AS/400
+- Generalization of virtual memory
+- Microprogramming of some instructions on the smaller models
+- CP/CMS hypervisor
+
+.. figure:: ibm370.jpg
+ :width: 60%
+
+.. note::
+
+ - IBM/370: generalization of virtual memory
+ - IBM/370: some instructions are microprogrammed on the smaller
+ models
+ - IBM/370: CP/CMS hypervisor (Control Program/Conversational
+ Monitoring System), which managed virtual machines able to run
+ CMS, DOS or OS guests interchangeably. It was offered to customers
+ for the duration of their DOS-to-OS migrations, and was often kept
+ afterwards because CMS was so convenient as a time-sharing system.
+
+ The VM/370 product, created by IBM in the 1970s, let several users
+ time-share a computer running the IBM DOS operating system. IBM
+ DOS alone did not support time sharing.
+
+ - time sharing between VMs
+
+Eighties: IBM AS/400
+--------------------
+
+- Many logical machines in one physical machine
+- High level (virtual) ISA including I/Os (TIMI)
+
+ - Take advantage of advances in hardware without recompilation
+ - User binaries contain both TIMI instructions and machine instructions
+ - Easier transition to PowerPC
+
+.. figure:: ibm-as400.jpg
+ :width: 40%
+
+.. note::
+
+ - IBM AS/400: a minicomputer from the IBM range, late 1980s
+ - IBM AS/400: several logical machines can be "carved out" of one
+ physical machine.
+ - IBM AS/400: a program does not talk to the hardware directly; it
+ uses a high-level instruction set (ISA), which makes the program
+ independent of the CPU it runs on. This eased the transition to
+ PowerPC.
+ - http://en.wikipedia.org/wiki/IBM_System_i
+ - XXX IBM AS/400: why "co-designed VM"? XXX look it up
+ - emulation of "high-level" CPU instructions
+ XXX check how it works: is it a hypervisor or an interpreter?
+
+Nineties and later: application VMs
+-----------------------------------
+
+.. figure:: java.png
+ :width: 15%
+
+- Java
- - High level ISA including I/Os
- - Proprietary CISC → PowerPC
+ - a Java program is compiled into a portable bytecode
+ - the JVM is an abstract machine that runs this bytecode
-- Application VMs
+- Microsoft Common Language Infrastructure (.Net)
- - Sun Java, Microsoft Common Language Infrastructure
+.. note::
-- OS VMs
+ - http://en.wikipedia.org/wiki/Java_virtual_machine
+
+Now: OS virtual machines
+------------------------
- - VMware (virtualized PC on x86)
+- Run an operating system on top of a virtual machine
+- Examples:
+
+ - VMware products (virtualized PC on x86)
+ - KVM
- Virtual PC (PC emulation on Mac OS/PowerPC)
- Many others : Bochs, VirtualBox, Qemu, ...
- Reduction of Total Cost of Ownership (TCO)
- Increase utilisation of server resources
+ - Spawn new servers "on demand" (e.g. Amazon EC2 and Elastic Load
+ Balancer)
- Reduction of Total Cost of Functioning
- Cooling
- Occupied Space
-- Hardware Consolidation
-
-- Reduction of Build Of Material (BOM) for high-volume low-end
- products
-
-- Isolation of OS for security purposes
-
+- Isolation of OS for security purposes (Qubes, Cells)
+
+.. note::
+
+ - TCO + TCF reduction: discuss the data center case. Mention live
+ migration, elasticity, ...
+ - Amazon EC2:
+
+ - a customer can create virtual machines on demand
+ - Elastic Load Balancer: ELBs spread the load across EC2
+ instances
+ - Auto Scaling: automatically manages elasticity for one or more
+ groups of EC2 instances
+ - CloudWatch: tracks and monitors EC2 instance metrics in order
+ to send notifications or take actions
+
+ - "qubes" (security) http://qubes-os.org/trac/wiki/QubesScreenshots
+
+ - Based on a secure bare-metal hypervisor (Xen)
+ - Networking code sand-boxed in an unprivileged VM (using IOMMU/VT-d)
+ - USB stacks and drivers sand-boxed in an unprivileged VM (currently
+ experimental feature)
+ - No networking code in the privileged domain (dom0)
+ - All user applications run in “AppVMs”, lightweight VMs based on
+ Linux
+ - Centralized updates of all AppVMs based on the same template
+ - Qubes GUI virtualization presents applications as if they were
+ running locally
+ - Qubes GUI provides isolation between apps sharing the same desktop
+ - Secure system boot (optional)
Virtualization in high-throughput network equipment
----------------------------------------------------
-.. figure:: high-thput1.jpg
-
-.. figure:: high-thput2.jpg
-
-Virtualization in Multimedia devices
-------------------------------------
-
-- Reduction of Build Of Material (BOM) for high-volume low-end
- products
-
- - No need for a general purpose processor
-
- - 20 to 25 % BOM reduction
+.. figure:: high-thput.svg
+ :width: 100%
- - Run Linux together with OS supporting Codecs on a single TI DSP
+.. note::
- - Leverage Linux environment
+ - Initially, the system runs on several legacy boards (plus the
+ management board running Linux). When the hardware is upgraded and
+ the new board is powerful enough, the legacy boards can be
+ virtualized on it without modifying the software.
- - Reuse existing DSP software
+ dataplane + control plane -> on a single board
-XXX 2 images
+ - Recap what was said on the previous slide
Usages of Virtual Machines
--------------------------
- Web sites hosting
-- OS partitionning
-
- - Time sharing
- - Security
-
- OS/kernel education & training
- OS fault recovery
- Run applications not supported by host OS
+- OS migration to new hardware without reinstalling
+
+.. note::
+
+ - time sharing: we want to use several OSes on the same machine;
+ analogy with several processes.
+
+ - education & training: think of a lab session, as presented in the
+ Linux Mag 140 article about libvirt: each student works on a
+ preconfigured virtual machine XXX to be reviewed
+
+ - backward compatibility: point out that it is useful when the
+ hardware is no longer available, for example.
+
+ - run app not supported by host OS: wine
+
+ - Some services are only available at OS level (routing, filtering,
+ ...). Running several OSes makes it possible to duplicate them
+ (e.g. TCP daisy chain with VRs)
+
Recovery Servers
----------------
+- Another example: one backup server to replace any machine
+
.. figure:: recovery.png
+ :width: 100%
+
+.. note::
+ - Virtualization provides high availability on the cheap. Software
+ is often what crashes. A whole network architecture can be
+ duplicated:
+ - apache
+ - mySQL
+ - mail
+ - etc...
+
+ - A single backup server (on the right) covers all the other
+ servers, which avoids needing 8 machines. If one of the 4 goes
+ down, the one on the right takes over.
+
+ - indeed, each machine has its own system/network/filtering
+ configuration... Putting the 4 services on a single machine
+ without virtualization is not necessarily straightforward.
Multi-Core CPU Issues (1)
-------------------------
- Adaptation to multi-pro even more difficult than RTOS
+.. note::
+
+ - case of multi-threaded applications designed with a single-core
+ machine in mind. System virtualization makes it possible to run
+ these applications in parallel on multicore physical machines (each
+ VM being single-core), as explained on the next slide.
+
+ - Many applications are still single-processor. This drastically
+ simplifies coding: no race conditions, no need for locks/mutexes.
+ XXX
+
+ - this problem arises less on a mainstream system than on legacy or
+ real-time systems. Modern mainstream systems support multicore very
+ well, and it would be enough to run several applications
+ simultaneously. XXX
+
+ - some multi-threaded RT applications rely on there being a single
+ CPU, so that 2 threads are never executed truly concurrently
+
Multi-Core CPU Issues (2)
-------------------------
- Scalability managed at virtualization level
+.. note::
+
+ - System virtualization makes it possible to run several instances
+ of a non-SMP operating system on a multicore processor.
+
+ - This can avoid rewriting software designed for a single-core
+ machine. The software in question here is more likely RT software
+ or a kernel; for a standard application, the problem does not
+ arise.
+
Virtualization Taxonomy
=======================
+.. note::
+
+ taxonomy = classification, inventory
+
Machines Interfaces
-------------------
.. figure:: isa-abi.svg
+ :width: 70%
- ISA = Instruction Set Architecture
- ABI = Application Binary Interface
- Process level interface
- - User-level non privileged ISA instructions + OS systems 14 calls
+ - User-level non privileged ISA instructions + OS system calls
+
+.. note::
+
+ - ISA: Instruction Set Architecture
+
+ the CPU instructions (give examples such as MOV, or CLI/STI to
+ mask interrupts), the devices, the MMU (how it must be
+ configured), ...
+
+ This is the interface used by the operating system.
+
+ - ABI: Application Binary Interface
+
+ This is the interface that lets a process communicate with the
+ outside world. It mainly consists of system calls (read, write,
+ gettimeofday, execve, sleep).
+
+ the ABI contains the non-privileged instructions + the OS API.
+ Other instructions such as cli/sti are not part of the ABI.
+
+ - example of the compatibility layer for a 32-bit application
+ running on a 64-bit kernel.
Virtualization Taxonomy
-----------------------
-- Process level virtualization
+- Virtualization at process level (ABI)
- Emulation of Operating System ABI
- - Emulation of OS ABI, cross-architecture
- Virtual Servers
-- System level virtualization
+- Virtualization at system level (ISA)
+
+ - Standalone vs Hosted Virtualization
+ - Machine Emulation vs Machine Virtualization
+
+.. note::
+
+ - a process already runs in a virtual machine provided by the OS,
+ but not at the same level. Historically, the goal of a
+ multitasking operating system is to provide virtual machines to
+ applications (and therefore to users). Each application "thinks"
+ it is alone on the processor.
+
+ Each application can access resources through system calls, as if
+ it were the only one talking to the devices. It is up to the
+ operating system to schedule processes and their requests.
-- Standalone / Hosted Virtualization
-- Machine Emulation / Machine Virtualization
+ - system virtualization works on the same principle but at a
+ different level. The next slides present the different types of
+ virtualization (standalone vs hosted, and emulation vs
+ virtualization).
Hosted versus Standalone Virtualization
---------------------------------------
- OS run in a VM is named a Guest OS
+.. note::
+
+ - hosted: hosted by another OS
+
+ - guest: the hosted (invited) OS
+
+ - standalone: autonomous, smaller
+
+ - in general, a "hosted" VM does not really access the hardware
+ but emulated devices
+
+ - the KVM case is ambiguous: the kernel running in root mode really
+ executes on the bare hardware.
+
Hosted Virtualization
---------------------
.. figure:: hosted.svg
+ :width: 100%
Example: VMware Workstation
----------------------------
-.. figure:: vmware-wks.png
+.. figure:: vmware-wks.svg
+ :width: 100%
+ :class: fill
- Hosted VM
- Unmodified OSes
-------------------------
.. figure:: standalone.svg
+ :width: 100%
Example: VMware ESX
-------------------
+.. figure:: vmware-esx.svg
+ :width: 100%
+ :class: fill
+
- Standalone VMM
- Supports unmodified OS binaries
- Guest OS
- runs in user mode
-Process Level Virtualization
-============================
+Process Level Virtualization: ABI Emulation
+===========================================
Process level ABI Emulation
---------------------------
- Goal: execute binary applications of a given system **X** on the ABI of
- another system **Y**
+ another system *Y*
-- Emulate system **X** ABI on top of system **Y** ABI
+- Emulate system **X** ABI on top of system *Y* ABI
- Emulation done by application-level code
-- System **Y** must provide services equivalent to those of system
+- System *Y* must provide services equivalent to those of system
**X** (file system, sockets, etc...)
+- Example: **X** = Windows and *Y* = Linux
+
+.. note::
+
+ - example: the Windows CreateFile() call would be emulated with an
+ open() on a Unix system
+
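
As a rough illustration of emulating one system's call on top of another's, the sketch below maps a CreateFile-like request onto a POSIX-style ``open()``. The function name ``emulated_create_file``, the ``WIN_*`` constants and the reduced parameter list are hypothetical and chosen for this example only; this is not how Wine is implemented.

.. code-block:: python

    import os
    import tempfile

    # Hypothetical constants for this sketch (not the real Windows values).
    WIN_GENERIC_READ = 0x1
    WIN_GENERIC_WRITE = 0x2
    WIN_CREATE_ALWAYS = "CREATE_ALWAYS"
    WIN_OPEN_EXISTING = "OPEN_EXISTING"

    def emulated_create_file(path, access, disposition):
        """Translate a CreateFile-like request into an open() call."""
        if access == WIN_GENERIC_READ | WIN_GENERIC_WRITE:
            flags = os.O_RDWR
        elif access == WIN_GENERIC_WRITE:
            flags = os.O_WRONLY
        else:
            flags = os.O_RDONLY
        if disposition == WIN_CREATE_ALWAYS:
            # Create the file if needed and truncate any existing content.
            flags |= os.O_CREAT | os.O_TRUNC
        return os.open(path, flags, 0o644)

    def demo():
        """Write then read back a file through the emulated call."""
        with tempfile.TemporaryDirectory() as directory:
            path = os.path.join(directory, "hello.txt")
            fd = emulated_create_file(path, WIN_GENERIC_WRITE, WIN_CREATE_ALWAYS)
            os.write(fd, b"hello")
            os.close(fd)
            fd = emulated_create_file(path, WIN_GENERIC_READ, WIN_OPEN_EXISTING)
            data = os.read(fd, 16)
            os.close(fd)
            return data

The emulation lives entirely in application-level code: the host kernel only ever sees its own native system calls.
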
Process Level (ABI) Emulators
-----------------------------
-- Wine - Windows Emulator on Unix/Linux
+- Wine: runs Windows applications on POSIX-compliant operating
+ systems
- Windows API in userland
- Adobe Photoshop, Google Picasa, ...
-- Cygwin
+- Cygwin: recompile POSIX applications so they can run under Windows
- Unix emulation on Windows
- POSIX library
- GNU development tool chain (gcc, gdb)
- X Window, GNOME, Apache, sshd, ...
+.. note::
+
+ - **DEMO**: run a .exe with wine64
+ - the ABI depends on the operating system but also on the
+ architecture.
+
+ - system calls differ between Linux and Windows
+ - system calls are also invoked differently on 2 different
+ architectures. For example, on x86, INT 0x80 is used (actually
+ SYSENTER nowadays), and the arguments are placed in specific
+ registers
+
+ - Google Picasa for Linux bundles an embedded version of Wine
+
Process Level Cross-architecture Emulators
------------------------------------------
- Emulated OS and native OS are the same (ex: both are linux)
- Emulated arch is different than native architecture (ex: x86 and
powerpc)
+ - Note: we define what is "emulation" later in the presentation
-- Example: qemu-user::
+- Example: qemu-user
+
+.. code-block:: sh
$ gcc hello.c
$ ./a.out
hello
-
$ powerpc-linux-gnu-gcc -static hello.c
$ ./a.out
bash: ./a.out: cannot execute binary file
$ qemu-ppc ./a.out
hello
+.. note::
+
+ - for example, you get hold of a Freebox or a MIPS- or ARM-based
+ router, and you want to run and debug an application.
+
+Process Level Virtualization: virtual servers
+=============================================
+
Virtual Servers (1)
-------------------
- Single OS kernel / Multiple resource instances
+ - can run several linux distributions on the same kernel
+
- Isolated kernel execution environments
- Root file system
- Network: Routing table, IP tables, interfaces...
- - Process for signals
+ - Process signals
- Solaris 10 Containers
-- LXC, Linux-VServer, openVZ
+- LXC, Linux-VServer, openVZ: namespaces and cgroups
- FreeBSD Jail
+.. note::
+
+ - all processes are seen by the kernel
+
+ - processes have different views of the operating system and are
+ partitioned. They are not aware of neighboring domains and see
+ different system resources (FS, network, ...).
+
+ - Linux namespaces are a good example (lxc, openVZ).
+
+ - XXX think about a demo... ?
+
+ - explain how it can be implemented in the kernel: an extra
+ parameter for each system call
+
+ - mention that, security-wise, the isolation is not quite there yet.
+
+ - see the drawing on the next slide
+
+ - signal -> table of process ?
+
Virtual Servers (2)
-------------------
.. figure:: virtual-servers.svg
+ :width: 100%
Virtual Servers (3)
-------------------
- Con's
- - No OS heterogeneity (no GPOS/RTOS combination)
+ - No OS heterogeneity
- Single OS binary instance (common point of failure)
-Transparent Hardware Emulation
-==============================
+System Level Virtualization: Transparent Hardware Emulation
+===========================================================
Transparent Hardware Emulation (1)
----------------------------------
Transparent Hardware Emulation (2)
----------------------------------
-- Emulate machine X on top of machine Y
+- Emulate machine **X** on top of machine *Y*
-- Interpretation
+- Interpretation: read, decode, execute
- - 1 instruction of X executed by N instructions of Y
+ - 1 instruction of **X** executed by N instructions of *Y*
- Huge slow down method
- Dynamic Binary Translation
- - Convert blocs of X instructions in Y instructions
-
-- Application-level emulator runs on a native OS
-- One VM running a single Guest OS
-
-QEMU Architecture
------------------
-
-.. figure:: qemu.svg
+ - Convert blocks of **X** instructions into *Y* instructions
+ - Conversion is done once per basic block
+ - Advanced: dynamic optimization of 'hot' blocks
+
+- The emulator is usually a standard application running on a native
+ OS
+
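
The difference between interpretation and dynamic binary translation can be sketched on a toy machine. Everything below is an assumption made for illustration: the five-instruction guest ISA, the register names and the use of Python closures as "host code" are invented here and do not describe QEMU's internals.

.. code-block:: python

    # Toy guest program: sums 5 + 4 + 3 + 2 + 1 into r0.
    PROGRAM = [
        ("movi", "r0", 0),    # 0: r0 = 0 (accumulator)
        ("movi", "r1", 5),    # 1: r1 = 5 (loop counter)
        ("add", "r0", "r1"),  # 2: r0 += r1 (loop head)
        ("subi", "r1", 1),    # 3: r1 -= 1
        ("jnz", "r1", 2),     # 4: if r1 != 0 goto 2
        ("halt",),            # 5: stop
    ]

    def interpret(program):
        """Interpretation: fetch, decode and execute one instruction at a time."""
        regs = {"r0": 0, "r1": 0}
        pc = 0
        while True:
            insn = program[pc]      # fetch
            op = insn[0]            # decode
            if op == "movi":        # execute
                regs[insn[1]] = insn[2]
            elif op == "add":
                regs[insn[1]] += regs[insn[2]]
            elif op == "subi":
                regs[insn[1]] -= insn[2]
            elif op == "jnz":
                if regs[insn[1]] != 0:
                    pc = insn[2]
                    continue
            elif op == "halt":
                return regs
            pc += 1

    def translate_block(program, start):
        """Translate one basic block (up to a branch or halt) into closures."""
        ops = []
        pc = start
        while True:
            insn = program[pc]
            op = insn[0]
            if op == "movi":
                ops.append(lambda r, d=insn[1], v=insn[2]: r.__setitem__(d, v))
            elif op == "add":
                ops.append(lambda r, d=insn[1], s=insn[2]: r.__setitem__(d, r[d] + r[s]))
            elif op == "subi":
                ops.append(lambda r, d=insn[1], v=insn[2]: r.__setitem__(d, r[d] - v))
            elif op == "jnz":
                reg, target, fall = insn[1], insn[2], pc + 1
                return ops, (lambda r: target if r[reg] != 0 else fall)
            elif op == "halt":
                return ops, (lambda r: None)
            pc += 1

    def run_translated(program):
        """Run the program block by block, translating each block only once."""
        regs = {"r0": 0, "r1": 0}
        cache = {}              # translation cache: block start -> translated block
        translations = 0
        pc = 0
        while pc is not None:
            if pc not in cache:
                cache[pc] = translate_block(program, pc)
                translations += 1
            ops, next_pc = cache[pc]
            for op in ops:
                op(regs)
            pc = next_pc(regs)
        return regs, translations

In this sketch the loop body is decoded on every iteration by ``interpret``, while ``run_translated`` decodes it once and then reuses the cached block.
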
+.. note::
+
+ - Explain how an emulator can be implemented: it is a big
+ switch/case, each instruction must be parsed and its behavior
+ emulated. The emulator must keep the register state in variables.
+
+ - This is why we end up translating blocks of code. Note that
+ dynamic translation is only done on the fly; taking the binary,
+ converting it and then executing it (static translation) is
+ harder.
+
+ - https://en.wikipedia.org/wiki/Binary_translation
+ - Dynamic binary translation looks at a short sequence of
+ code—typically on the order of a single basic block—then
+ translates it and caches the resulting sequence.
+ - Code is only translated as it is discovered and when possible, and
+ branch instructions are made to point to already translated and
+ saved code (memoization).
+ - Apple Computer implemented a dynamic translating emulator for M68K
+ code in their PowerPC line of Macintoshes, which achieved a very
+ high level of reliability, performance and compatibility
+ - Intel: IA32 over Itanium
QEMU: Hosted Hardware Emulator
------------------------------
- Cross ISA Emulation
- - Emulate machine X on top of machine Y
+ - Emulate machine **X** on top of machine *Y*
- Interpretation + translation
-- Intel x86, PowerPC, ARM, Sparc architectures
+- Intel x86, PowerPC, ARM, Mips, Sparc, ...
- Emulation of SMP architectures
- Hard Disk drives, CD-ROM, network controllers, USB controllers, ...
- Synchronous emulation of device I/O operations
-Transparent Hardware Virtualization
-===================================
+.. note::
+
+ - **DEMO**: run Kid Icarus with mednafen
+ - ``mednafen -vdriver sdl -nes.xscale 4 -nes.yscale 4 ~/cours_ivan/cours_virt/Kid\ Icarus\ \(Europe\)\ \(Rev\ A\).zip``
+ - voir /usr/share/doc/mednafen/mednafen.html
+ - http://idoc64.free.fr/ASM/instruction.htm
+ - QSDZ = dir, ret=start, tab=select, OP=buttons
+ - Alt-D opens the debugger
+ - address A6 decreases when lives are lost
+ - shift-W: write breakpoint, R to run
+ - try writing a large value:
+ Poke A6 30 1
+ - does not work, because the value saturates
+ - breakpoint on A6
+ - shift P (poke in rom): ED45 60 1 (writes an RTS)
+ this is the code that saturates the value
+ - Poke A6 30 1
+ - address DB6C is where A6 is stored after being hit by a
+ monster::
+
+ LDA A6: load the value
+ SEC: set carry
+ SBC: sub with carry
+ BCS: branch on carry set (i.e. if the result is < 0, store 0)
+
+ - 7E42 0 1 -> sets the monsters' decrement to 0
+
+System Level Virtualization: Transparent Hardware Virtualization
+================================================================
Transparent Hardware Virtualization
-----------------------------------
- Share machine resources among multiple VMs
+.. note::
+
+ - the slide describes the problem, which is the same as for
+ emulation
+
+ - maybe also give examples such as kqemu or VirtualBox (acceleration
+ modules). Also say that this still does not involve Intel VT, and
+ that it is faster than emulation
+
+ - share machine resources: example of copy-on-write memory pages
+
Full CPU Virtualization (1)
---------------------------
- Present same functional CPU to all Guest OSes
-- VMM manages a CPU context for each VM
+- VMM manages a CPU context for each vCPU of each VM
- saved copy of CPU registers
- representation of software-emulated CPU context
-- VMM shares physical CPUs among all VMs
+- VMM shares physical CPUs among the vCPUs of all VMs
- VMM includes a VM scheduler
- round-robin
- priority-based
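
The round-robin policy can be sketched as follows. The ``VCpu`` class, its ``remaining`` work counter and the one-field ``context`` dictionary are hypothetical stand-ins for a real saved register set; a real VMM would save and restore hardware state here.

.. code-block:: python

    from collections import deque

    class VCpu:
        """One virtual CPU with its saved context (hypothetical model)."""

        def __init__(self, vm, index, work):
            self.name = f"{vm}/vcpu{index}"
            self.remaining = work       # time slices of guest work left
            self.context = {"pc": 0}    # saved copy of CPU registers

    def round_robin(vcpus, timeslice=1):
        """Share one physical CPU among vCPUs; return the execution order."""
        run_queue = deque(vcpus)
        trace = []
        while run_queue:
            vcpu = run_queue.popleft()          # restore this vCPU's context
            ran = min(timeslice, vcpu.remaining)
            vcpu.context["pc"] += ran           # the guest makes progress
            vcpu.remaining -= ran
            trace.append(vcpu.name)
            if vcpu.remaining > 0:              # save context and requeue
                run_queue.append(vcpu)
        return trace

A priority-based scheduler would differ only in how the next vCPU is picked from the run queue.
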
+.. note::
+
+ - representation of software-emulated CPU context: for example,
+ knowing whether interrupts are masked or not.
+
Full CPU Virtualization (2)
---------------------------
- Run each Guest OS in non-privileged mode
.. figure:: cpu-virt.svg
+ :width: 100%
"Hardware-Sensitive" Instructions
---------------------------------
- Done once, saved in Translation Cache
- Example: VMware
+.. note::
+
+ - privileged instructions: e.g. interrupt masking
+
+ - critical instructions: e.g. reading the status flags, CR3, ...
+
Privileged Instructions Virtualization
--------------------------------------
- But no exception for popf => VMM not aware of Guest OS action
(unmask interrupts)
+.. note::
+
+ - first problem: pushf is allowed and always pushes flags saying
+ that interrupts are enabled
+ - popf must also be intercepted, because the interrupt status has
+ to be updated
+
Critical Instructions Virtualization (2)
----------------------------------------
- VMM emulates expected effect of critical instruction, if any.
+.. note::
+
+ - **PAUSE**
+ - XXX does translation only need to be done on code meant to run
+ in ring 0?
+
Full Memory Virtualization
--------------------------
- 4 GB on most 32-bit architectures (Intel x86, PowerPC)
- - Manages virtual page → physical case mappings
+ - Manages virtual page → physical page mappings
- Manages « swap » space to extend physical memory
-MMU & Virtual Address Space
----------------------------
+.. note::
+
+ - the MMU is a hardware component
+
+Reminder about MMU (1)
+----------------------
+
+- Here is a minimal code example:
-.. figure:: mmu1.svg
+ .. code-block:: gas
-Intel x86 MMU
--------------
+ # a program that takes x and y in memory, and
+ # computes the sum (AT&T syntax: source first, registers prefixed with %)
+ mov 0x200000,%eax # retrieve <x> in eax
+ mov 0x200004,%ebx # retrieve <y> in ebx
+ add %ebx,%eax # compute x+y in eax
+ mov %eax,0x200008 # save the result in memory
+
+- This program can run on one CPU
+- If the addresses are physical, it is not possible to run multiple
+ instances of this program as they would modify the same memory
+
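
A minimal sketch of how address translation removes this conflict, assuming 4 KB pages and a flat, single-level page table modeled as a dictionary (real x86 tables are multi-level). The names ``translate``, ``page_table_a`` and ``page_table_b`` and the frame numbers are hypothetical.

.. code-block:: python

    PAGE_SIZE = 4096

    def translate(page_table, vaddr):
        """Translate a virtual address using a virtual page -> frame mapping."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in page_table:
            raise LookupError(f"page fault at {vaddr:#x}")
        return page_table[vpn] * PAGE_SIZE + offset

    # Two instances of the same program use the same virtual page 0x200
    # (virtual address 0x200000), backed by different physical frames.
    page_table_a = {0x200: 0x1000}
    page_table_b = {0x200: 0x2000}

Both instances keep using address ``0x200000`` in their code, yet each one reads and writes its own physical memory.
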
+.. note::
+
+ - a bad solution is to patch the binary at each execution
+
+Reminder about MMU (2)
+----------------------
+
+.. figure:: mmu-slide1.svg
+ :width: 70%
+
+Reminder about MMU (3)
+----------------------
+
+.. figure:: mmu-slide2.svg
+ :width: 95%
+
+Reminder about MMU (4): Intel x86 MMU
+-------------------------------------
.. figure:: mmu2.svg
+ :width: 100%
Memory Virtualization (1)
-------------------------
+.. figure:: mmu-slide3.svg
+ :width: 70%
+
+Memory Virtualization (2)
+-------------------------
+
- Machine Physical Memory
- Physical memory available on the machine
- Guest OS manages virtual address spaces of its processes
-Memory Virtualization (2)
+Memory Virtualization (3)
-------------------------
- Guest OS manages Guest Physical Pages
- VMM dynamically translates Guest Physical Pages into Machine
Physical Pages
-Memory Virtualization (3)
+Memory Virtualization (4)
-------------------------
-.. figure:: mem-virt.svg
+.. figure:: mmu-slide4.svg
+ :width: 95%
+
+.. note::
+
+ - switch to a dynamic view, explain how translations are done,
+ mention the TLB
+ show things in chronological order
+ - zoom out, static view
+ virtual memory vs VM memory vs host physical memory
+ no MMU in this case
+ - show, in a dynamic view with a single MMU, how the hypervisor
+ configures the MMU
+
+ use the same colors for the memory types
+ we will hijack the MMU to perform the translation that suits us
+
+ - put a number in CR3
+ - put vertical bars in the page tables
+ - TLB drawn as a table with empty rows
+ - find -> get
+ - wider MMU
+ - zoom on the PTEs on the right
+ - show the addresses
+ - separate virtual and physical address values, use different
+ colors for these addresses
+ - see whether only the 20 significant bits can be shown, without
+ the 12 offset bits, when talking about addresses
+
+ - When the guest accesses CR3 (or a PTE), this triggers a fault
+ handled by the VMM. The VMM translates the address given by the
+ VM's OS and loads CR3 with the physical address of the area the VM
+ uses to store its page tables. Every access to this page table
+ must trigger a fault so that the VMM is notified of any change and
+ can configure the real MMU accordingly (performing the address
+ translation). (slide 47)
+
+ - Reading CR3 does not generate a trap, so it must be handled like
+ pushf and popf, i.e. with code translation.
-Memory Virtualization (4)
+Memory Virtualization (5)
-------------------------
- VMM maintains Shadow Page Tables
- Emulates operation in shadow page table
- Updates effective MMU page table entry, if needed
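
A shadow page table can be pictured as the composition of two mappings: the guest's own table (virtual page to guest physical page) and the VMM's table (guest physical page to machine page). The sketch below assumes flat dictionaries instead of real multi-level tables; the class ``ShadowMmu`` and its method names are hypothetical.

.. code-block:: python

    class ShadowMmu:
        """Keeps a shadow page table in sync with guest PTE updates (sketch)."""

        def __init__(self, guest_to_machine):
            self.guest_to_machine = guest_to_machine  # maintained by the VMM
            self.guest_pt = {}    # what the guest believes: vpn -> guest page
            self.shadow_pt = {}   # what the real MMU uses: vpn -> machine page

        def guest_write_pte(self, vpn, guest_page):
            """Called by the VMM when a guest write to its page table faults."""
            self.guest_pt[vpn] = guest_page                           # emulate the write
            self.shadow_pt[vpn] = self.guest_to_machine[guest_page]   # update real entry

        def guest_clear_pte(self, vpn):
            """Guest unmaps a page: drop both the guest and the shadow entry."""
            self.guest_pt.pop(vpn, None)
            self.shadow_pt.pop(vpn, None)

The guest only ever reads back the values it wrote in ``guest_pt``, while the hardware walks ``shadow_pt``.
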
-Memory Virtualization (5)
+Memory Virtualization (6)
-------------------------
- PTE entries can be tagged with a context ID
- VMM must flush TLB when switching VMs
-Memory Virtualization (6)
+Memory Virtualization (7)
-------------------------
- VMM must respect Guest OS virtual page faults
- Pages with the same content (e.g. zeroed pages)
-Memory Virtualization (7)
+Memory Virtualization (8)
-------------------------
- VMM can swap real pages of a VM
- no longer available for normal kernel allocation service
- VMM assigns the same amount of physical pages to other VMs
-Paravirtualization
-==================
-
-Paravirtualization (1)
-----------------------
+.. note::
-- OS adaptation to avoid binary translation overhead
-- Requires access to OS source code
-- Include drivers of virtual devices
-- Examples:
+ - ballooning: a kernel module runs in the guests and communicates
+ with the VMM. If the VMM needs physical memory for another VM, it
+ can ask the module to allocate memory, which is then lost to the
+ other services. This memory is "given back" to the VMM.
+ - needs more details and sources
- - Xen
- - User Mode Linux (UML)
+System Level Virtualization: Paravirtualization
+===============================================
-Paravirtualization (2)
+CPU Paravirtualization
----------------------
-- Still run each Guest OS in non-privileged mode
+- Still run each Guest OS in non-privileged mode, but with minimal
+ virtualization overhead
-- But with minimal virtualization overhead
-
-- => Modified Guest OS kernel
-
- - Remove Hardware-Sensitive Instructions
-
- - Use fast VMM system calls instead, if needed
+- OS adaptation to avoid binary translation overhead
- - Minimise usage of Privileged Instructions
+ - Remove Hardware-Sensitive Instructions, use fast VMM system calls
+ - Minimize/avoid usage of Privileged Instructions
- Only affects the Machine/CPU dependent part of the OS
-- OS portage on new architecture with same CPU
+- OS port to a new architecture with same CPU, without system ISA
- - Without system ISA
+- Examples: Xen legacy, User Mode Linux (UML), CoLinux
-Paravirtualization (3)
+I/O Paravirtualization
----------------------
-- Guest OS only use Virtual I/O Devices, in a cooperative way
+- Multiplexing VMM physical devices among VMs
- Front-end driver in Guest OS
- Back-end driver in VMM
+ - Virtual ethernet, virtual disks
-- VMM multiplex VM Virtual Devices on physical devices
+- Fast virtual devices for VM to VM communications
- - Virtual Ethernet
- - Virtual Disks
+ - Example: vmxnet3
-- Data transfer through I/O rings
+- Data transfer through syscalls, shmem, rings, ...
+- Pros: scalability, VM migration
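
The ring-based transfer between a front-end and a back-end driver can be sketched as a single-producer, single-consumer ring with free-running indices. The class ``Ring`` and its field names are hypothetical; real rings live in memory shared between the guest and the VMM and need memory barriers and notifications, which are left out here.

.. code-block:: python

    class Ring:
        """Fixed-size ring shared by one producer and one consumer (sketch)."""

        def __init__(self, size=4):
            assert size & (size - 1) == 0, "size must be a power of two"
            self.slots = [None] * size
            self.mask = size - 1
            self.prod = 0   # only written by the front-end (producer)
            self.cons = 0   # only written by the back-end (consumer)

        def push(self, request):
            """Front-end side: post a request, or return False if full."""
            if self.prod - self.cons == len(self.slots):
                return False
            self.slots[self.prod & self.mask] = request
            self.prod += 1
            return True

        def pop(self):
            """Back-end side: fetch the oldest request, or None if empty."""
            if self.cons == self.prod:
                return None
            request = self.slots[self.cons & self.mask]
            self.cons += 1
            return request

Because each index has a single writer, the two sides never need to lock each other out, which is one reason rings are a common choice for this kind of transfer.
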
Virtual I/O Devices
-------------------
.. figure:: virt-devices.svg
+ :width: 100%
-Paravirtualization Example: Xen
--------------------------------
+Paravirtualization Example: Xen Legacy
+--------------------------------------
- Objectives
- - Scalable, support more than 100 VM
+ - Scalable
- Share resources of Server machines
- Intel IA-32, x86-64, ARM, ...
- Have access (and manages) all physical devices
- Modified version of Linux, FreeBSD
-Hardware-Assisted Virtualization
-================================
+.. note::
+
+ XXX double-check the domain 0 part
+
+System Level Virtualization: Hardware-Assisted Virtualization
+=============================================================
Hardware Assisted Virtualization (1)
------------------------------------
Hardware Assisted Virtualization (3)
------------------------------------
-- DMA virtualization
+- Directed I/O virtualization
- IO-MMU (Intel VT-d)
--------------------------------
.. figure:: vt-x.svg
+ :width: 100%
Intel VT-x CPU Virtualization (1)
---------------------------------
- VM entries & VM exits use a new data structure
- - Virtual Machine Control Structure (VMCS) per VM
+ - Virtual Machine Control Structure (VMCS) per VM CPU (vCPU)
- Referenced with a memory physical address
- Format and layout hidden
- New VT-x instructions to access a VMCS
- Guest State Area
- Saved value of registers before being changed by
- - VM Exits (e.g., Segment Registers, CR3, IDTR)
+ VM Exits (e.g. Segment Registers, CR3, IDTR)
-- Hidden CPU state (e.g., CPU Interruptibility State)
+ - Hidden CPU state (e.g., CPU Interruptibility State)
- Host State Area
- - VM Control Fields
+- VM Control Fields
+
- Interrupt Virtualization
- Exceptions bitmaps
- I/O bitmaps
- Model Specific Register R/W bitmaps
- Execution rights for CPU Privileged Instructions
+.. note::
+
+ - the host state area is where the VMM's processor state is stored.
+ It is restored on VM exit.
+
+ - Switching from root mode to non-root mode is called "VM entry", the
+ switch back is "VM exit". The VMCS includes a guest and host state
+ area which is saved/restored at VM entry and exit. Most importantly,
+ the VMCS controls which guest operations will cause VM exits.
+
+ The VMCS provides fairly fine-grained control over what the guests
+ can and can't do. For example, a hypervisor can allow a guest to
+ write certain bits in shadowed control registers, but not
+ others. This enables efficient virtualization in cases where guests
+ can be allowed to write control bits without disrupting the
+ hypervisor, while preventing them from altering control bits over
+ which the hypervisor needs to retain full control. The VMCS also
+ provides control over interrupt delivery and exceptions.
+
+ Whenever an instruction or event causes a VM exit, the VMCS contains
+ information about the exit reason, often with accompanying
+ detail. For example, if a write to the CR0 register causes an exit,
+ the offending instruction is recorded, along with the fact that a
+ write access to a control register caused the exit, and information
+ about source and destination register. Thus the hypervisor can
+ efficiently handle the condition without needing advanced techniques
+ such as CSAM and PATM described above.
+
+ VT-x inherently avoids several of the problems which software
+ virtualization faces. The guest has its own completely separate
+ address space not shared with the hypervisor, which eliminates
+ potential clashes. Additionally, guest OS kernel code runs at
+ privilege ring 0 in VMX non-root mode, obviating the problems caused by
+ running ring 0 code at less privileged levels. For example, the
+ SYSENTER instruction can transition to ring 0 without causing
+ problems. Naturally, even at ring 0 in VMX non-root mode, any I/O
+ access by guest code still causes a VM exit, allowing for device
+ emulation.
+
+ - The entire visible processor state is saved to or restored from
+ memory:
+
+ - all registers, including the control registers
+ - interruptibility state
+
+ - The VMCS defines what the VM is allowed to do
+
+ - I/O bitmaps = bit field that says which I/O ports (in and out
+ instructions) are allowed.
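
The I/O bitmap idea can be sketched in a few lines. This is only an illustrative model, not the VMCS layout: the class name ``IoBitmap`` and its methods are hypothetical, and it assumes the VT-x convention that a set bit means "access to this port causes a VM exit".

```python
# Illustrative model of an I/O bitmap: one bit per x86 I/O port.
# Assumption: bit set => the guest access traps (VM exit) to the VMM.
IO_PORTS = 0x10000  # 65536 ports in the x86 I/O address space


class IoBitmap:
    """Hypothetical helper, not a real hypervisor API."""

    def __init__(self, trap_all=True):
        fill = 0xFF if trap_all else 0x00
        self.bits = bytearray([fill]) * (IO_PORTS // 8)

    def allow(self, port):
        """Let the guest access this port directly (clear the bit)."""
        self.bits[port // 8] &= ~(1 << (port % 8)) & 0xFF

    def trap(self, port):
        """Make accesses to this port cause a VM exit (set the bit)."""
        self.bits[port // 8] |= 1 << (port % 8)

    def causes_vm_exit(self, port):
        return bool(self.bits[port // 8] & (1 << (port % 8)))


bitmap = IoBitmap(trap_all=True)
bitmap.allow(0x3F8)  # e.g. hand one serial port register to the guest
```

Starting from "trap everything" and opening individual ports keeps the default safe: a port the VMM forgot about is emulated rather than exposed.
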
Intel VT-x Interrupt Virtualization
-----------------------------------
- Used by VMM to control VM interrupts
+.. note::
+
+ - the window makes it possible to delay the hardware interrupt (and
+ therefore the VM exit) until the guest has unmasked its interrupts.
+
+ - VT-x also includes an interrupt-window exiting VM-execution
+ control. When this control is set to 1, a VM exit occurs whenever
+ guest software is ready to receive interrupts. A VMM can set this
+ control when it has a virtual interrupt to deliver to a
+ guest. Similarly, VT-i includes a PAL service that a VMM can use to
+ register that it has a virtual interrupt pending. When guest
+ software is ready to receive such an interrupt, the service
+ transfers control to the VMM via the new virtual external interrupt
+ vector.
+
Intel VT-x MMU Virtualization
-----------------------------
-----------------------------
.. figure:: vt-x-mem.svg
+ :width: 100%
Intel VT-x Extended Page Tables (1)
-----------------------------------
-----------------------------------
.. figure:: vt-x-mmu.svg
+ :width: 100%
+
+.. note::
+
+ - the TLB caches both translations, VA->GPA and GPA->MPA
+
+ - There is only one downside: nested paging or EPT makes the virtual
+ to real physical address translation a lot more complex if the TLB
+ does not have the right entry. For each step we take in the blue
+ area, we need to do all the steps in the orange area. Thus, four
+ table searches in the "native situation" have become 16 searches
+ (for each of the four blue steps, four orange steps).
+
+ http://www.anandtech.com/show/2480/10
TLB Flush Issue
---------------
.. figure:: tlb-flush-issue.svg
+ :width: 100%
+
+.. note::
+
+ - 2 processes in different VMs can use the same virtual
+ address
Intel VT-x Virtual Processor Identifier
---------------------------------------
- VPID loaded from VMCS on VM Enter
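
A minimal sketch of why tagging helps, assuming a TLB modelled as a dictionary keyed by ``(vpid, virtual page)``. The class ``TaggedTlb`` and its methods are hypothetical names for illustration, not a hardware interface.

```python
class TaggedTlb:
    """Toy TLB whose entries are tagged with a VPID (illustrative only)."""

    def __init__(self):
        self.entries = {}  # (vpid, virtual_page) -> machine_page

    def insert(self, vpid, virtual_page, machine_page):
        self.entries[(vpid, virtual_page)] = machine_page

    def lookup(self, vpid, virtual_page):
        """Return the cached machine page, or None on a TLB miss."""
        return self.entries.get((vpid, virtual_page))

    def flush_vpid(self, vpid):
        """Drop only the entries that belong to one virtual processor."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vpid}


tlb = TaggedTlb()
tlb.insert(vpid=1, virtual_page=0x400, machine_page=0x9000)
tlb.insert(vpid=2, virtual_page=0x400, machine_page=0x5000)
```

Both VMs use virtual page ``0x400``, yet the entries do not collide, so a switch between the two VMs does not need a full TLB flush.
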
-DMA Virtualization (1)
-----------------------
+.. note::
-- Enable Guest OS to manage I/O devices
+ - run the demo of Windows in a KVM guest: browsing the Device
+ Manager shows that the devices are not at all the ones found in
+ my PC. It also makes a good transition to DMA
+ virtualization.
- - I/O devices assigned by VMM to Guest OSes
+.. Intel Virtualization Technology for Directed I/O
+ ================================================
-- Transparent mode
+Intel VT-d Principles
+---------------------
- - Use native device driver of Guest OS
- - Unaware of physical memory Virtualization
+- Enable Guest OS to directly manage physical I/O devices
-- Enforce isolation between Guest Oses
+ - Guest I/O operations bypass VMM
- - Guest OS only view hardware ressources assigned by VMM (memory,
- devices)
+- In full transparent mode
-DMA Principles
---------------
+ - Use native device drivers of Guest OS
+ - Guest OS unaware of underlying physical memory virtualization
-.. figure:: dma.svg
+- Enforce isolation between Guest VMs
-DMA Virtualization (2)
-----------------------
+ - Guest OS can only access I/O resources (ports, PCI devices) assigned to it
+ - PCI I/O device can only perform DMA to machine physical pages assigned to
+ the Guest VM owning that device.
+
+Intel Directed IO
+-----------------
.. figure:: dma-virt.svg
+ :width: 100%
+
+DMA Principles
+--------------
+
+.. figure:: dma.svg
+ :width: 100%
DMA Virtualization Issue
------------------------
- Guest Physical Address must be translated into its corresponding
Machine Physical Address when used for DMA operations by device
-- GPA Translation cannot be done by VMM
+- GPA -> MPA translation cannot be done by VMM
- VMM cannot catch device-specific driver operations to set up I/O
buffer addresses
+- GPA -> MPA translation done by IOMMU on the Bus Controller
+
Intel VT-d Protection Domains
-----------------------------
-- Intel VT-d provides DMA Protection Domains
+- Intel VT-d provides DMA Protection Domain
- Extension of IOMMU translation mechanism
- - Isolated context of a subset of the Machine Physical Memory (MPA)
- - Correspond to the portion of Machine Physical Memory allocated to
+ - Isolated context of a subset of the Machine Physical Memory
+ - Corresponds to the portion of Machine Physical Memory allocated to
a VM
-- I/O devices assigned by VMM to a DMA Protection Domain
+- I/O devices associated by VMM with a DMA Protection Domain
- - Achieves DMA isolation by restricting memory view of I/O devices
+ - Achieves DMA isolation by restricting memory access of I/O devices
through DMA address translation
Intel VT-d DMA Translation
-----------------------------
.. figure:: vt-d.svg
+ :width: 100%
PCI DMA Requester Identification
--------------------------------
- 16-bit PCI DMA Requester Identifier
.. figure:: dma-req-id.svg
+ :width: 80%
- Assigned by PCI configuration software
- Bus # indexes Bus Context Table in Root Context Table
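
As a sketch of the identifier layout, the 16-bit requester ID can be split into the usual PCI bus (8 bits), device (5 bits) and function (3 bits) fields. The function names below are hypothetical; the second helper assumes that the bus number indexes the root table and that the combined device/function byte indexes the per-bus context table.

```python
def decode_requester_id(rid):
    """Split a 16-bit PCI requester ID into (bus, device, function)."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("requester ID must fit in 16 bits")
    bus = rid >> 8               # bits 15..8
    device = (rid >> 3) & 0x1F   # bits 7..3
    function = rid & 0x7         # bits 2..0
    return bus, device, function


def context_indexes(rid):
    """Return (root table index, context table index) for a requester ID.

    Assumption: root table indexed by bus, context table by devfn.
    """
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("requester ID must fit in 16 bits")
    return rid >> 8, rid & 0xFF
```
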
----------------------------------
.. figure:: device-domain-mapping.svg
+ :width: 100%
Virtual DMA Address Translation
-------------------------------
-- VDA ↔ MPA VT-d Page Tables similar to IA-32 processor Page Tables
+- DVA ↔ MPA Page Tables similar to IA-32 processor Page Tables
- 4KB or larger page size granularity
- Initialized at VM creation time
- With same translations of the VM Extended Page Table
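
The translation step can be sketched as follows. It is a simplified model under stated assumptions: a single-level table stored in a dictionary, 4KB pages only, and hypothetical names (``ProtectionDomain``, ``DmaFault``). Real VT-d page tables are multi-level structures in memory.

```python
PAGE_SHIFT = 12               # 4KB pages
PAGE_SIZE = 1 << PAGE_SHIFT


class DmaFault(Exception):
    """Raised when a device targets an address outside its domain."""


class ProtectionDomain:
    """Toy DVA -> MPA page table for one Guest VM (illustrative only)."""

    def __init__(self):
        self.page_table = {}  # DVA page number -> MPA page number

    def map_page(self, dva, mpa):
        self.page_table[dva >> PAGE_SHIFT] = mpa >> PAGE_SHIFT

    def translate(self, dva):
        """Translate a device virtual address, keeping the page offset."""
        try:
            frame = self.page_table[dva >> PAGE_SHIFT]
        except KeyError:
            raise DmaFault(hex(dva)) from None
        return (frame << PAGE_SHIFT) | (dva & (PAGE_SIZE - 1))


domain = ProtectionDomain()
domain.map_page(dva=0x1000, mpa=0x7F000)
```

A DMA request to an unmapped page raises ``DmaFault`` instead of reaching memory, which is the isolation property the protection domain is meant to provide.
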
-Device Virtualization
----------------------
+VMM and Directed I/O
+--------------------
+
+- Unplugs assigned PCI device from VMM driver and resets it
+
+- Associates PCI device with VT-d Protection Domain of the Guest VM
+
+- Maps device memory BARs in Guest VM physical space
+
+- Arranges for OS of Guest VM to probe PCI device(s) assigned to it
-- Share I/O device among multiple VMs
+- Handles device interrupts and redirects them to Guest VM
+
+- Resets assigned PCI device upon Guest VM shutdown
+
+.. Device Virtualization
+ =====================
+
+Device Virtualization Principles
+--------------------------------
+
+- Share I/O device among multiple Guest VMs
- With no performance loss
- While enforcing VM isolation and protection
- Move device virtualization from the VMM to the device itself
-- Requires support from the device
+- PCIe extension
-- Example of Ethernet controllers
+- PF/VF requires support from the device
-Ethernet Device Virtualization
-------------------------------
+Ethernet Device Virtual Functions
+---------------------------------
.. figure:: ethernet-dev-virt.svg
-Intel Single Root I/O Virtualization
-------------------------------------
+Single Root I/O Virtualization
+------------------------------
- SR-IOV capable PCI Device can be partitioned into multiple Virtual
Functions
- SR-IOV Device appears in PCI configuration space as multiple PCI
Virtual Functions
-- Each Device Virtual Function includes
+- Virtual Functions are "lightweight" PCI functions including
- - PCI configuration registers
+ - PCI probing capabilities
- DMA streams
- Interrupts
- Requires VT-d for DMA virtualization
-Intel SR-IOV (1)
-----------------
-
-- VMM manages physical PCI device
-
-- Create a PCI Virtual Function for each VM
-
- - Include it into VM PCI configuration space to be probed by VM
- GuestOS kernel
- - Map it to Protection Domain of VM
-
-- Programs the sharing of physical devices ressources between VFs
-
-- PCI Device Virtual Functions directly managed by specific VF-Aware
- GuestOS drivers (kind of Para-Virtualization)
-
-Intel SR-IOV (2)
-----------------
-
-XXX
-
-Intel SR-IOV - Ethernet example
--------------------------------
-
-- Intel Kawela (1GB) / Niantic (10GB) Ethernet NICs
-
- - Multiple RX/TX packet queues per port
-
-- Virtual Device Machine Queues
-
- - 1 RX paquet queue per VF
-
-- Filters multiple unicast Ethernet Addresses
-
-- Layer-2 paquet filtering based on Ethernet Destination Address
-
-- Duplicate Broadcast / Multicast packets for all VFs
-
-- Load balancing between TX paquets sent by VFs
+- Virtual Functions have no configuration resources
-Virtualization and Embedded Systems
-===================================
-
-Old Embedded Systems (1)
-------------------------
-
-- Relatively simple architecture
-
-- Single-purpose devices
-
-- Dominated by hardware constraints
-
- - Memory, battery charge
-
-- Dedicated functionalities, with moderated software size and
- complexity
-
-- Real-time constraints
-
-Old Embedded Systems (2)
+SR-IOV Device Management
------------------------
-- Closed environment (« black boxes »)
-
-- Fixed hardware configuration
+- VMM manages the physical PCI device
-- Full software provided by device vendor
+- VMM creates a PCI Virtual Function for each VM
-- No dynamic loading of applications
+ - Includes it into VM PCI configuration space to be probed by OS kernel
+ of Guest VM
+ - Associates VF with VT-d Protection Domain of the Guest VM
-- Software updates rareful
+- VMM programs the sharing of physical device resources between VFs
-Embedded Systems Now (1)
-------------------------
-
-- Take on features of general-purpose OS's
-
-- Growing functionalities => growing complexity and size
-
-- Run applications originally developed for PC's
-
- - Sophisticated Human Machine Interfaces (HMI)
- - Safari Web browser on iPhones
-
-- Dynamic loading of applications
-
- - Iphone
- - Google Android
-
-Embedded Systems Now (2)
-------------------------
+- Virtual Functions managed by specific VF-aware drivers in kernel of
+ Guest OS (kind of Para-Virtualization)
-- Dynamically load device's owner specific applications
-
- - Games
-
-- Applications developped by engineers with no expertise
- in embedded systems
-
- - Java applications
-
-- Need for exchanges with external world
-
- - USB, Bluetooth, Wi-Fi
- - TCP/IP
-
-- Need for open API's, and openness in general
-
-- Need for high-level systems (Linux, Windows)
-
-Embedded Systems Challenges
----------------------------
-
-- Still Real-Time systems (part of it)
-
- - Baseband stack of mobile phones
-
-- Still hardware constraints
-
- - Battery
- - Memory (to minimize device's cost)
-
-- Also used in mission/life critical situations
-
- - Weapons
- - Cars
-
-- High requirements on reliability and security
-
-Mobile Handsets
----------------
-
-XXX
-
-- Run Android/Linux applications on baseband processor
-
-- Re-use existing legacy modem software stack with its RTOS (no
- changes)
-
-- Support of Linux at a minimal development cost
-
-- Operating System independence for future evolutions
-
-- Security & Protection through OS isolation
-
-::
-
- HMI: Human-Machine-Interface
- PIM: Personal Information
-
-Virtualization in Embedded Systems (1)
---------------------------------------
-
-- Support for heterogeneous OS's environments
-
-- Real-time OS
-
- - Legacy software
- - Dedicated applications whose real-time constraints cannot be
- achieved by General-Purpose systems
- - Licence issues (« GPL contamination »)
-
-- General Purpose OS
-
- - Openness
- - HMI
-
-Virtualization in Embedded Systems (2)
---------------------------------------
-
-- Concurrent execution of RTOS and GP-OS on the same CPU
-
-- Reduces cost (Bill Of Material)
-
-- Requires the underlying VMM to provide
-
- - Memory isolation between OS's
- - CPU scheduling among OS's, with higher priority to the RTOS
- - Device partitionning
- - Communication mechanism between OS's
-
-Virtualization in Embedded Systems (3)
---------------------------------------
-
-- Leverage multi-cores support with virtual machine abstraction
-
-- 1 core per OS => no need for CPU scheduling
-
-- 2 low-performance cores consume less power than a single high
- performance CPU => simplify power management
-
-- New model of software distribution, shipping application with its own OS
-
- - No OS configuration/version incoherency
-
-Security Through Virtualization
--------------------------------
-
-- Notion of Trusted Computing Base (TCB)
-
- - Part of the system that provides security foundations
- - Should only include hardware and VMM
- - May also include RTOS, for performance/legacy reasons
-
-- Run GP OS in an isolated Virtual Machine
-
- - Avoid damaged GP OS to compromise the secure parts (data,
- services) of the system
-
-Embedded + Virtualization Challenges (1)
-----------------------------------------
-
-- Full isolation of VM's does not fit cooperation requirements between OS's
-
-- Efficient communication mechanisms between VM's
-
-- Global scheduling, with interleaved priorities
-
-- Global Energy Management
-
-Embedded + Virtualization Challenges (2)
-----------------------------------------
-
-- Efficient communication mechanisms between VM's
-
- - Virtual Ethernet device not adapted
- - Need VMM-controlled shared memory transfers
-
-- Example: Video streaming on a Smartphone
-
- - Video data received via the baseband managed by RTOS
- - Video data displayed by a Media Player running on GPOS
- - Avoid copy of video data transfered between the 2 OS's !
-
-Task Scheduling Issues
-----------------------
+Intel Niantic Virtual Functions (1)
+-----------------------------------
-- Standard server-oriented Virtualization model
+.. figure:: eth-sr-iov.svg
+ :width: 80%
- - The VMM schedules VM's on the CPU
- - The OS on each VM runs its own scheduler
+Intel Niantic Virtual Functions (2)
+-----------------------------------
-- Interleaved priorities in Embedded Systems
+- Virtual Devices on Intel Niantic (10 Gb/s) NICs
- - Baseband task of RTOS with a high priority
- - But GPOS Media-Player must have a higher priority than some
- low-priority tasks of RTOS
- - Enable a VM to yield the CPU
+- Layer-2 packet filtering based on destination MAC address
- - Use a RT task as a proxy of GP OS application, and make it yield
- the CPU
+- Filters multiple unicast MAC addresses / VLAN identifiers
-Multi-Users Devices
--------------------
+- Can duplicate Broadcast / Multicast packets for all VFs
-- Mobile phone has 3 types of users, each with specific private data
- to protect from the others
+- Multiple RX queues per VF (RSS)
- - The person owning the device, with address book, emails,
- documents, etc.
- - Different wireless providers, for example private and
- professionnal: network access properly authenticated, ensure
- correct billing !
- - Third-party service providers, for instance multimedia providers.
+- Load balancing between TX packets sent by VFs
-- Owner and third-parties must be granted secure financial
- transactions
+- Anti-Spoofing mechanism on transmission
-Virtualization in Hardware
---------------------------
+ - Source MAC address
+ - VLAN identifier
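
The receive filtering and transmit anti-spoofing rules above can be sketched as a small model. Everything here is hypothetical (the ``VfSwitch`` class, its methods, the MAC addresses). VLAN filtering is left out for brevity, and the actual NIC implements these checks in hardware.

```python
def is_multicast(mac):
    """Broadcast and multicast MACs have the low bit of the first octet set."""
    return bool(int(mac.split(":")[0], 16) & 1)


class VfSwitch:
    """Toy layer-2 switch between Virtual Functions (illustrative only)."""

    def __init__(self):
        self.mac_to_vf = {}  # unicast MAC address -> VF index

    def assign(self, vf, mac):
        self.mac_to_vf[mac.lower()] = vf

    def receive(self, dst_mac):
        """Return the VFs that get a copy of an incoming frame."""
        dst = dst_mac.lower()
        if is_multicast(dst):
            return sorted(set(self.mac_to_vf.values()))  # duplicate to all
        vf = self.mac_to_vf.get(dst)
        return [] if vf is None else [vf]

    def may_transmit(self, vf, src_mac):
        """Anti-spoofing: a VF may only send with a MAC assigned to it."""
        return self.mac_to_vf.get(src_mac.lower()) == vf


switch = VfSwitch()
switch.assign(0, "02:00:00:00:00:01")
switch.assign(1, "02:00:00:00:00:02")
```
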
-- Only way to build a real TCB
+Pros/Cons of I/O hardware virtualization
+----------------------------------------
- - Without penalizing performances
+- Improves I/O performance on physical devices directly managed by Guest VMs
-- Should include support for
+- Only useful in specific configurations
- - Memory Partitionning
- - Physical Memory / Machine Memory mapping
- - Coupled with multi-cores
- - Device Partitioning
+- PCI device Virtual Functions intended to scale, but require locking
+ of total VM physical memory into machine physical memory
- - Interrupt routing
- - I/O DMA coupled with memory partitioning & Physical Memory /
- Machine Memory mapping
+- Not compatible with transparent VM migration
Conclusion / Evolution of Virtualization
========================================
- Accelerated emulation : faster, code is executed natively, overhead
for privileged actions
- Virtual servers : fast and scalable, but same OS and one kernel
-- Paravirtualization : fast, needs a modified OS
+- Paravirtualization : fast, needs a modified OS (or drivers)
- HW-assisted virtualization : solves most of the issues
+.. note::
+
+ - "needs a modified OS" is not true for devices
+
Evolutions of Virtualization
----------------------------
- Virtualization on desktops and small devices
- Security (isolates work and personal area)
+
+Thanks
+------
+
+- Any questions?