2 .. System Virtualization and OS Virtual Machines slides file, created by
3 hieroglyph-quickstart on Mon Oct 28 09:39:30 2013.
5 =============================================
6 System Virtualization and OS Virtual Machines
7 =============================================
10 :Authors: Ivan Boule, Olivier Matz
18 - History of Virtualization
19 - Virtualization Usage and Taxonomy
20 - Process Level Virtualization
25 - System Level Virtualization
27 - Transparent Hardware Emulation
28 - Transparent Hardware Virtualization
30 - Hardware-Assisted Virtualization
37 - Olivier MATZ ``<olivier.matz@6wind.com>``
- Software engineer at 6WIND for 10 years
39 - 6WIND is a software company designing high performance network
42 - http://www.6wind.com
- I mainly develop low-level code: Linux kernel, drivers and
  applications close to the operating system
47 History of Virtual Machines
48 ===========================
Seventies: introduction of IBM/370 series
-----------------------------------------
53 - Generalization of virtual memory
- Microprogramming of some instructions on small models
57 .. figure:: ibm370.jpg
- IBM/370: generalization of virtual memory
- IBM/370: microprogramming of some instructions on the small models
- IBM/370: the CP/CMS hypervisor (Control Program/Conversational
  Monitoring System) managed virtual machines under which CMS, DOS
  and OS could run interchangeably. Offered to customers for the
  duration of their migration from DOS to OS, it was often kept for
  the great user-friendliness of CMS used as a time-sharing system.
- The VM/370 product, created by IBM in the 1970s, let several users
  time-share a computer running the IBM DOS operating system. IBM DOS
  on its own did not offer time sharing.
- time sharing between VMs
82 - Many logical machines in one physical machine
83 - High level (virtual) ISA including I/Os (TIMI)
85 - Take advantage of advances in hardware without recompilation
86 - User binaries contain both TIMI instructions and machine instructions
87 - Easier transition to PowerPC
89 .. figure:: ibm-as400.jpg
- IBM/AS-400: a minicomputer in the IBM range, late 1980s
- IBM/AS-400: several logical machines can be "carved out" of one
  physical machine.
- IBM/AS-400: a program does not talk directly to the hardware; it
  uses a high-level instruction set (ISA), which makes the program
  independent of the CPU it runs on. This eased the transition to
  PowerPC.
- http://en.wikipedia.org/wiki/IBM_System_i
- XXX IBM/AS-400: why "co-designed VM"? XXX search the Internet
- emulation of "high-level" CPU instructions
  XXX check how it works: is it a hypervisor or
107 Nineties and later: application VMs
108 -----------------------------------
- a Java program is compiled into portable bytecode
- the JVM is an abstract computer that is able to run this bytecode
118 - Microsoft Common Language Infrastructure (.Net)
122 - http://en.wikipedia.org/wiki/Java_virtual_machine
124 Now: OS virtual machines
125 ------------------------
- Run an operating system on top of a virtual machine
130 - VMware products (virtualized PC on x86)
132 - Virtual PC (PC emulation on Mac OS/PowerPC)
- Many others: Bochs, VirtualBox, QEMU, ...
135 Virtualization Usages
136 =====================
138 System Virtualization Principles
139 --------------------------------
141 - Run multiple OS's on the same machine
- By design, an OS assumes it has full control over all physical
  resources of the machine
146 - Manage sharing/partitioning of machine resources between Guest OS's
149 - Physical memory & MMU
152 Goals of System Virtualization
153 ------------------------------
155 - Reduction of Total Cost of Ownership (TCO)
157 - Increase utilisation of server resources
158 - Spawn new servers "on demand" (ex: Amazon EC2 and Elastic Load
161 - Reduction of Total Cost of Functioning
167 - Isolation of OS for security purposes (Qubes, Cells)
- TCO + TCF reduction: discuss the data center case. We can talk
  about live migration, elasticity, ...
- a customer can create virtual machines on demand
- Elastic Load Balancer: ELBs distribute the load across EC2
  instances
- Autoscaling: automatically manages elasticity for one or more
  groups of EC2 instances
- CloudWatch: tracks and monitors metrics of EC2 instances in order
  to send notifications or take
184 - "qubes" (security) http://qubes-os.org/trac/wiki/QubesScreenshots
186 - Based on a secure bare-metal hypervisor (Xen)
187 - Networking code sand-boxed in an unprivileged VM (using IOMMU/VT-d)
188 - USB stacks and drivers sand-boxed in an unprivileged VM (currently
189 experimental feature)
190 - No networking code in the privileged domain (dom0)
191 - All user applications run in “AppVMs”, lightweight VMs based on
193 - Centralized updates of all AppVMs based on the same template
- Qubes GUI virtualization presents applications as if they were
196 - Qubes GUI provides isolation between apps sharing the same desktop
- Secure system boot (optional)
Virtualization in high-throughput network equipment
---------------------------------------------------
202 .. figure:: high-thput.svg
- Initially, the system runs on several older boards (plus the
  management board running Linux). When upgrading the hardware, if
  the new board is powerful enough, the old boards can be
  virtualized on it without modifying the
  data plane + control plane -> on a single board
- Recap what was said on the previous slide
217 Usages of Virtual Machines
218 --------------------------
220 - Server virtualization
224 - OS/kernel education & training
227 - OS kernel development
228 - Test machine = development host
230 - Keep backward compatibility of legacy software
232 - Run applications not supported by host OS
- OS migration without reinstalling it on new hardware
- time sharing: we want to use several OSes on the same machine;
  analogy with several processes.
- education & training: consider a lab session, as presented in the
  Linux Mag 140 article on libvirt: each student works on a
  preconfigured virtual machine XXX to be reviewed
- backward compatibility: mention that this is useful when the
  hardware is no longer available, for example.
- run apps not supported by host OS: Wine
- Some services are only available at OS level (routing, filtering,
  ...). Having several OSes makes it possible to duplicate them
  (e.g. TCP daisy chain with VRs)
257 - Another example: one backup server to replace any machine
259 .. figure:: recovery.png
- Virtualization provides high availability cheaply. Often it is the
  software that crashes. A whole network architecture can be
  duplicated:
- A single backup server on the right for all the other servers.
  This avoids having 8 machines. If one of the 4 fails, the one on
  the right takes over.
- Indeed, each machine has its own system/network/filtering
  configuration. It is not necessarily easy to put the 4 services
  on the same machine without virtualization.
280 Multi-Core CPU Issues (1)
281 -------------------------
- No longer achieved through frequency/speed increases
286 - But obtained with higher density & multi-core chips
288 - Many RTOS designed with mono-processor assumption
290 - Adding multi-processor support is complex & costly
291 - Scaling requires time, at best...
293 - Legacy RT applications also designed for mono-processor
- Adaptation to multi-processor even more difficult than for RTOS
- case of applications that are multi-threaded but were designed
  assuming a single-core machine. System virtualization makes it
  possible to run these applications in parallel on multi-core
  physical machines (each VM being single-core), as explained on the
  next slide.
- Many applications are still single-processor. This drastically
  simplifies coding: no race conditions, no need for locks/mutexes.
  XXX
- this problem is less acute on a mainstream system than on older
  systems or real-time systems. Modern mainstream systems support
  multi-core very well, and it would be enough to launch several
  applications simultaneously. XXX
- some multi-threaded RT applications rely on there being a single
  CPU, and on two threads never being executed truly concurrently
317 Multi-Core CPU Issues (2)
318 -------------------------
- OS virtualization allows multiple instances of mono-processor OSes
  to run simultaneously on a multi-core CPU
323 - Each OS instance is run in a mono-processor
325 - Virtual Machine assigned to a single CPU core
327 - No need to change legacy software
329 - Scalability managed at virtualization level
- System virtualization makes it possible to run several instances
  of a non-SMP operating system on a multi-core processor.
- This can avoid rewriting software designed for a single-core
  machine. The software in question here is typically RT software or
  a kernel, because for a standard application the problem does not
  arise.
341 Virtualization Taxonomy
342 =======================
taxonomy = inventory
351 .. figure:: isa-abi.svg
354 - ISA = Instruction Set Architecture
356 - System level interface
357 - All CPU instructions, memory architecture, I/O
359 - ABI = Application Binary Interface
361 - Process level interface
- User-level non-privileged ISA instructions + OS system calls
- ISA: Instruction Set Architecture

  The CPU instructions (give examples such as MOV, or CLI/STI to
  mask interrupts), the peripherals, the MMU (how it must be
  configured), ...

  This is the interface used by the operating system.

- ABI: Application Binary Interface

  This is the interface that lets a process communicate with the
  outside world. It mainly consists of system calls (read, write,
  gettimeofday, execve, sleep).

  The ABI contains the non-privileged instructions + the OS API.
  Other instructions such as cli/sti are not part of the ABI.

- example: the compatibility layer for a 32-bit application running
  on a 64-bit kernel.
386 Virtualization Taxonomy
387 -----------------------
389 - Virtualization at process level (ABI)
391 - Emulation of Operating System ABI
394 - Virtualization at system level (ISA)
396 - Standalone vs Hosted Virtualization
397 - Machine Emulation vs Machine Virtualization
- a process already runs in a virtual machine provided by the OS,
  but not at the same level. Historically, the goal of a
  multitasking operating system is to provide virtual machines for
  applications (and therefore users). Each application "thinks" it
  is alone on the processor.

  Each application can access resources through system calls, as if
  it were the only one talking to the peripherals. It is up to the
  operating system to schedule the processes and their requests.

- system virtualization works on the same principle but at a
  different level. The following slides present the different types
  of virtualization (standalone vs hosted, and emulation vs
  virtualization).
417 Hosted versus Standalone Virtualization
418 ---------------------------------------
420 - Hosted Virtualization
422 - Hosted VM Monitor (VMM) runs on top of native OS
423 - VMware WKS, Microsoft VirtualPC, QEMU/KVM, UML
425 - Standalone Virtualization
427 - VMM directly runs on bare hardware
428 - VMware ESX, IBM/VM, Xen
- An OS running in a VM is called a Guest OS
- standalone = autonomous, smaller
- in general, the "hosted" VMM does not really access the hardware but
- the KVM case is ambiguous: the kernel, running in root mode,
  really executes on the bare hardware.
446 Hosted Virtualization
447 ---------------------
449 .. figure:: hosted.svg
452 Example: VMware Workstation
453 ----------------------------
455 .. figure:: vmware-wks.svg
461 - Specific device drivers
463 - Guest OS executed in user mode
465 Standalone Virtualization
466 -------------------------
468 .. figure:: standalone.svg
474 .. figure:: vmware-esx.svg
479 - Supports unmodified OS binaries
481 - Configuration with appropriate device drivers
488 Process Level Virtualization: ABI Emulation
489 ===========================================
491 Process level ABI Emulation
492 ---------------------------
494 - Goal: execute binary applications of a given system **X** on the ABI of
497 - Emulate system **X** ABI on top of system *Y* ABI
499 - Emulation done by application-level code
- System *Y* must provide services equivalent to those of system
  **X** (file system, sockets, etc.)
504 - Example: **X** = Windows and *Y* = Linux
- example: the Windows CreateFile() call would be emulated by an open()
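The CreateFile()/open() example can be sketched in a few lines. This is
only an illustration of ABI emulation done in application-level code,
not how Wine is implemented: ``create_file`` is a hypothetical shim, it
handles only two creation dispositions, and the returned "handle" is
simply a POSIX file descriptor.

```python
import os

# Windows-style constants (values as in the Win32 headers).
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
CREATE_ALWAYS = 2
OPEN_EXISTING = 3


def create_file(path, desired_access, creation_disposition):
    """Hypothetical shim: emulate a CreateFile()-like call with POSIX open()."""
    can_read = bool(desired_access & GENERIC_READ)
    can_write = bool(desired_access & GENERIC_WRITE)
    if can_read and can_write:
        flags = os.O_RDWR
    elif can_write:
        flags = os.O_WRONLY
    else:
        flags = os.O_RDONLY
    if creation_disposition == CREATE_ALWAYS:
        flags |= os.O_CREAT | os.O_TRUNC   # create, or truncate if it exists
    # System Y (POSIX) provides the equivalent service; the "handle"
    # handed back to the application is just a file descriptor.
    return os.open(path, flags, 0o644)
```

The application keeps calling the system **X** interface; only the shim
knows that system *Y* services are used underneath.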
511 Process Level (ABI) Emulators
512 -----------------------------
- Wine: runs Windows applications on POSIX-compliant operating
517 - Windows API in userland
518 - Adobe Photoshop, Google Picasa, ...
520 - Cygwin: recompile POSIX applications so they can run under Windows
522 - Unix emulation on Windows
524 - Bash shell + many Unix commands
525 - GNU development tool chain (gcc, gdb)
526 - X Window, GNOME, Apache, sshd, ...
- **DEMO**: run a .exe with wine64
- the ABI depends on the operating system but also on the architecture.
- system calls differ between Linux and Windows
- moreover, system calls are not invoked the same way on two
  different architectures. For example, on x86, INT 0x80 is used
  (in fact SYSENTER nowadays), and the arguments are placed in
  specific registers
- Google Picasa for Linux ships an embedded version of Wine
541 Process Level Cross-architecture Emulators
542 ------------------------------------------
544 - Emulate the Operating System ABI
546 - Emulated OS and native OS are the same (ex: both are linux)
547 - Emulated arch is different than native architecture (ex: x86 and
549 - Note: we define what is "emulation" later in the presentation
558 $ powerpc-linux-gnu-gcc -static hello.c
560 bash: ./a.out: cannot execute binary file
- for example, you get hold of a Freebox or a router based on MIPS
  or ARM, and you want to run and debug an application.
569 Process Level Virtualization: virtual servers
570 =============================================
575 - Single OS kernel / Multiple resource instances
577 - can run several linux distributions on the same kernel
579 - Isolated kernel execution environments
582 - Network: Routing table, IP tables, interfaces...
585 - Solaris 10 Containers
586 - LXC, Linux-VServer, openVZ: namespaces and cgroups
- all processes are seen by the kernel
- processes have different views of the operating system and are
  partitioned. They are unaware of adjacent domains and have
  different views of the system (FS, network, ...).
- Linux namespaces are a good example (LXC, OpenVZ).
- XXX think about a demo... ?
- explain how this can be implemented in the kernel: an extra
  parameter for each system call
- mention that, security-wise, the isolation is not there yet.
- see the drawing on the next slide
608 - signal -> table of process ?
613 .. figure:: virtual-servers.svg
624 - Low memory footprint
631 - No OS heterogeneity
632 - Single OS binary instance (common point of failure)
634 System Level Virtualization: Transparent Hardware Emulation
635 ===========================================================
637 Transparent Hardware Emulation (1)
638 ----------------------------------
640 - Run unmodified OS binaries
642 - Includes emulation of physical devices
644 - Cross ISA Emulation
650 - VirtualBox (Intel x86)
652 Transparent Hardware Emulation (2)
653 ----------------------------------
655 - Emulate machine **X** on top of machine *Y*
657 - Interpretation: read, decode, execute
659 - 1 instruction of **X** executed by N instructions of *Y*
- Causes a huge slowdown
662 - Dynamic Binary Translation
- Convert blocks of **X** instructions into *Y* instructions
- Conversion is done once per basic block
- Advanced: dynamic optimization of 'hot' blocks
668 - The emulator is usually a standard application running on a native
- Explain how an emulator can be implemented: it is a big
  switch/case, each instruction must be parsed and its behavior
  emulated. The emulator must keep the state of the registers in
  variables.
- This is why we end up translating blocks of code. Note that
  dynamic translation is only done on the fly; taking the binary,
  converting it and then executing it (static translation) is
  harder.
683 - https://en.wikipedia.org/wiki/Binary_translation
684 - Dynamic binary translation looks at a short sequence of
685 code—typically on the order of a single basic block—then
686 translates it and caches the resulting sequence.
687 - Code is only translated as it is discovered and when possible, and
688 branch instructions are made to point to already translated and
689 saved code (memoization).
690 - Apple Computer implemented a dynamic translating emulator for M68K
691 code in their PowerPC line of Macintoshes, which achieved a very
692 high level of reliability, performance and compatibility
693 - Intel: IA32 over Itanium
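Both techniques can be sketched on a toy guest ISA. Everything here is
an assumption made for illustration (the tuple-based instruction
format, the ``li``/``add``/``addi``/``jnz``/``halt`` opcodes, the
function names); a real emulator decodes binary machine code and emits
host machine code. ``interpret`` is the big switch/case loop,
``translate_block`` converts one basic block into a host (Python)
function, and ``run_translated`` caches it so that each block is
translated only once.

```python
# Toy guest program: sum 5 + 4 + 3 + 2 + 1 into r0.
PROGRAM = [
    ("li", "r0", 0),      # 0: r0 = 0 (accumulator)
    ("li", "r1", 5),      # 1: r1 = 5 (loop counter)
    ("add", "r0", "r1"),  # 2: r0 += r1
    ("addi", "r1", -1),   # 3: r1 -= 1
    ("jnz", "r1", 2),     # 4: if r1 != 0 goto 2
    ("halt",),            # 5
]


def interpret(program):
    """Interpretation: read, decode, execute one guest instruction at a time."""
    regs, pc = {"r0": 0, "r1": 0}, 0   # guest registers kept in host variables
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op == "addi":
            regs[args[0]] += args[1]
        elif op == "jnz":
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "halt":
            return regs


def translate_block(program, start):
    """Translate the basic block starting at `start` into one host function."""
    lines, pc = ["def block(regs):"], start
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "li":
            lines.append(f"    regs[{args[0]!r}] = {args[1]}")
        elif op == "add":
            lines.append(f"    regs[{args[0]!r}] += regs[{args[1]!r}]")
        elif op == "addi":
            lines.append(f"    regs[{args[0]!r}] += {args[1]}")
        elif op == "jnz":    # a branch ends the basic block
            lines.append(f"    return {args[1]} if regs[{args[0]!r}] != 0 else {pc}")
            break
        elif op == "halt":
            lines.append("    return None")
            break
    namespace = {}
    exec("\n".join(lines), namespace)   # "emit" the host code
    return namespace["block"]


def run_translated(program):
    regs, pc, cache = {"r0": 0, "r1": 0}, 0, {}
    while pc is not None:
        if pc not in cache:              # translate once per basic block
            cache[pc] = translate_block(program, pc)
        pc = cache[pc](regs)             # then reuse the cached host code
    return regs, cache
```

The loop body is decoded on every iteration by ``interpret``, but only
once by ``run_translated``; later iterations run the cached block
directly.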
695 QEMU: Hosted Hardware Emulator
696 ------------------------------
698 - Cross ISA Emulation
700 - Emulate machine **X** on top of machine *Y*
702 - Interpretation + translation
704 - Intel x86, PowerPC, ARM, Mips, Sparc, ...
706 - Emulation of SMP architectures
708 - Emulates physical I/O devices
710 - Hard Disk drives, CD-ROM, network controllers, USB controllers, ...
711 - Synchronous emulation of device I/O operations
- **DEMO**: run Kid Icarus with mednafen
716 - ``mednafen -vdriver sdl -nes.xscale 4 -nes.yscale 4 ~/cours_ivan/cours_virt/Kid\ Icarus\ \(Europe\)\ \(Rev\ A\).zip``
717 - voir /usr/share/doc/mednafen/mednafen.html
718 - http://idoc64.free.fr/ASM/instruction.htm
- QSDZ = directions, Return = start, Tab = select, OP = buttons
- Alt-D shows the debugger
- address A6 decreases when lives are lost
- shift-W: write breakpoint, R to run
- we can try to write a large value:
- does not work, because it saturates
- shift P (poke in ROM): ED45 60 1 (we put an RTS)
  this is the place that saturates
- address DB6C is where A6 is stored after being hit
  LDA A6: loads the value
  BCS: branch on carry set (we understand that if the value is < 0, it is set to 0)
- 7E42 0 1 -> we put 0 on the monsters' decrement
740 System Level Virtualization: Transparent Hardware Virtualization
741 ================================================================
743 Transparent Hardware Virtualization
744 -----------------------------------
746 - Guest and host architectures are the same
748 - Execute native/unmodified OS binary images
750 - Provide in each VM a complete simulation of hardware
752 - Full CPU instruction set
753 - Interrupts, exceptions
754 - Memory access and MMU
757 - Share machine resources among multiple VMs
- the slide describes the problem, which is the same as for emulation
- maybe also give examples such as kqemu or VirtualBox
  (acceleration modules). Also mention that this still does not
  involve Intel VT, and that it is faster than emulation
- share machine resources: example of copy-on-write memory pages
769 Full CPU Virtualization (1)
770 ---------------------------
772 - Present same functional CPU to all Guest OSes
774 - VMM manages a CPU context for each vCPU of each VM
776 - saved copy of CPU registers
777 - representation of software-emulated CPU context
779 - VMM shares physical CPUs among all vCPU of VMs
781 - VMM includes a VM scheduler
- representation of software-emulated CPU context: for example,
  knowing whether interrupts are masked or not.
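The per-vCPU context and the VM scheduler can be sketched as follows.
This is a toy model under stated assumptions: ``VCpu``,
``PhysicalCpu`` and ``schedule`` are hypothetical names, the register
set is reduced to two fields, and "running the guest" is a stand-in
that just advances those fields.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VCpu:
    name: str
    # Saved copy of CPU registers while the vCPU is not running.
    regs: dict = field(default_factory=lambda: {"pc": 0, "acc": 0})
    # Software-emulated part of the context (e.g. interrupt mask status).
    interrupts_masked: bool = False


class PhysicalCpu:
    def __init__(self):
        self.regs = {"pc": 0, "acc": 0}

    def run_slice(self, steps):
        """Stand-in for guest execution: each step advances pc and acc."""
        for _ in range(steps):
            self.regs["pc"] += 1
            self.regs["acc"] += 2


def schedule(cpu, vcpus, rounds, slice_steps=3):
    """Round-robin VM scheduler: restore context, run a slice, save context."""
    queue = deque(vcpus)
    for _ in range(rounds):
        vcpu = queue.popleft()
        cpu.regs = dict(vcpu.regs)    # restore this vCPU's registers
        cpu.run_slice(slice_steps)
        vcpu.regs = dict(cpu.regs)    # save them back on preemption
        queue.append(vcpu)
```

Each vCPU resumes exactly where it stopped, so every Guest OS sees the
same functional CPU even though the physical CPU is shared.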
791 Full CPU Virtualization (2)
792 ---------------------------
794 - Relationships between a VMM and VMs similar to relationships between
795 native OS and applications
797 - Guarantee mutual isolation between all VMs
798 - Protect VMM from all VMs
800 - Directly execute native binary images of Guest OS's in
- VMM emulates access to protected resources performed by Guest OSes
808 - Run each Guest OS in non-privileged mode
810 .. figure:: cpu-virt.svg
813 "Hardware-Sensitive" Instructions
814 ---------------------------------
816 - Interact with protected hardware resources
818 - Privileged Instructions (cannot be executed in user mode)
819 - Critical Instructions (can be, but should not be executed by Guest OS)
821 - Must be detected and faked by VMM
823 - Dynamic Binary Translation of kernel code
825 - Done once, saved in Translation Cache
- privileged instructions: e.g. masking interrupts
- critical instructions: e.g. reading the status flags, CR3, ...
834 Privileged Instructions Virtualization
835 --------------------------------------
837 - Only allowed in supervisor mode
839 - Ex: **cli/sti** to mask/unmask interrupts on Intel x86
841 - When executed in non-privileged mode
843 - CPU automatically detects a privilege violation
844 - Triggers a “privilege-violation” exception
846 - Caught by VMM which fakes the expected effect of the privileged
851 - VMM does not mask/unmask CPU interrupts
852 - records « interrupt mask status » in context of VM
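The trap-and-emulate sequence for **cli/sti** can be sketched as
follows. This is a simulation under stated assumptions, not real
hypervisor code: ``Cpu``, ``Vmm`` and ``PrivilegeViolation`` are
hypothetical names, and instructions are plain strings.

```python
class PrivilegeViolation(Exception):
    """Raised by the simulated CPU when user mode runs a privileged instruction."""


class Cpu:
    def __init__(self):
        self.interrupts_enabled = True   # the real interrupt flag, owned by the VMM
        self.user_mode = True            # guests always run in non-privileged mode

    def execute(self, instr):
        if instr in ("cli", "sti") and self.user_mode:
            raise PrivilegeViolation(instr)
        # Non-privileged instructions would execute natively here.


class Vmm:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_if = {}             # per-VM "interrupt mask status"

    def run(self, vm_id, instructions):
        self.virtual_if.setdefault(vm_id, True)
        for instr in instructions:
            try:
                self.cpu.execute(instr)            # direct execution
            except PrivilegeViolation as trap:     # caught by the VMM
                # Fake the expected effect: record it in the VM context only,
                # the physical interrupt flag is left untouched.
                self.virtual_if[vm_id] = (trap.args[0] == "sti")
```

After a guest ``cli``, the VM context says "interrupts masked" while
the physical CPU still receives interrupts on behalf of the VMM.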
854 Critical Instructions Virtualization (1)
855 ----------------------------------------
857 - Hardware-sensitive instructions
859 - Ex: Intel IA-32 pushf/popf::
    pushf   /* save EFLAGS reg. to stack */
    cli     /* mask interrupts => clear EFLAGS.IF */
    popf    /* restore EFLAGS reg. => unmask interrupts */
866 - When executed in non-privileged mode
- The cli instruction triggers an exception caught by VMM => VMM
  records interrupts as masked for current VM
871 - But no exception for popf => VMM not aware of Guest OS action
- first problem: pushf is allowed and always pushes flags saying
  that interrupts are enabled
- popf must also be intercepted because the VMM must update the
881 Critical Instructions Virtualization (2)
882 ----------------------------------------
884 - Must be detected and emulated by VMM
886 - VMM dynamically analyses Guest OS binary code to find critical instructions
888 - VMM replaces critical instructions by a « trap » instruction to enter the VMM
890 - VMM emulates expected effect of critical instruction, if any.
- XXX should the translation be applied only to code that is meant
  to run in ring 0?
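The detect-and-replace steps above can be sketched as follows. It is a
simplification under stated assumptions: guest code is a list of
mnemonics rather than binary code, ``patch``, ``Vm`` and ``run`` are
hypothetical names, and only the interrupt flag is modelled.

```python
CRITICAL = {"pushf", "popf"}   # run in user mode without faulting, but misbehave


def patch(code):
    """Replace each critical instruction by a trap that remembers the original."""
    return [("trap", op) if op in CRITICAL else (op,) for op in code]


class Vm:
    def __init__(self):
        self.virtual_if = True   # interrupt flag as the Guest OS believes it to be
        self.stack = []


def run(vm, patched):
    for instr in patched:
        if instr == ("cli",):            # privileged: faults into the VMM
            vm.virtual_if = False
        elif instr == ("sti",):
            vm.virtual_if = True
        elif instr[0] == "trap":         # critical: explicit entry into the VMM
            if instr[1] == "pushf":
                vm.stack.append(vm.virtual_if)    # push the *virtual* flag
            else:                                 # popf
                vm.virtual_if = vm.stack.pop()    # restore it, VMM stays in sync
        # Any other instruction executes natively.
```

With the patched code, the pushf/cli/popf sequence leaves the VMM's
view of the interrupt flag consistent with what the Guest OS expects.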
898 Full Memory Virtualization
899 --------------------------
- CPU includes a Memory Management Unit (MMU)
- Isolated memory address spaces
- Independent of underlying physical memory layout
905 - Run mutually protected applications in parallel
907 - Virtual Memory managed by OS kernel
909 - Provides a virtual address space to each process
911 - 4 GB on most 32-bit architectures (Intel x86, PowerPC)
913 - Manages virtual page → physical page mappings
914 - Manages « swap » space to extend physical memory
- the MMU is a hardware component
920 Reminder about MMU (1)
921 ----------------------
923 - Here is a minimal code example:
    # a program that takes x and y in memory, and
    mov 0x200000,%eax   # retrieve <x> in eax
    mov 0x200004,%ebx   # retrieve <y> in ebx
    add %ebx,%eax       # compute x+y in eax
    mov %eax,0x200008   # save the result in memory
- This program can run on one CPU
- If the addresses are physical, it is not possible to run multiple
  instances of this program as they would modify the same memory
- a bad solution is to modify the binary at each execution
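What the MMU does instead can be sketched as follows. This is a
minimal model under stated assumptions: a single-level page table
represented as a Python dict, 4 KB pages, and arbitrary frame numbers
chosen for the example.

```python
PAGE_SIZE = 4096


def translate(page_table, vaddr):
    """Translate a virtual address with a single-level page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number + offset
    if vpn not in page_table:
        raise MemoryError(f"page fault at {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + offset


# Both processes use virtual address 0x200000, as in the program above,
# but the OS maps virtual page 0x200 to a different physical frame for each.
page_table_a = {0x200: 0x1A}
page_table_b = {0x200: 0x2B}
```

The same unmodified binary can then run twice: ``0x200004`` becomes
physical ``0x1A004`` in the first process and ``0x2B004`` in the
second.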
942 Reminder about MMU (2)
943 ----------------------
945 .. figure:: mmu-slide1.svg
948 Reminder about MMU (3)
949 ----------------------
951 .. figure:: mmu-slide2.svg
954 Reminder about MMU (4): Intel x86 MMU
955 -------------------------------------
960 Memory Virtualization (1)
961 -------------------------
963 .. figure:: mmu-slide3.svg
966 Memory Virtualization (2)
967 -------------------------
969 - Machine Physical Memory
971 - Physical memory available on the machine
973 - Guest OS Physical Memory
975 - Part of machine memory assigned to a VM by VMM
977 - ∑ Guest Physical Memory can be > Machine Memory
979 - VMM uses « swap » space
981 - Guest OS Virtual Memory
983 - Guest OS manages virtual address spaces of its processes
985 Memory Virtualization (3)
986 -------------------------
988 - Guest OS manages Guest Physical Pages
990 - Manages MMU with its own page entries
991 - Translates Virtual Addresses into Guest Physical Addresses (GPA)
993 - VMM transparently manages Machine Physical Pages
995 - Guest Physical Address ≠ Machine Physical Address
996 - VMM dynamically translates Guest Physical Pages into Machine
999 Memory Virtualization (4)
1000 -------------------------
1002 .. figure:: mmu-slide4.svg
- switch to an animated view, explain how translations are done,
  talk about the TLB,
  show the chronological order of events
- zoom out, static view:
  virtual memory vs VM memory vs host physical memory,
  no MMU in this case
- show in an animated view, with a single MMU, how
  the hypervisor configures the MMU
  use the same colors for the memory types
  we will hijack the MMU to perform the translation that we
- put a number in CR3
- put vertical bars in the page tables
- TLB drawn as a table with empty rows
- zoom on the PTEs on the right
- show the addresses
- separate the values of virtual and physical addresses, use
  different colors for these addresses
- see whether we can show only the 20 significant bits and not the
  12 offset bits when talking about addresses
- When the guest accesses CR3 (or a PTE), this generates a fault
  handled by the VMM. The VMM translates the address given by the
  VM's OS and loads CR3 with the physical address corresponding to
  the area used by the VM to store its page tables. Every access to
  this page table must generate a fault so that the VMM is notified
  of any change and can configure the real MMU accordingly (by
  performing the address translation). (slide 47)
- Reading CR3 does not generate a trap, so it must be handled like
  pushf and popf, i.e. with code translation.
1044 Memory Virtualization (5)
1045 -------------------------
1047 - VMM maintains Shadow Page Tables
1049 - Copies of Guest OS translation tables
1051 - VMM catches updates operations of translation tables performed by a
1054 - RW-protect all guest OS page tables
1055 - Emulates operation in shadow page table
1056 - Updates effective MMU page table entry, if needed
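The shadow page table mechanism can be sketched as follows. This is a
simplified model under stated assumptions: ``ShadowMmu`` and its method
names are hypothetical, page tables are single-level dicts indexed by
page number, and the trap caused by the write-protection is represented
by a direct method call.

```python
class ShadowMmu:
    """Shadow page table sketch: guest virtual page -> machine physical page."""

    def __init__(self, gpa_to_mpa):
        self.gpa_to_mpa = gpa_to_mpa   # VMM map: guest physical page -> machine page
        self.guest_pt = {}             # guest table: virtual page -> guest physical page
        self.shadow_pt = {}            # table the real MMU actually walks

    def guest_write_pte(self, vpn, gpn):
        """Guest page tables are write-protected, so this write traps into the VMM."""
        self.guest_pt[vpn] = gpn                     # emulate the guest's write
        self.shadow_pt[vpn] = self.gpa_to_mpa[gpn]   # update the effective entry

    def guest_clear_pte(self, vpn):
        self.guest_pt.pop(vpn, None)
        self.shadow_pt.pop(vpn, None)   # never keep a mapping the guest removed
```

The Guest OS only ever sees guest physical page numbers in its own
table, while the hardware only ever sees machine page numbers in the
shadow table.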
1058 Memory Virtualization (6)
1059 -------------------------
1061 - PTE entries can be tagged with a context ID
- Avoids flushing the TLB when switching current address space upon
  scheduling of a new process
1066 - usually PTE tag = OS process identifier
1068 - Processes of different Guest OSes can be assigned the same Process
1071 - VMM must flush TLB when switching VMs
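Why the flush is needed can be sketched with a toy tagged TLB. The
``TaggedTlb`` class and the numbers are assumptions made for the
example; real TLBs are hardware structures with limited capacity.

```python
class TaggedTlb:
    def __init__(self):
        self.entries = {}   # (context ID, virtual page) -> physical page

    def insert(self, ctx, vpn, ppn):
        self.entries[(ctx, vpn)] = ppn

    def lookup(self, ctx, vpn):
        return self.entries.get((ctx, vpn))   # None means TLB miss

    def flush(self):
        self.entries.clear()


tlb = TaggedTlb()
tlb.insert(ctx=1, vpn=5, ppn=100)         # VM A, process ID 1
stale = tlb.lookup(ctx=1, vpn=5)          # VM B, process ID 1: wrong hit
tlb.flush()                               # what the VMM does on a VM switch
after_flush = tlb.lookup(ctx=1, vpn=5)    # miss: VM B's own tables are walked
```

Because both Guest OSes picked the same process identifier, the tag
alone cannot tell their entries apart.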
1073 Memory Virtualization (7)
1074 -------------------------
1076 - VMM must respect Guest OS virtual page faults
1078 - Not map virtual pages unmapped by Guest OS
1079 - When Guest OS unmaps a virtual page:
1081 - VMM must delete the associated real-page/physical page
1084 - Conversely, VMM can transparently:
1086 - Introduce & resolve real-page faults for Guest OSes
1087 - Share physical pages between Guest OS's
- Pages with the same content (e.g. zeroed pages)
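Transparent page sharing can be sketched as follows. This is a toy
model under stated assumptions: ``SharingVmm`` is a hypothetical name,
pages are indexed directly by their content (a real VMM would hash
pages and compare them), and a write to a shared frame is resolved by
copy-on-write.

```python
class SharingVmm:
    def __init__(self):
        self.frames = {}       # machine frame number -> page content
        self.refcount = {}     # machine frame number -> guest pages mapped to it
        self.by_content = {}   # page content -> shareable machine frame
        self.mapping = {}      # (vm, guest page) -> machine frame number
        self.next_frame = 0

    def _alloc(self, content):
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = content
        self.refcount[frame] = 0
        return frame

    def map_page(self, vm, gpn, content):
        frame = self.by_content.get(content)
        if frame is None:                       # first page with this content
            frame = self._alloc(content)
            self.by_content[content] = frame
        self.refcount[frame] += 1
        self.mapping[(vm, gpn)] = frame

    def write(self, vm, gpn, content):
        """Copy-on-write: a write to a shared frame gets a private copy first."""
        old = self.mapping[(vm, gpn)]
        if self.refcount[old] > 1:
            self.refcount[old] -= 1
            new = self._alloc(content)
            self.refcount[new] = 1
            self.mapping[(vm, gpn)] = new
        else:
            if self.by_content.get(self.frames[old]) == old:
                del self.by_content[self.frames[old]]   # content is changing
            self.frames[old] = content
```

Two guests mapping a zeroed page end up on the same machine frame, and
neither of them can observe the sharing.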
1091 Memory Virtualization (8)
1092 -------------------------
1094 - VMM can swap real pages of a VM
1096 - on "swap" space managed by VMM
1098 - VMM can dynamically distribute physical memory among VMs
1100 - Needs a specific support in Guest OS (Linux module)
1102 - VMM asks Guest OS to release memory
1104 - Guest OS self-allocates real pages
- no longer available for normal kernel allocation service
- VMM assigns same amount of physical pages to other VMs
- ballooning: a kernel module lives in the guests and communicates
  with the VMM. If the VMM needs physical memory for another VM, it
  can ask the module to allocate memory, which is then lost to the
  other services. This memory is
- more details and sources needed on this
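The ballooning exchange can be sketched as follows. This is a toy
model under stated assumptions: ``Guest`` and ``rebalance`` are
hypothetical names, memory is counted in pages, and the receiving
guest gets memory back by deflating its own balloon.

```python
class Guest:
    def __init__(self, free_pages, balloon=0):
        self.free_pages = free_pages   # pages the guest kernel can still allocate
        self.balloon = balloon         # pages pinned by the balloon driver

    def inflate(self, n):
        """Balloon driver allocates up to n pages from the guest's own allocator."""
        taken = min(n, self.free_pages)
        self.free_pages -= taken
        self.balloon += taken
        return taken                   # these pages are handed back to the VMM

    def deflate(self, n):
        """Balloon driver releases up to n pages back to the guest kernel."""
        released = min(n, self.balloon)
        self.balloon -= released
        self.free_pages += released
        return released


def rebalance(donor, receiver, n):
    """VMM side: reclaim pages from one guest and give them to another."""
    reclaimed = donor.inflate(n)
    # Pages the receiver cannot absorb would stay with the VMM (not modelled).
    return receiver.deflate(reclaimed)
```

The donor's kernel sees ordinary memory pressure; it does not need to
know that the pages went to another VM.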
1117 System Level Virtualization: Paravirtualization
1118 ===============================================
1120 CPU Paravirtualization
1121 ----------------------
1123 - Still run each Guest OS in non-privileged mode, but with minimal
1124 virtualization overhead
1126 - OS adaptation to avoid binary translation overhead
1128 - Remove Hardware-Sensitive Instructions, use fast VMM system calls
1129 - Minimize/avoid usage of Privileged Instructions
- Only affects the machine/CPU-dependent part of the OS
- OS port to a new architecture with the same CPU, without system ISA
1135 - Examples: Xen legacy, User Mode Linux (UML), CoLinux
1137 I/O Paravirtualization
1138 ----------------------
1140 - Multiplexing VMM physical devices among VMs
1142 - Front-end driver in Guest OS
1143 - Back-end driver in VMM
1144 - Virtual ethernet, virtual disks
1146 - Fast virtual devices for VM to VM communications
1150 - Data transfer through syscalls, shmem, rings, ...
1151 - Pros: scalability, VM migration
1156 .. figure:: virt-devices.svg
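The front-end/back-end split can be sketched as follows. This is a
simplified model under stated assumptions: ``Ring``, ``FrontEnd`` and
``BackEnd`` are hypothetical names, the shared-memory ring is a pair of
Python deques, and the physical device is a plain list that collects
the frames.

```python
from collections import deque


class Ring:
    """Shared-memory ring between a front-end and a back-end driver."""

    def __init__(self, size):
        self.size = size
        self.requests = deque()
        self.responses = deque()


class FrontEnd:                        # lives in the Guest OS
    def __init__(self, ring):
        self.ring = ring

    def send(self, frame):
        if len(self.ring.requests) >= self.ring.size:
            return False               # ring full: caller must retry later
        self.ring.requests.append(frame)
        return True                    # a real driver would now notify the back-end


class BackEnd:                         # lives in the VMM
    def __init__(self, ring, device):
        self.ring, self.device = ring, device

    def poll(self):
        while self.ring.requests:
            frame = self.ring.requests.popleft()
            self.device.append(frame)                # hand over to the physical device
            self.ring.responses.append(len(frame))   # completion for the front-end
```

No device register is emulated: the Guest OS knows it is virtualized
and talks to the VMM through the ring only.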
1159 Paravirtualization Example: Xen Legacy
1160 --------------------------------------
1165 - Share resources of Server machines
1167 - Intel IA-32, x86-64, ARM, ...
1169 - Special first Guest OS called Domain 0
- Runs in privileged mode
- Has access to (and manages) all physical devices
1173 - Modified version of Linux, FreeBSD
XXX double-check the Domain 0 point
1179 System Level Virtualization: Hardware-Assisted Virtualization
1180 =============================================================
1182 Hardware Assisted Virtualization (1)
1183 ------------------------------------
1185 - Support of Virtualization in Hardware
1186 - Run unmodified OS binaries
1187 - With minimal virtualization overhead
1188 - Simplify VMM development
1194 Hardware Assisted Virtualization (2)
1195 ------------------------------------
1197 - CPU virtualization
1200 - Intel VT-x (x86), Intel VT-i (Itanium) architectures
1203 - MMU virtualization
1205 - Intel Extended Page Tables (EPT)
1206 - AMD Nested Page Tables (NPT)
1208 Hardware Assisted Virtualization (3)
1209 ------------------------------------
1211 - Directed I/O virtualization
1213 - IO-MMU (Intel VT-d)
1215 - I/O Device virtualization
1217 - Self-Virtualizing devices
1218 - Single Root I/O Virtualization and Sharing Specification (SR-IOV)
1219 - Extensions to PCIe (PCI Express) Bus standard
1221 Intel VT-x Architecture
1222 -----------------------
1224 - Support unmodified Guest OS with no need for paravirtualization
1225 and/or binary code translation
1227 - Simplify VMM tasks & improve VMM performance
1229 - Minimize VMM memory footprint
1231 - Suppress shadowing of Guest OS page tables
1233 - Enable Guest OS to directly manage I/O devices
1235 - Without performance loss
1236 - While enforcing VM isolation and mutual protection
1238 Intel VT-x Architecture Overview
1239 --------------------------------
1241 .. figure:: vt-x.svg
1244 Intel VT-x CPU Virtualization (1)
1245 ---------------------------------
1247 - Virtual Machine eXtension (VMX)
1249 - Two new meta-modes of CPU operation
1253 - Behaviour similar to IA-32 without VT
1254 - Intended for VMM execution
1258 - Alternative IA-32 execution environment
1259 - Controlled by a VMM
1260 - Designed to run unchanged Guest OS in a VM
1262 - Both modes support rings 0-3 privilege levels
1264 - Allow VMM to use several privilege levels
1266 Intel VT-x CPU Virtualization (2)
1267 ---------------------------------
1269 - Two additional CPU mode transitions
1271 - From VMX root-mode to VMX non-root mode
1273 - Named VM Entry (VMLAUNCH or VMRESUME instruction)
1275 - From VMX non-root mode to VMX root mode
1277 - Named VM Exit (event)
1279 - VM entries & VM exits use a new data structure
1281 - Virtual Machine Control Structure (VMCS) per VM CPU (vCPU)
1282 - Referenced by its physical memory address
1283 - Format and layout hidden
1284 - New VT-x instructions to access a VMCS
1286 Intel VT-x CPU Virtualization (3)
1287 ---------------------------------
1291 - Saved value of registers before being changed by
1292 VM Exits (e.g. Segment Registers, CR3, IDTR)
1294 - Hidden CPU state (e.g., CPU Interruptibility State)
1300 - Interrupt Virtualization
1301 - Exceptions bitmaps
1303 - Model Specific Register R/W bitmaps
1304 - Execution rights for CPU Privileged Instructions
1308 - The host state area is where the processor state of the VMM is
1309 stored. It is restored on VM exit.
1311 - Switching from root mode to non-root mode is called "VM entry", the
1312 switch back is "VM exit". The VMCS includes a guest and host state
1313 area which is saved/restored at VM entry and exit. Most importantly,
1314 the VMCS controls which guest operations will cause VM exits.
1316 The VMCS provides fairly fine-grained control over what the guests
1317 can and can't do. For example, a hypervisor can allow a guest to
1318 write certain bits in shadowed control registers, but not
1319 others. This enables efficient virtualization in cases where guests
1320 can be allowed to write control bits without disrupting the
1321 hypervisor, while preventing them from altering control bits over
1322 which the hypervisor needs to retain full control. The VMCS also
1323 provides control over interrupt delivery and exceptions.
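The control-register example above can be sketched in a few lines of Python. This is a simplified model, not VMM code: it assumes the CR0 guest/host mask and read shadow behave as bit masks over the register, and the helper names `guest_read_cr0` and `guest_write_cr0_exits` are hypothetical.

```python
CR0_PE = 1 << 0   # Protection Enable
CR0_TS = 1 << 3   # Task Switched
CR0_PG = 1 << 31  # Paging


def guest_read_cr0(real_cr0, mask, shadow):
    """Bits owned by the hypervisor (mask bit = 1) are read from the shadow,
    all other bits come from the real register."""
    return (real_cr0 & ~mask) | (shadow & mask)


def guest_write_cr0_exits(new_value, mask, shadow):
    """A guest write causes a VM exit only if it tries to give a
    hypervisor-owned bit a value different from the read shadow."""
    return ((new_value ^ shadow) & mask) != 0
```

With `mask = CR0_PE | CR0_PG`, the guest can toggle `CR0_TS` freely without any exit, while an attempt to change the paging bit is reported to the hypervisor.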
1325 Whenever an instruction or event causes a VM exit, the VMCS contains
1326 information about the exit reason, often with accompanying
1327 detail. For example, if a write to the CR0 register causes an exit,
1328 the offending instruction is recorded, along with the fact that a
1329 write access to a control register caused the exit, and information
1330 about source and destination register. Thus the hypervisor can
1331 efficiently handle the condition without needing advanced techniques
1332 such as CSAM and PATM described above.
1334 VT-x inherently avoids several of the problems which software
1335 virtualization faces. The guest has its own completely separate
1336 address space not shared with the hypervisor, which eliminates
1337 potential clashes. Additionally, guest OS kernel code runs at
1338 privilege ring 0 in VMX non-root mode, obviating the problems caused by
1339 running ring 0 code at less privileged levels. For example the
1340 SYSENTER instruction can transition to ring 0 without causing
1341 problems. Naturally, even at ring 0 in VMX non-root mode, any I/O
1342 access by guest code still causes a VM exit, allowing for device emulation.
1345 - The entire visible processor state is saved to or restored from the VMCS
1348 - all registers, including control registers
1349 - interruptibility state
1351 - The VMCS describes what the VM is allowed to do
1353 - I/O bitmaps = bit field stating which I/O ports (IN and OUT
1354 instructions) are allowed.
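As an illustration of the I/O bitmap idea, here is a minimal Python sketch. It assumes one bit per I/O port where a set bit means "this access causes a VM exit"; the class name `IOBitmap` is hypothetical, the single 8 KB array stands in for the two 4 KB bitmaps used by VT-x, and trapping every port by default is a choice of this sketch.

```python
class IOBitmap:
    """One bit per I/O port (0x0000-0xFFFF); a set bit means 'trap to the VMM'."""

    def __init__(self):
        # Sketch default: every port traps until explicitly allowed.
        self.bits = bytearray([0xFF] * (65536 // 8))

    def allow(self, port):
        """Let the guest access this port directly (clear its bit)."""
        self.bits[port >> 3] &= ~(1 << (port & 7)) & 0xFF

    def causes_vm_exit(self, port):
        """True if an IN/OUT on this port would exit to the VMM."""
        return bool(self.bits[port >> 3] & (1 << (port & 7)))
```

A VMM that passes a serial port through to a guest would call `allow(0x3F8)` and keep emulating every other port.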
1356 Intel VT-x Interrupt Virtualization
1357 -----------------------------------
1359 - VMCS External Interrupt Exiting
1361 - All external interrupts cause VM Exit
1362 - Guest OS cannot mask external interrupts when executing Interrupt
1363 Masking instructions
1365 - VMCS Interrupt Window Exiting
1367 - VM Exit occurs whenever the Guest OS is ready to serve external interrupts
1369 - Used by VMM to control VM interrupts
1373 - The window makes it possible to delay the hardware interrupt (and
1374 thus the VM exit) until the guest has unmasked its interrupts.
1376 - VT-x also includes an interrupt-window exiting VM-execution
1377 control. When this control is set to 1, a VM exit occurs whenever
1378 guest software is ready to receive interrupts. A VMM can set this
1379 control when it has a virtual interrupt to deliver to a
1380 guest. Similarly, VT-i includes a PAL service that a VMM can use to
1381 register that it has a virtual interrupt pending. When guest
1382 software is ready to receive such an interrupt, the service
1383 transfers control to the VMM via the new virtual external interrupt
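The interrupt-window mechanism can be modelled as a tiny state machine. The sketch below is a simplification under stated assumptions: `VCpu` and `Vmm` are hypothetical classes, the guest's interrupt flag is a boolean, and injection delivers all pending vectors at once.

```python
class VCpu:
    def __init__(self):
        self.interrupts_enabled = False        # guest interrupt flag
        self.interrupt_window_exiting = False  # models the VMCS control bit
        self.pending = []                      # virtual interrupts queued by the VMM
        self.delivered = []                    # vectors actually seen by the guest


class Vmm:
    def raise_virtual_irq(self, vcpu, vector):
        vcpu.pending.append(vector)
        if vcpu.interrupts_enabled:
            self.inject(vcpu)
        else:
            # Guest has masked interrupts: request an exit when it unmasks them.
            vcpu.interrupt_window_exiting = True

    def inject(self, vcpu):
        vcpu.delivered.extend(vcpu.pending)
        vcpu.pending.clear()
        vcpu.interrupt_window_exiting = False

    def guest_unmasks_interrupts(self, vcpu):
        """Returns True if unmasking triggered an interrupt-window VM exit."""
        vcpu.interrupts_enabled = True
        if vcpu.interrupt_window_exiting:
            self.inject(vcpu)
            return True
        return False
```

The VMM never forces an interrupt into a guest that has masked them; it simply arms the control and waits for the exit.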
1386 Intel VT-x MMU Virtualization
1387 -----------------------------
1389 - Extended Page Tables (EPT)
1391 - Second level of Page Tables in MMU
1392 - Translate Guest OS Physical Address into Machine Physical Address
1395 - Virtual Processor IDentifier (VPID)
1397 - Used to tag TLB entries
1398 - Avoids flushing the TLB on VM switch
1400 Virtual Memory Virtualization
1401 -----------------------------
1403 .. figure:: vt-x-mem.svg
1406 Intel VT-x Extended Page Tables (1)
1407 -----------------------------------
1409 - VMM controls Extended Page Tables
1411 - EPT used in VMX non-root operation
1413 - Activated on VM Enter
1414 - Deactivated on VM exit
1416 - EPTP register points to Extended Page Tables
1418 - Instantiated by VMM
1420 - Loaded from VMCS on VM entry
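The two translation stages can be sketched in Python as two chained lookups. This is a toy model: page tables are flat dictionaries rather than multi-level trees, and the names `translate` and `guest_access` are hypothetical.

```python
PAGE = 4096


def translate(table, addr):
    """Single lookup: page number -> frame number, page offset preserved."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise LookupError(f"translation fault at {addr:#x}")
    return table[page] * PAGE + offset


def guest_access(guest_page_table, ept, gva):
    """Guest virtual -> guest physical -> machine physical."""
    gpa = translate(guest_page_table, gva)  # table managed by the Guest OS
    return translate(ept, gpa)              # table managed by the VMM
```

A missing entry in `ept` plays the role of an EPT violation: the guest cannot reach machine memory the VMM has not mapped for it.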
1423 Intel VT-x Extended Page Tables (2)
1424 -----------------------------------
1426 .. figure:: vt-x-mmu.svg
1431 - The TLB caches both translations: VA->GPA and GPA->MPA
1433 - There is only one downside: nested paging or EPT makes the virtual
1434 to real physical address translation a lot more complex if the TLB
1435 does not have the right entry. For each step we take in the blue
1436 area, we need to do all the steps in the orange area. Thus, four
1437 table searches in the "native situation" have become 16 searches
1438 (for each of the four blue steps, four orange steps).
1440 http://www.anandtech.com/show/2480/10
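The cost of a TLB miss under nested paging can be put into numbers with a short Python sketch. The first function reproduces the simple count used above (one EPT walk per guest level). The second is an addition to the quoted text: it also counts the reads of the guest entries themselves and the final walk for the resulting guest physical address, which gives the commonly quoted figure of 24 memory references for 4-level tables. Function names are hypothetical.

```python
def searches_as_counted_above(guest_levels, ept_levels):
    """One full EPT walk for each level of the guest page table."""
    return guest_levels * ept_levels


def memory_references(guest_levels, ept_levels):
    """Every guest-level pointer and the final guest physical address need an
    EPT walk, plus one read per guest level: (g + 1) * (e + 1) - 1."""
    return (guest_levels + 1) * (ept_levels + 1) - 1
```

With `ept_levels = 0` the second formula falls back to the native cost of one read per guest level.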
1445 .. figure:: tlb-flush-issue.svg
1450 - Two processes in different VMs can use the same virtual address
1453 Intel VT-x Virtual Processor Identifier
1454 ---------------------------------------
1456 - 16-bit VPID used to tag TLB entries
1458 - Enabled by VMM in VMCS
1459 - Unique VPID is assigned by VMM to each VM
1460 - VPID 0 reserved for VMM
1462 - Current VPID is 0x0000 when
1464 - Outside VMX operation
1465 - In VMX root mode operation
1466 - In VMX non-root mode if VPID disabled in VMCS
1468 - VPID loaded from VMCS on VM Enter
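A VPID-tagged TLB can be modelled as a dictionary keyed by (VPID, virtual page). The sketch below is illustrative only; `TaggedTlb` and its methods are hypothetical names and no replacement policy is modelled.

```python
class TaggedTlb:
    """TLB whose entries are tagged with a 16-bit VPID."""

    def __init__(self):
        self.entries = {}  # (vpid, virtual page) -> machine frame

    def fill(self, vpid, vpage, frame):
        assert 0 <= vpid <= 0xFFFF, "VPID is 16 bits"
        self.entries[(vpid, vpage)] = frame

    def lookup(self, vpid, vpage):
        """Returns the cached frame, or None on a TLB miss."""
        return self.entries.get((vpid, vpage))

    def flush_vpid(self, vpid):
        """Drop only the entries of one VM, leaving the others cached."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vpid}
```

Because the tag is part of the key, two VMs using the same virtual page never see each other's translations, and switching VMs does not require emptying the whole TLB.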
1472 - Demo: run Windows in a KVM guest and browse the Device Manager to
1473 show that the devices are not at all the ones in my PC. It also
1474 makes a good transition to the next part.
1477 .. Intel Virtualization Technology for Directed I/O
1478 ================================================
1480 Intel VT-d Principles
1481 ---------------------
1483 - Enable Guest OS to directly manage physical I/O devices
1485 - Guest I/O operations bypass VMM
1487 - In fully transparent mode
1489 - Use native device drivers of Guest OS
1490 - Guest OS unaware of underlying physical memory virtualization
1492 - Enforce isolation between Guest VMs
1494 - Guest OS can only access I/O resources (ports, PCI devices) assigned to it
1495 - PCI I/O device can only perform DMA to machine physical pages assigned to
1496 Guest VM owning that device.
1501 .. figure:: dma-virt.svg
1510 DMA Virtualization Issue
1511 ------------------------
1513 - Guest OS driver sets up the I/O registers of the device with the Guest Physical
1514 Address of the I/O buffers
1516 - Guest Physical Address must be translated into its corresponding
1517 Machine Physical Address when used for DMA operations by device
1519 - GPA -> MPA translation cannot be done by VMM
1521 - VMM cannot catch device-specific driver operations to set up I/O
1524 - GPA -> MPA translation done by IOMMU on the Bus Controller
1526 Intel VT-d Protection Domains
1527 -----------------------------
1529 - Intel VT-d provides DMA Protection Domain
1531 - Extension of IOMMU translation mechanism
1532 - Isolated context of a subset of the Machine Physical Memory
1533 - Corresponds to the portion of Machine Physical Memory allocated to
1536 - I/O devices associated by VMM with a DMA Protection Domain
1538 - Achieves DMA isolation by restricting memory access of I/O devices
1539 through DMA address translation
1541 Intel VT-d DMA Translation
1542 --------------------------
1544 - VT-d hardware treats address specified in DMA request as DMA Virtual
1547 - DVA = GPA of the VM to which the I/O device is assigned
1549 - VT-d translates the DVA into its corresponding Machine Physical
1552 - Support of multiple Protection Domains
1554 - DVA to MPA translation table per Protection Domain
1555 - Must identify the device issuing a DMA request
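Per-domain DMA translation can be sketched as follows. This is a simplified Python model under stated assumptions: `ProtectionDomain` and `Iommu` are hypothetical classes, the translation table is a flat dictionary, and a fault is reported as a Python exception.

```python
PAGE = 4096


class ProtectionDomain:
    def __init__(self, dva_to_mpa):
        self.dva_to_mpa = dva_to_mpa  # DVA page -> (machine frame, writable)


class Iommu:
    def __init__(self):
        self.device_domain = {}  # requester ID -> ProtectionDomain

    def attach(self, requester_id, domain):
        self.device_domain[requester_id] = domain

    def translate(self, requester_id, dva, write):
        """Translate a DMA address issued by a device, or raise on a fault."""
        domain = self.device_domain.get(requester_id)
        if domain is None:
            raise PermissionError("device not assigned to any domain")
        page, offset = divmod(dva, PAGE)
        entry = domain.dva_to_mpa.get(page)
        if entry is None or (write and not entry[1]):
            raise PermissionError(f"DMA fault at DVA {dva:#x}")
        return entry[0] * PAGE + offset
```

The lookup starts from the identity of the requesting device, which is why the hardware has to know which device issued each DMA request.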
1557 VT-d PCI Express North Bridge
1558 -----------------------------
1560 .. figure:: vt-d.svg
1563 PCI DMA Requester Identification
1564 --------------------------------
1566 - Mapping between PCI Device and Protection Domains
1567 - 16-bit PCI DMA Requester Identifier
1569 .. figure:: dma-req-id.svg
1572 - Assigned by PCI configuration software
1573 - Bus # indexes Bus Context Table in Root Context Table
1574 - (Device #, Function #) indexes Device Protection Domain in Bus
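The requester ID lookup can be illustrated with a short Python sketch. The field layout (8-bit bus, 5-bit device, 3-bit function) is the standard PCI one; the nested dictionaries standing in for the root and context tables, and the function names, are assumptions of this sketch.

```python
def split_requester_id(rid):
    """16-bit PCI requester ID -> (bus, device, function)."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("requester ID is 16 bits")
    return rid >> 8, (rid >> 3) & 0x1F, rid & 0x7


def lookup_domain(root_table, rid):
    """Bus number selects the context table, (device, function) selects the domain."""
    bus, dev, fn = split_requester_id(rid)
    context_table = root_table[bus]
    return context_table[(dev, fn)]
```

For example, requester ID `0x03A1` is bus 3, device 20, function 1.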
1577 Device / Protection Domain Mapping
1578 ----------------------------------
1580 .. figure:: device-domain-mapping.svg
1583 Virtual DMA Address Translation
1584 -------------------------------
1586 - DVA ↔ MPA Page Tables similar to IA-32 processor Page Tables
1588 - 4KB or larger page size granularity
1590 - Read/Write permissions
1592 - Protection Domains managed by VMM
1594 - Initialized at VM creation time
1595 - With the same translations as the VM's Extended Page Tables
1597 VMM and Directed I/O
1598 --------------------
1600 - Unplugs the assigned PCI device from the VMM driver and resets it
1602 - Associates PCI device with VT-d Protection Domain of the Guest VM
1604 - Maps device memory BARs in Guest VM physical space
1606 - Arranges for OS of Guest VM to probe PCI device(s) assigned to it
1608 - Handles device interrupts and redirects them to the Guest VM
1610 - Resets the assigned PCI device upon Guest VM shutdown
1612 .. Device Virtualization
1613 =====================
1615 Device Virtualization Principles
1616 --------------------------------
1618 - Share I/O device among multiple Guest VMs
1620 - With no performance loss
1621 - While enforcing VM isolation and protection
1623 - Move device virtualization from the VMM to the device itself
1627 - PF/VF requires support from the device
1629 Ethernet Device Virtual Functions
1630 ---------------------------------
1632 .. figure:: ethernet-dev-virt.svg
1634 Single Root I/O Virtualization
1635 ------------------------------
1637 - SR-IOV capable PCI Device can be partitioned into multiple Virtual
1640 - SR-IOV Device appears in PCI configuration space as multiple PCI
1643 - Virtual Functions are "lightweight" PCI functions including
1645 - PCI probing capabilities
1649 - Requires VT-d for DMA virtualization
1651 - Virtual Functions have no configuration resources
1653 SR-IOV Device Management
1654 ------------------------
1656 - VMM manages the physical PCI device
1658 - VMM creates a PCI Virtual Function for each VM
1660 - Includes it into VM PCI configuration space to be probed by OS kernel
1662 - Associates VF with VT-d Protection Domain of the Guest VM
1664 - VMM programs the sharing of physical device resources between VFs
1666 - Virtual Functions managed by specific VF-aware drivers in kernel of
1667 Guest OS (kind of Para-Virtualization)
1669 Intel Niantic Virtual Functions (1)
1670 -----------------------------------
1672 .. figure:: eth-sr-iov.svg
1675 Intel Niantic Virtual Functions (2)
1676 -----------------------------------
1678 - Virtual Devices on Intel Niantic (10 GbE) NICs
1680 - Layer-2 packet filtering based on destination MAC address
1682 - Filters multiple unicast MAC addresses / VLAN identifiers
1684 - Can duplicate Broadcast / Multicast packets for all VFs
1686 - Multiple RX queues per VF (RSS)
1688 - Load balancing between TX packets sent by VFs
1690 - Anti-Spoofing mechanism on transmission
1692 - Source MAC address
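The receive filtering and transmit anti-spoofing listed above can be sketched in Python. This is a behavioural model only, not the NIC's actual logic: `VfSwitch` is a hypothetical class, each VF has a single MAC address, and VLAN filtering is left out.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class VfSwitch:
    """Layer-2 sorter placed in front of the Virtual Functions."""

    def __init__(self, vf_macs):
        self.vf_macs = vf_macs  # VF index -> MAC address assigned by the VMM

    def rx_targets(self, dst_mac):
        """VFs that receive a frame with this destination MAC."""
        if dst_mac == BROADCAST:
            return sorted(self.vf_macs)  # duplicated to every VF
        return [vf for vf, mac in sorted(self.vf_macs.items()) if mac == dst_mac]

    def tx_allowed(self, vf, src_mac):
        """Anti-spoofing: a VF may only send with its own source MAC."""
        return self.vf_macs.get(vf) == src_mac
```

A frame for an unknown unicast address matches no VF, and a guest that forges another VF's source address has its frame rejected on transmission.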
1695 Pros/Cons of I/O hardware virtualization
1696 ----------------------------------------
1698 - Improves I/O performance on physical devices directly managed by Guest VMs
1700 - Only useful in specific configurations
1702 - PCI device Virtual Functions intended to scale, but require locking
1703 of total VM physical memory into machine physical memory
1705 - Not compatible with transparent VM migration
1707 Conclusion / Evolution of Virtualization
1708 ========================================
1713 - Emulation: slow, multi-arch, simulates ISA (full machine) or ABI
1715 - Accelerated emulation: faster, code is executed natively, overhead
1716 for privileged actions
1717 - Virtual servers: fast and scalable, but same OS and one kernel
1718 - Paravirtualization: fast, needs a modified OS (or drivers)
1719 - HW-assisted virtualization: solves most of the issues
1723 - "Needs a modified OS" is not true for devices
1725 Evolutions of Virtualization
1726 ----------------------------
1730 - Large amounts of data
1731 - Virtualization brings flexibility to the data center
1733 - Operating systems in browsers?
1735 - State of OS is stored remotely
1737 - Virtualization on desktops and small devices
1739 - Security (isolates work and personal areas)