2 .. System Virtualization and OS Virtual Machines slides file, created by
3 hieroglyph-quickstart on Mon Oct 28 09:39:30 2013.
5 =============================================
6 System Virtualization and OS Virtual Machines
7 =============================================
10 :Authors: Ivan Boule, Olivier Matz
18 - History of Virtualization
19 - Virtualization Usage and Taxonomy
20 - Process Level Virtualization
25 - System Level Virtualization
27 - Transparent Hardware Emulation
28 - Transparent Hardware Virtualization
30 - Hardware-Assisted Virtualization
37 - Olivier MATZ ``<olivier.matz@6wind.com>``
- Software engineer at 6WIND for 10 years
39 - 6WIND is a software company designing high performance network
42 - http://www.6wind.com
44 - I'm mainly developing low-level code: Linux kernel, drivers and
45 applications close to the operating system
47 History of Virtual Machines
48 ===========================
50 Sixties: introduction of IBM/370 series
51 ---------------------------------------
53 - Generalization of virtual memory
- Microprogramming of instructions on small models
57 .. figure:: ibm370.jpg
- IBM/370: generalization of virtual memory
- IBM/370: microprogramming of some instructions on the small models
- IBM/370: the CP/CMS hypervisor (Control Program/Conversational
  Monitoring System) managed virtual machines that could run CMS, DOS
  or OS guests interchangeably. It was offered to customers for the
  duration of their DOS to OS migrations, and was often kept
  afterwards because CMS was so convenient as a time-sharing system.
- The VM/370 product, created by IBM in the 1970s, let several users
  time-share a computer running the IBM DOS operating system. IBM DOS
  alone did not offer time sharing.
- time sharing between VMs
82 - Many logical machines in one physical machine
83 - High level (virtual) ISA including I/Os (TIMI)
85 - Take advantage of advances in hardware without recompilation
86 - User binaries contain both TIMI instructions and machine instructions
87 - Easier transition to PowerPC
89 .. figure:: ibm-as400.jpg
- IBM/AS-400: a minicomputer in the IBM range, late 1980s
- IBM/AS-400: several logical machines can be "carved out" of one
  physical machine.
- IBM/AS-400: a program does not talk to the hardware directly; it
  uses a high-level instruction set (ISA), which makes the program
  independent of the CPU it runs on. This eased the transition to
  PowerPC.
- http://en.wikipedia.org/wiki/IBM_System_i
- XXX IBM/AS-400: why "co-designed VM"? XXX look it up online
- emulation of the "high-level" CPU instructions
  XXX check how it works: is it a hypervisor or
107 Nineties and later: application VMs
108 -----------------------------------
115 - a Java program is compiled into a portable bytecode
- the JVM is an abstract computer that is able to run this bytecode
118 - Microsoft Common Language Infrastructure (.Net)
122 - http://en.wikipedia.org/wiki/Java_virtual_machine
124 Now: OS virtual machines
125 ------------------------
- Run a virtualized operating system on top of a virtual machine
130 - VMware products (virtualized PC on x86)
132 - Virtual PC (PC emulation on Mac OS/PowerPC)
- Many others: Bochs, VirtualBox, QEMU, ...
135 Virtualization Usages
136 =====================
138 System Virtualization Principles
139 --------------------------------
141 - Run multiple OS's on the same machine
- By design, an OS assumes it has full control over all physical
  resources of the machine
146 - Manage sharing/partitioning of machine resources between Guest OS's
149 - Physical memory & MMU
152 Goals of System Virtualization
153 ------------------------------
155 - Reduction of Total Cost of Ownership (TCO)
157 - Increase utilisation of server resources
158 - Spawn new servers "on demand" (ex: Amazon EC2 and Elastic Load
161 - Reduction of Total Cost of Functioning
167 - Isolation of OS for security purposes (Qubes, Cells)
- TCO + TCF reduction: discuss the data center case. Mention live
  migration, elasticity, ...
- a customer can create virtual machines on demand
- Elastic Load Balancer: ELBs spread the load across EC2 instances
- Autoscaling: automatically manages elasticity for one or more
  groups of EC2 instances
- CloudWatch: tracks and monitors metrics of EC2 instances in order
  to send notifications or take
184 - "qubes" (security) http://qubes-os.org/trac/wiki/QubesScreenshots
186 - Based on a secure bare-metal hypervisor (Xen)
187 - Networking code sand-boxed in an unprivileged VM (using IOMMU/VT-d)
188 - USB stacks and drivers sand-boxed in an unprivileged VM (currently
189 experimental feature)
190 - No networking code in the privileged domain (dom0)
191 - All user applications run in “AppVMs”, lightweight VMs based on
193 - Centralized updates of all AppVMs based on the same template
- Qubes GUI virtualization presents applications as if they were
196 - Qubes GUI provides isolation between apps sharing the same desktop
197 - Secure system boot based (optional)
Virtualization in high-throughput network equipment
---------------------------------------------------
202 .. figure:: high-thput.svg
- Initially, the system runs on several older boards (plus the
  management board running Linux). When the hardware is upgraded and
  the new board is powerful enough, the old boards can be
  virtualized without modifying the
  dataplane + control plane -> on a single board
- Recap what was said on the previous slide
217 Usages of Virtual Machines
218 --------------------------
220 - Server virtualization
224 - OS/kernel education & training
227 - OS kernel development
228 - Test machine = development host
230 - Keep backward compatibility of legacy software
232 - Run applications not supported by host OS
234 - OS migration without reinstalling it on a new hardware
- time sharing: we want to use several OSes on the same machine;
  this is analogous to running several processes.
- education & training: think of a lab session, as presented in the
  Linux Mag 140 article on libvirt: each student works on a
  preconfigured virtual machine. XXX to be reread
- backward compatibility: point out that this is useful when the
  hardware is no longer available, for example.
- run apps not supported by host OS: Wine
- Some services are only available at OS level (routing, filtering,
  ...). Having several OSes makes it possible to duplicate them
  (e.g. TCP daisy chain with VRs)
257 - Another example: one backup server to replace any machine
259 .. figure:: recovery.png
- Virtualization provides cheap high availability. It is often the
  software that crashes. A whole network architecture can be
  duplicated:
- A single backup server on the right for all the other servers.
  This avoids needing 8 machines. If one of the 4 fails, the one on
  the right takes over.
- indeed, each machine has its own system/network/filtering
  configuration... It is not necessarily easy to put the 4 services
  on the same machine without virtualization.
280 Multi-Core CPU Issues (1)
281 -------------------------
- No longer achieved through frequency/speed increases
286 - But obtained with higher density & multi-core chips
288 - Many RTOS designed with mono-processor assumption
290 - Adding multi-processor support is complex & costly
291 - Scaling requires time, at best...
293 - Legacy RT applications also designed for mono-processor
- Adaptation to multi-processor even more difficult than for the RTOS
- case of applications that are multi-threaded but designed with a
  single-core machine in mind. System virtualization lets these
  applications run in parallel on multi-core physical machines (each
  VM being single-core), as explained on the next slide.
- Many applications are still single-processor. This drastically
  simplifies coding: no race conditions, no need for locks/mutexes. XXX
- this problem arises less on a mainstream system than on older or
  real-time systems. Modern mainstream systems support multi-core
  very well, and it would be enough to launch several applications
  simultaneously. XXX
- some multi-threaded RT applications rely on the fact that there is
  only one CPU, and that 2 threads are never executed truly
  concurrently
317 Multi-Core CPU Issues (2)
318 -------------------------
- OS virtualization allows multiple instances of mono-processor OSes
  to run simultaneously on a multi-core CPU
323 - Each OS instance is run in a mono-processor
325 - Virtual Machine assigned to a single CPU core
327 - No need to change legacy software
329 - Scalability managed at virtualization level
- System virtualization makes it possible to run several instances
  of a non-SMP operating system on a multi-core processor.
- This can avoid rewriting software designed for a single-core
  machine. The software in question here is typically RT software or
  a kernel; for a standard application, the problem does not arise.
341 Virtualization Taxonomy
342 =======================
taxonomy = inventory
351 .. figure:: isa-abi.svg
354 - ISA = Instruction Set Architecture
356 - System level interface
357 - All CPU instructions, memory architecture, I/O
359 - ABI = Application Binary Interface
361 - Process level interface
- User-level non-privileged ISA instructions + OS system calls
- ISA: Instruction Set Architecture

  The CPU instructions (give examples such as MOV, or CLI/STI to
  mask interrupts), the peripherals, the MMU (how it must be
  configured), ...

  This is the interface used by the operating system.

- ABI: Application Binary Interface

  This is the interface that lets a process communicate with the
  outside world. It mainly consists of system calls (read, write,
  gettimeofday, execve, sleep).

  The ABI contains the non-privileged instructions + the OS API.
  Other instructions such as cli/sti are not part of the ABI.

- example of the compatibility layer for a 32-bit application
  running on a 64-bit kernel.
386 Virtualization Taxonomy
387 -----------------------
389 - Virtualization at process level (ABI)
391 - Emulation of Operating System ABI
394 - Virtualization at system level (ISA)
396 - Standalone vs Hosted Virtualization
397 - Machine Emulation vs Machine Virtualization
- a process already runs in a virtual machine provided by the OS,
  but not at the same level. Historically, the goal of a
  multitasking operating system is to provide virtual machines to
  applications (and thus to users). Each application "thinks" it is
  alone on the processor.

  Each application can access resources through system calls, as if
  it were the only one talking to the peripherals. It is up to the
  operating system to schedule the processes and their requests.

- system virtualization works on the same principle but at a
  different level. The next slides present the different types of
  virtualization (standalone vs hosted, and emulation vs
  virtualization).
417 Hosted versus Standalone Virtualization
418 ---------------------------------------
420 - Hosted Virtualization
422 - Hosted VM Monitor (VMM) runs on top of native OS
423 - VMware WKS, Microsoft VirtualPC, QEMU/KVM, UML
425 - Standalone Virtualization
427 - VMM directly runs on bare hardware
428 - VMware ESX, IBM/VM, Xen
- An OS running in a VM is called a Guest OS
- standalone = autonomous, smaller
- in general, "hosted" does not really access the hardware but
- the KVM case is ambiguous: the kernel running in root mode
  actually executes on the bare hardware.
446 Hosted Virtualization
447 ---------------------
449 .. figure:: hosted.svg
452 Example: VMware Workstation (1)
453 -------------------------------
455 .. figure:: vmware-wks.svg
458 Example: VMware Workstation (2)
459 -------------------------------
463 - Specific device drivers
465 - Guest OS executed in user mode
467 Standalone Virtualization
468 -------------------------
470 .. figure:: standalone.svg
473 Example: VMware ESX (1)
474 -----------------------
476 .. figure:: vmware-esx.svg
479 Example: VMware ESX (2)
480 -----------------------
483 - Supports unmodified OS binaries
485 - Configuration with appropriate device drivers
492 Process Level Virtualization: ABI Emulation
493 ===========================================
495 Process level ABI Emulation
496 ---------------------------
498 - Goal: execute binary applications of a given system **X** on the ABI of
501 - Emulate system **X** ABI on top of system *Y* ABI
503 - Emulation done by application-level code
505 - System *Y* must provide services equivalent to those of system
506 **X** (file system, sockets, etc...)
508 - Example: **X** = Windows and *Y* = Linux
- example: the Windows CreateFile() call would be emulated by an open()
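
The CreateFile()/open() example can be sketched in Python. This is only an illustration: ``create_file()`` is a hypothetical shim, it covers a small subset of the real parameters, and it is not how Wine is implemented. The constant values follow the Windows headers.

.. code-block:: python

   import os
   import tempfile

   # Windows-style constants (values as in the Windows headers).
   GENERIC_READ = 0x80000000
   GENERIC_WRITE = 0x40000000
   CREATE_ALWAYS = 2
   OPEN_EXISTING = 3


   def create_file(path, desired_access, creation_disposition):
       """Hypothetical shim: emulate a subset of CreateFile() with open()."""
       if desired_access & GENERIC_READ and desired_access & GENERIC_WRITE:
           flags = os.O_RDWR
       elif desired_access & GENERIC_WRITE:
           flags = os.O_WRONLY
       else:
           flags = os.O_RDONLY
       if creation_disposition == CREATE_ALWAYS:
           flags |= os.O_CREAT | os.O_TRUNC
       elif creation_disposition != OPEN_EXISTING:
           raise NotImplementedError("disposition not covered by this sketch")
       # Windows paths use backslashes; the host expects forward slashes.
       return os.open(path.replace("\\", "/"), flags, 0o644)


   def demo():
       """Write a file through the shim, then read it back."""
       path = os.path.join(tempfile.mkdtemp(), "hello.txt")
       fd = create_file(path, GENERIC_WRITE, CREATE_ALWAYS)
       os.write(fd, b"hello")
       os.close(fd)
       fd = create_file(path, GENERIC_READ, OPEN_EXISTING)
       data = os.read(fd, 16)
       os.close(fd)
       return data

The emulation lives entirely in application-level code: the host kernel only ever sees ordinary ``open()``, ``read()`` and ``write()`` calls.
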
515 Process Level (ABI) Emulators
516 -----------------------------
- Wine: run Windows applications on POSIX-compliant operating systems
521 - Windows API in userland
522 - Adobe Photoshop, Google Picasa, ...
524 - Cygwin: recompile POSIX applications so they can run under Windows
526 - Unix emulation on Windows
528 - Bash shell + many Unix commands
529 - GNU development tool chain (gcc, gdb)
530 - X Window, GNOME, Apache, sshd, ...
- **DEMO**: run a .exe with wine64
- the ABI depends on the operating system but also on the architecture.
- system calls differ between Linux and Windows
- and system calls are not invoked the same way on 2 different
  architectures. For example, on x86, an INT 0x80 is used (in fact
  SYSENTER nowadays), and the arguments are placed in specific
  registers
- Google Picasa for Linux bundles an embedded version of Wine
545 Process Level Cross-architecture Emulators
546 ------------------------------------------
548 - Emulate the Operating System ABI
550 - Emulated OS and native OS are the same (ex: both are linux)
551 - Emulated arch is different than native architecture (ex: x86 and
553 - Note: we define what is "emulation" later in the presentation
562 $ powerpc-linux-gnu-gcc -static hello.c
564 bash: ./a.out: cannot execute binary file
- for example, you get hold of a Freebox or a router based on MIPS
  or ARM, and you want to run and debug an application.
573 Process Level Virtualization: virtual servers
574 =============================================
579 - Single OS kernel / Multiple resource instances
581 - can run several linux distributions on the same kernel
583 - Isolated kernel execution environments
586 - Network: Routing table, IP tables, interfaces...
589 - Solaris 10 Containers
590 - LXC, Linux-VServer, openVZ: namespaces and cgroups
- all processes are seen by the kernel
- processes have different views of the operating system and are
  partitioned. They are unaware of the adjacent domains and have
  different views of the system (FS, network, ...).
- Linux namespaces are a good example (LXC, OpenVZ).
- XXX think about a demo... ?
- explain how this can be implemented in the kernel: an extra
  parameter for each system call
- mention that, security-wise, isolation is not quite there yet.
- see the drawing on the next slide
- signal -> table of process ?
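
The idea of "one kernel, an extra parameter per system call" can be illustrated with a toy model. Everything here (``ToyKernel``, the ``sys_*`` methods, the namespace names) is hypothetical and only meant to show the principle; real Linux namespaces are far more involved.

.. code-block:: python

   class ToyKernel:
       """Single kernel, several isolated views (hypothetical model)."""

       def __init__(self):
           self.hostnames = {}   # namespace id -> hostname
           self.processes = {}   # global pid -> namespace id

       def spawn(self, pid, ns):
           self.processes[pid] = ns
           self.hostnames.setdefault(ns, "localhost")

       # Each "system call" receives the caller's pid; the kernel looks up
       # the caller's namespace and uses it as an implicit extra parameter.
       def sys_sethostname(self, pid, name):
           self.hostnames[self.processes[pid]] = name

       def sys_gethostname(self, pid):
           return self.hostnames[self.processes[pid]]

       def sys_list_pids(self, pid):
           ns = self.processes[pid]
           return sorted(p for p, n in self.processes.items() if n == ns)


   kernel = ToyKernel()
   kernel.spawn(1, ns="web")
   kernel.spawn(2, ns="web")
   kernel.spawn(3, ns="db")
   kernel.sys_sethostname(1, "web-server")

The kernel still sees all three processes in ``kernel.processes``, while process 3 only sees itself and keeps its own hostname.
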
617 .. figure:: virtual-servers.svg
628 - Low memory footprint
635 - No OS heterogeneity
636 - Single OS binary instance (common point of failure)
638 System Level Virtualization: Transparent Hardware Emulation
639 ===========================================================
641 Transparent Hardware Emulation (1)
642 ----------------------------------
644 - Run unmodified OS binaries
646 - Includes emulation of physical devices
648 - Cross ISA Emulation
654 - VirtualBox (Intel x86)
656 Transparent Hardware Emulation (2)
657 ----------------------------------
659 - Emulate machine **X** on top of machine *Y*
661 - Interpretation: read, decode, execute
663 - 1 instruction of **X** executed by N instructions of *Y*
- Very slow method
666 - Dynamic Binary Translation
- Convert blocks of **X** instructions into *Y* instructions
- Conversion is done once per basic block
- Advanced: dynamic optimization of 'hot' blocks
672 - The emulator is usually a standard application running on a native
- Explain how an emulator can be implemented: it is a big
  switch/case; each instruction must be parsed and its behaviour
  emulated. The emulator must keep the register state in variables.
- This is why we end up translating blocks of code. Note that
  dynamic translation only happens on the fly; taking the binary,
  converting it and then executing it (static translation) is harder.
687 - https://en.wikipedia.org/wiki/Binary_translation
688 - Dynamic binary translation looks at a short sequence of
689 code—typically on the order of a single basic block—then
690 translates it and caches the resulting sequence.
691 - Code is only translated as it is discovered and when possible, and
692 branch instructions are made to point to already translated and
693 saved code (memoization).
694 - Apple Computer implemented a dynamic translating emulator for M68K
695 code in their PowerPC line of Macintoshes, which achieved a very
696 high level of reliability, performance and compatibility
697 - Intel: IA32 over Itanium
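
Both techniques can be sketched on a toy machine. The instruction set below (``movi``, ``add``, ``jnz``, ``halt``) is invented for the illustration and does not correspond to any real ISA. ``interpret()`` is the read/decode/execute loop, and ``run_translated()`` converts each basic block into host (Python) code once and caches it.

.. code-block:: python

   # Toy guest program: acc = 5 + 4 + 3 + 2 + 1
   PROGRAM = [
       ("movi", "acc", 0),
       ("movi", "n", 5),
       ("movi", "m1", -1),
       ("add", "acc", "n"),    # address 3: loop body
       ("add", "n", "m1"),
       ("jnz", "n", 3),
       ("halt",),
   ]


   def interpret(program, regs):
       """Interpretation: one big switch/case, register state in a dict."""
       pc = 0
       while True:
           op, *args = program[pc]              # read + decode
           if op == "movi":
               regs[args[0]] = args[1]
           elif op == "add":
               regs[args[0]] += regs[args[1]]
           elif op == "jnz":
               if regs[args[0]] != 0:
                   pc = args[1]
                   continue
           elif op == "halt":
               return regs
           else:
               raise ValueError("unknown instruction %r" % op)
           pc += 1


   def translate_block(program, start):
       """Translate one basic block into a host function returning the next pc."""
       lines = ["def block(r):"]
       pc = start
       while True:
           op, *args = program[pc]
           if op == "movi":
               lines.append(f"    r[{args[0]!r}] = {args[1]!r}")
           elif op == "add":
               lines.append(f"    r[{args[0]!r}] += r[{args[1]!r}]")
           elif op == "jnz":
               lines.append(
                   f"    return {args[1]} if r[{args[0]!r}] != 0 else {pc + 1}")
               break
           elif op == "halt":
               lines.append("    return None")
               break
           else:
               raise ValueError("unknown instruction %r" % op)
           pc += 1
       namespace = {}
       exec("\n".join(lines), namespace)        # "emit" the host code
       return namespace["block"]


   def run_translated(program, regs):
       cache = {}                               # translation cache: pc -> block
       translations = 0
       pc = 0
       while pc is not None:
           if pc not in cache:
               cache[pc] = translate_block(program, pc)
               translations += 1
           pc = cache[pc](regs)
       return regs, translations

The loop body runs five times but is translated only once; later iterations reuse the cached block.
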
699 QEMU: Hosted Hardware Emulator
700 ------------------------------
702 - Cross ISA Emulation
704 - Emulate machine **X** on top of machine *Y*
706 - Interpretation + translation
708 - Intel x86, PowerPC, ARM, Mips, Sparc, ...
710 - Emulation of SMP architectures
712 - Emulates physical I/O devices
714 - Hard Disk drives, CD-ROM, network controllers, USB controllers, ...
715 - Synchronous emulation of device I/O operations
- **DEMO**: run Kid Icarus with mednafen
720 - ``mednafen -vdriver sdl -nes.xscale 4 -nes.yscale 4 ~/cours_ivan/cours_virt/Kid\ Icarus\ \(Europe\)\ \(Rev\ A\).zip``
- see /usr/share/doc/mednafen/mednafen.html
722 - http://idoc64.free.fr/ASM/instruction.htm
- QSDZ = directions, ret=start, tab=select, OP=buttons
- Alt-D shows the debugger
- address A6 decreases when lives are lost
- shift-W: write breakpoint, R to run
- we can try to set a large value:
- does not work, because the value saturates
- shift-P (poke in ROM): ED45 60 1 (puts an RTS)
  this is the place where it saturates
- at address DB6C, A6 is stored after being hit
  LDA A6: loads the value
  BCS: branch on carry set (so if the result is < 0, it is set to 0)
- 7E42 0 1 -> puts 0 on the monsters' decrement
744 System Level Virtualization: Transparent Hardware Virtualization
745 ================================================================
747 Transparent Hardware Virtualization
748 -----------------------------------
750 - Guest and host architectures are the same
752 - Execute native/unmodified OS binary images
754 - Provide in each VM a complete simulation of hardware
756 - Full CPU instruction set
757 - Interrupts, exceptions
758 - Memory access and MMU
761 - Share machine resources among multiple VMs
- the slide describes the problem, which is the same as for emulation
- maybe also give examples such as kqemu or VirtualBox
  (acceleration modules). Also say that this still does not involve
  Intel VT, and that it is faster than emulation
- share machine resources: example of copy-on-write memory pages
773 Full CPU Virtualization (1)
774 ---------------------------
776 - Present same functional CPU to all Guest OSes
778 - VMM manages a CPU context for each vCPU of each VM
780 - saved copy of CPU registers
781 - representation of software-emulated CPU context
783 - VMM shares physical CPUs among all vCPU of VMs
785 - VMM includes a VM scheduler
- representation of software-emulated CPU context: for example,
  knowing whether interrupts are masked or not.
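
A minimal sketch of the per-vCPU context and the VM scheduler, assuming a simple round-robin policy. The class names, the register set and the fixed time slice are illustrative assumptions; guest execution is faked by advancing ``pc``.

.. code-block:: python

   from collections import deque


   class PhysicalCPU:
       def __init__(self):
           self.regs = {"pc": 0, "sp": 0}


   class VCPU:
       def __init__(self, name, pc):
           self.name = name
           self.saved_regs = {"pc": pc, "sp": 0}   # saved copy of CPU registers
           self.interrupts_masked = False          # software-emulated state


   def run_round_robin(cpu, vcpus, slices, slice_len=10):
       """Share one physical CPU among vCPUs, one time slice each in turn."""
       queue = deque(vcpus)
       trace = []
       for _ in range(slices):
           vcpu = queue.popleft()
           cpu.regs = dict(vcpu.saved_regs)        # restore the vCPU context
           cpu.regs["pc"] += slice_len             # stand-in for guest execution
           vcpu.saved_regs = dict(cpu.regs)        # save the context again
           trace.append(vcpu.name)
           queue.append(vcpu)
       return trace


   vcpu_a = VCPU("vm0.cpu0", pc=0)
   vcpu_b = VCPU("vm1.cpu0", pc=1000)
   trace = run_round_robin(PhysicalCPU(), [vcpu_a, vcpu_b], slices=4)

Each vCPU resumes exactly where it stopped, so a guest cannot tell that the physical CPU ran another VM in between.
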
795 Full CPU Virtualization (2)
796 ---------------------------
798 - Relationships between a VMM and VMs similar to relationships between
799 native OS and applications
801 - Guarantee mutual isolation between all VMs
802 - Protect VMM from all VMs
804 - Directly execute native binary images of Guest OS's in
807 - VMM emulates access to protected resources performed by Guest OSs
812 - Run each Guest OS in non-privileged mode
814 .. figure:: cpu-virt.svg
817 "Hardware-Sensitive" Instructions
818 ---------------------------------
820 - Interact with protected hardware resources
822 - Privileged Instructions (cannot be executed in user mode)
823 - Critical Instructions (can be, but should not be executed by Guest OS)
825 - Must be detected and faked by VMM
827 - Dynamic Binary Translation of kernel code
829 - Done once, saved in Translation Cache
- privileged instructions: e.g. masking interrupts
- critical instructions: e.g. reading the status flags, CR3, ...
838 Privileged Instructions Virtualization
839 --------------------------------------
841 - Only allowed in supervisor mode
843 - Ex: **cli/sti** to mask/unmask interrupts on Intel x86
845 - When executed in non-privileged mode
847 - CPU automatically detects a privilege violation
848 - Triggers a “privilege-violation” exception
850 - Caught by VMM which fakes the expected effect of the privileged
855 - VMM does not mask/unmask CPU interrupts
856 - records « interrupt mask status » in context of VM
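
The trap-and-emulate cycle for ``cli``/``sti`` can be modelled as follows. This is a simplified sketch: the CPU is a plain function, the exception class and ``VMContext`` are made-up names, and only the two instructions from the slide are handled.

.. code-block:: python

   class PrivilegeViolation(Exception):
       """Raised by the modelled CPU when user mode runs a privileged instruction."""

       def __init__(self, instruction):
           super().__init__(instruction)
           self.instruction = instruction


   class VMContext:
       def __init__(self):
           self.interrupts_masked = False   # virtual interrupt mask status


   def cpu_execute_user_mode(instruction):
       """Model of the CPU: privileged instructions fault in user mode."""
       if instruction in ("cli", "sti"):
           raise PrivilegeViolation(instruction)
       # Any other instruction would simply run natively.


   def vmm_run(guest_code, vm):
       """Run guest code in non-privileged mode and emulate the trapped instructions."""
       trapped = 0
       for instruction in guest_code:
           try:
               cpu_execute_user_mode(instruction)
           except PrivilegeViolation as fault:
               trapped += 1
               # The real CPU interrupt flag is left untouched;
               # only the VM context is updated.
               if fault.instruction == "cli":
                   vm.interrupts_masked = True
               elif fault.instruction == "sti":
                   vm.interrupts_masked = False
       return trapped
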
858 Critical Instructions Virtualization (1)
859 ----------------------------------------
861 - Hardware-sensitive instructions
863 - Ex: Intel IA-32 pushf/popf::
   pushf           /* save EFLAGS reg. to stack */
   cli             /* mask interrupts => clear EFLAGS.IF */
   popf            /* restore EFLAGS reg. => unmask interrupts */
870 - When executed in non-privileged mode
- The cli instruction triggers an exception caught by the VMM => the VMM
  records interrupts as masked for the current VM
875 - But no exception for popf => VMM not aware of Guest OS action
- first problem: pushf is allowed and always pushes flags saying
  that interrupts are enabled
- popf must also be intercepted because we need to update the
885 Critical Instructions Virtualization (2)
886 ----------------------------------------
888 - Must be detected and emulated by VMM
890 - VMM dynamically analyses Guest OS binary code to find critical instructions
892 - VMM replaces critical instructions by a « trap » instruction to enter the VMM
894 - VMM emulates expected effect of critical instruction, if any.
- XXX does the translation have to be done only on code that is
  meant to run in ring 0?
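
A toy model of this detect-and-replace step, assuming guest code is a list of mnemonics. The ``("trap", ...)`` tuple stands in for the trap instruction, and ``VM`` and ``run()`` are hypothetical names. Real VMMs work on binary code, which is much harder.

.. code-block:: python

   CRITICAL = {"pushf", "popf"}


   def patch_critical(code):
       """Replace each critical instruction by a trap into the VMM."""
       return [("trap", ins) if ins in CRITICAL else ins for ins in code]


   class VM:
       def __init__(self):
           self.virtual_if = True   # the guest's view of EFLAGS.IF
           self.stack = []


   def run(code, vm):
       for ins in code:
           if isinstance(ins, tuple):            # trap: emulated by the VMM
               _, original = ins
               if original == "pushf":
                   vm.stack.append(vm.virtual_if)
               elif original == "popf":
                   vm.virtual_if = vm.stack.pop()
           elif ins == "cli":                    # privileged: faults, then emulated
               vm.virtual_if = False
           elif ins == "sti":
               vm.virtual_if = True
           # Other instructions run natively and do not involve the VMM.


   patched = patch_critical(["pushf", "cli", "nop", "popf"])

After patching, ``pushf`` saves the virtual interrupt flag rather than the real one, and ``popf`` restores it, so the VMM always knows whether the guest believes interrupts are masked.
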
902 Full Memory Virtualization
903 --------------------------
- CPUs include a Memory Management Unit (MMU)
907 - Isolated memory addressing spaces
- Independent of underlying physical memory layout
909 - Run mutually protected applications in parallel
911 - Virtual Memory managed by OS kernel
913 - Provides a virtual address space to each process
915 - 4 GB on most 32-bit architectures (Intel x86, PowerPC)
917 - Manages virtual page → physical page mappings
918 - Manages « swap » space to extend physical memory
- the MMU is a hardware component
924 Reminder about MMU (1)
925 ----------------------
927 - Here is a minimal code example:
931 # a program that takes x and y in memory, and
   movl 0x200000, %eax    # retrieve <x> in eax
   movl 0x200004, %ebx    # retrieve <y> in ebx
   addl %ebx, %eax        # compute x+y in eax
   movl %eax, 0x200008    # save the result in memory
938 - This program can run on one cpu
- If the addresses are physical, it is not possible to run multiple
  instances of this program as they would modify the same memory
- a poor solution is to modify the binary at each execution
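
How an MMU solves this can be sketched with a single-level page table per process. The 4 KB page size matches x86, but the dictionary-based page table and the frame numbers are simplifying assumptions. Both processes run the same program with the same virtual addresses and end up in different physical frames.

.. code-block:: python

   PAGE_SIZE = 4096


   def translate(page_table, vaddr):
       """Virtual address -> physical address using one process's page table."""
       vpn, offset = divmod(vaddr, PAGE_SIZE)
       if vpn not in page_table:
           raise LookupError("page fault at 0x%x" % vaddr)
       return page_table[vpn] * PAGE_SIZE + offset


   def run_program(page_table, memory):
       """Same program as above: read x and y, store x + y."""
       x = memory[translate(page_table, 0x200000)]
       y = memory[translate(page_table, 0x200004)]
       memory[translate(page_table, 0x200008)] = x + y


   # Physical memory, indexed by physical address.
   memory = {0x10000: 1, 0x10004: 2, 0x11000: 10, 0x11004: 20}

   page_table_a = {0x200: 0x10}   # virtual page 0x200 -> physical frame 0x10
   page_table_b = {0x200: 0x11}   # same virtual page, different frame

   run_program(page_table_a, memory)
   run_program(page_table_b, memory)
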
946 Reminder about MMU (2)
947 ----------------------
949 .. figure:: mmu-slide1.svg
952 Reminder about MMU (3)
953 ----------------------
955 .. figure:: mmu-slide2.svg
958 Reminder about MMU (4): Intel x86 MMU
959 -------------------------------------
964 Memory Virtualization (1)
965 -------------------------
967 .. figure:: mmu-slide3.svg
970 Memory Virtualization (2)
971 -------------------------
973 - Machine Physical Memory
975 - Physical memory available on the machine
977 - Guest OS Physical Memory
979 - Part of machine memory assigned to a VM by VMM
981 - ∑ Guest Physical Memory can be > Machine Memory
983 - VMM uses « swap » space
985 - Guest OS Virtual Memory
987 - Guest OS manages virtual address spaces of its processes
989 Memory Virtualization (3)
990 -------------------------
992 - Guest OS manages Guest Physical Pages
994 - Manages MMU with its own page entries
995 - Translates Virtual Addresses into Guest Physical Addresses (GPA)
997 - VMM transparently manages Machine Physical Pages
999 - Guest Physical Address ≠ Machine Physical Address
1000 - VMM dynamically translates Guest Physical Pages into Machine
1003 Memory Virtualization (4)
1004 -------------------------
1006 .. figure:: mmu-slide4.svg
- switch to an animated view, explain how translations are done,
  talk about the TLB
  show the chronological order of events
- zoom out once, static view
  virtual memory vs VM memory vs host physical memory
  no MMU in this case
- show dynamically, with a single MMU, how the hypervisor configures
  the MMU
  use the same colours for the memory types
  we will hijack the MMU to do the translation that we
- put a number in CR3
- put vertical bars in the page tables
- TLB as a table with empty rows
- zoom on the PTEs on the right
- show the addresses
- separate the values of virtual and physical addresses, use
  different colours for these addresses
- see whether we can show only the 20 significant bits and not the
  12 offset bits when talking about addresses
- When the guest accesses CR3 (or a PTE), this generates a fault,
  handled by the VMM. The VMM translates the address given by the
  VM's OS and loads the CR3 register with the physical address of
  the area used by the VM to store its page tables. Every access to
  this page table must generate a fault so that the VMM is notified
  of any change and can configure the real MMU accordingly (by
  performing the address translation). (slide 47)
- Reading CR3 does not generate a trap, so it must be handled like
  pushf and popf, i.e. with code translation.
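
The two translation levels can be written down explicitly. This is a sketch with single-level tables and made-up page numbers. ``guest_page_table`` is what the Guest OS manages, and ``vmm_p2m`` is a hypothetical name for the VMM's guest-physical to machine mapping.

.. code-block:: python

   PAGE_SIZE = 4096


   def walk(table, addr):
       """Translate an address through one single-level page table."""
       page, offset = divmod(addr, PAGE_SIZE)
       return table[page] * PAGE_SIZE + offset


   guest_page_table = {0x200: 0x5}   # guest virtual page -> guest physical page
   vmm_p2m = {0x5: 0x9A}             # guest physical page -> machine page


   def guest_virtual_to_machine(gva):
       gpa = walk(guest_page_table, gva)   # done by the Guest OS page tables
       return walk(vmm_p2m, gpa)           # done transparently by the VMM


   machine_addr = guest_virtual_to_machine(0x200004)

The 12-bit page offset (``0x004`` here) is preserved across both levels; only the page numbers change.
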
1048 Memory Virtualization (5)
1049 -------------------------
1051 - VMM maintains Shadow Page Tables
1053 - Copies of Guest OS translation tables
- VMM catches update operations on translation tables performed by a
1058 - RW-protect all guest OS page tables
1059 - Emulates operation in shadow page table
1060 - Updates effective MMU page table entry, if needed
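
A sketch of how a shadow page table follows the guest's updates. The ``ShadowMMU`` class and its method names are hypothetical. In a real VMM these handlers would be reached through write-protection faults on the guest page tables.

.. code-block:: python

   class ShadowMMU:
       def __init__(self, p2m):
           self.p2m = p2m        # guest physical page -> machine page
           self.guest_pt = {}    # what the Guest OS believes (VPN -> GPPN)
           self.shadow_pt = {}   # what the real MMU uses (VPN -> MPN)

       def on_guest_pte_write(self, vpn, gppn):
           """The guest wrote a PTE: emulate the write, then update the shadow."""
           self.guest_pt[vpn] = gppn
           self.shadow_pt[vpn] = self.p2m[gppn]

       def on_guest_pte_clear(self, vpn):
           """The guest unmapped a page: the shadow entry must go as well."""
           self.guest_pt.pop(vpn, None)
           self.shadow_pt.pop(vpn, None)


   mmu = ShadowMMU(p2m={0x5: 0x9A, 0x6: 0x9B})
   mmu.on_guest_pte_write(vpn=0x200, gppn=0x5)

The guest only ever reads back guest physical page numbers from ``guest_pt``, while the hardware walks ``shadow_pt``, which holds machine page numbers.
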
1062 Memory Virtualization (6)
1063 -------------------------
- TLB entries can be tagged with a context ID
- Avoids flushing the TLB when switching the current address space
  upon scheduling of a new process
- usually tag = OS process identifier
1072 - Processes of different Guest OSes can be assigned the same Process
1075 - VMM must flush TLB when switching VMs
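
Why the flush is needed can be shown with a toy tagged TLB. The class and the tag values are invented for the illustration: two Guest OSes both use tag 1 for one of their processes.

.. code-block:: python

   class TaggedTLB:
       def __init__(self):
           self.entries = {}                     # (tag, vpn) -> page frame

       def insert(self, tag, vpn, frame):
           self.entries[(tag, vpn)] = frame

       def lookup(self, tag, vpn):
           return self.entries.get((tag, vpn))   # None means TLB miss

       def flush(self):
           self.entries.clear()


   def switch_vm(tlb, flush):
       """VM switch performed by the VMM, with or without a TLB flush."""
       if flush:
           tlb.flush()


   tlb = TaggedTLB()
   tlb.insert(tag=1, vpn=0x200, frame=0x10)      # VM A, process tagged 1

   switch_vm(tlb, flush=False)
   stale = tlb.lookup(tag=1, vpn=0x200)          # VM B, also tagged 1: wrong hit

   switch_vm(tlb, flush=True)
   after_flush = tlb.lookup(tag=1, vpn=0x200)    # miss, page tables are walked

Without the flush, VM B would silently use a frame that belongs to VM A.
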
1077 Memory Virtualization (7)
1078 -------------------------
1080 - VMM must respect Guest OS virtual page faults
1082 - Not map virtual pages unmapped by Guest OS
1083 - When Guest OS unmaps a virtual page:
1085 - VMM must delete the associated real-page/physical page
1088 - Conversely, VMM can transparently:
1090 - Introduce & resolve real-page faults for Guest OSes
1091 - Share physical pages between Guest OS's
- Pages with the same contents (e.g. zeroed pages)
1095 Memory Virtualization (8)
1096 -------------------------
1098 - VMM can swap real pages of a VM
1100 - on "swap" space managed by VMM
1102 - VMM can dynamically distribute physical memory among VMs
1104 - Needs a specific support in Guest OS (Linux module)
1106 - VMM asks Guest OS to release memory
1108 - Guest OS self-allocates real pages
- no longer available for normal kernel allocation service
1110 - VMM assigns same amount of physical pages to other VM's
- ballooning: a kernel module runs in the guests and communicates
  with the VMM. If the VMM needs physical memory for another VM, it
  can ask the module to allocate memory, which is then lost to the
  other services. This memory is
- more details and sources needed on this
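
The ballooning exchange can be modelled with simple page counters. This is a sketch under strong assumptions: ``Guest``, ``inflate()`` and ``rebalance()`` are hypothetical names, and memory is reduced to a number of pages.

.. code-block:: python

   class Guest:
       def __init__(self, name, pages):
           self.name = name
           self.free_pages = pages   # pages usable by the guest kernel
           self.balloon = 0          # pages pinned by the balloon driver

       def inflate(self, n):
           """Balloon driver: self-allocate up to n pages for the VMM."""
           taken = min(n, self.free_pages)
           self.free_pages -= taken
           self.balloon += taken
           return taken


   def rebalance(donor, receiver, n):
       """VMM side: reclaim pages from one VM and give them to another."""
       reclaimed = donor.inflate(n)
       # Stand-in for the VMM backing more of the receiver's memory.
       receiver.free_pages += reclaimed
       return reclaimed


   vm0 = Guest("vm0", pages=100)
   vm1 = Guest("vm1", pages=100)
   moved = rebalance(vm0, vm1, 30)

The donor's kernel decides by itself which pages to give up, which is why this needs cooperation from a module inside the Guest OS.
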
1121 System Level Virtualization: Paravirtualization
1122 ===============================================
1124 CPU Paravirtualization
1125 ----------------------
1127 - Still run each Guest OS in non-privileged mode, but with minimal
1128 virtualization overhead
1130 - OS adaptation to avoid binary translation overhead
1132 - Remove Hardware-Sensitive Instructions, use fast VMM system calls
1133 - Minimize/avoid usage of Privileged Instructions
- Only affects the Machine/CPU dependent part of the OS
- OS port to a new architecture with the same CPU, without system ISA
1139 - Examples: Xen legacy, User Mode Linux (UML), CoLinux
1141 I/O Paravirtualization
1142 ----------------------
1144 - Multiplexing VMM physical devices among VMs
1146 - Front-end driver in Guest OS
1147 - Back-end driver in VMM
1148 - Virtual ethernet, virtual disks
1150 - Fast virtual devices for VM to VM communications
1154 - Data transfer through syscalls, shmem, rings, ...
1155 - Pros: scalability, VM migration
1160 .. figure:: virt-devices.svg
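
The front-end/back-end split can be sketched with a shared request/response ring. All names here are illustrative and the "disk" is a dictionary. Real implementations use shared memory pages and notifications between the VM and the VMM.

.. code-block:: python

   from collections import deque


   class SharedRing:
       """Bounded request queue and response queue shared by both drivers."""

       def __init__(self, size):
           self.size = size
           self.requests = deque()
           self.responses = deque()


   class FrontendDisk:
       """Runs in the Guest OS."""

       def __init__(self, ring):
           self.ring = ring

       def submit_read(self, sector):
           if len(self.ring.requests) >= self.ring.size:
               return False                  # ring full, retry later
           self.ring.requests.append(("read", sector))
           return True

       def collect(self):
           done = list(self.ring.responses)
           self.ring.responses.clear()
           return done


   class BackendDisk:
       """Runs in the VMM and owns the physical device."""

       def __init__(self, ring, sectors):
           self.ring = ring
           self.sectors = sectors

       def process(self):
           while self.ring.requests:
               _, sector = self.ring.requests.popleft()
               self.ring.responses.append((sector, self.sectors[sector]))


   ring = SharedRing(size=2)
   front = FrontendDisk(ring)
   back = BackendDisk(ring, sectors={0: b"boot", 1: b"data"})

The Guest OS never touches the device itself: it only fills the ring, and the back-end multiplexes the physical device among all the rings it serves.
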
1163 Paravirtualization Example: Xen Legacy
1164 --------------------------------------
1169 - Share resources of Server machines
1171 - Intel IA-32, x86-64, ARM, ...
1173 - Special first Guest OS called Domain 0
- Runs in privileged mode
- Has access to (and manages) all physical devices
1177 - Modified version of Linux, FreeBSD
XXX double-check the Domain 0 part
1183 System Level Virtualization: Hardware-Assisted Virtualization
1184 =============================================================
1186 Hardware Assisted Virtualization (1)
1187 ------------------------------------
1189 - Support of Virtualization in Hardware
1190 - Run unmodified OS binaries
1191 - With minimal virtualization overhead
1192 - Simplify VMM development
1198 Hardware Assisted Virtualization (2)
1199 ------------------------------------
1201 - CPU virtualization
1204 - Intel VT-x (x86), Intel VT-i (Itanium) architectures
1207 - MMU virtualization
1209 - Intel Extended Page Tables (EPT)
1210 - AMD Nested Page Tables (NPT)
1212 Hardware Assisted Virtualization (3)
1213 ------------------------------------
1215 - Directed I/O virtualization
1217 - IO-MMU (Intel VT-d)
1219 - I/O Device virtualization
1221 - Self-Virtualizing devices
1222 - Single Root I/O Virtualization and Sharing Specification (SR-IOV)
1223 - Extensions to PCIe (PCI Express) Bus standard
1225 Intel VT-x Architecture
1226 -----------------------
1228 - Support unmodified Guest OS with no need for paravirtualization
1229 and/or binary code translation
- Simplify VMM tasks & improve VMM performance
1233 - Minimize VMM memory footprint
1235 - Suppress shadowing of Guest OS page tables
1237 - Enable Guest OS to directly manage I/O devices
- Without performance loss
1240 - While enforcing VM isolation and mutual protection
1242 Intel VT-x Architecture Overview
1243 --------------------------------
1245 .. figure:: vt-x.svg
1248 Intel VT-x CPU Virtualization (1)
1249 ---------------------------------
1251 - Virtual Machine eXtension (VMX)
1253 - Two new meta-modes of CPU operation
1257 - Behaviour similar to IA-32 without VT
1258 - Intended for VMM execution
1262 - Alternative IA-32 execution environment
1263 - Controlled by a VMM
1264 - Designed to run unchanged Guest OS in a VM
1266 - Both modes support rings 0-3 privilege levels
1268 - Allow VMM to use several privilege levels
1270 Intel VT-x CPU Virtualization (2)
1271 ---------------------------------
1273 - Two additional CPU mode transitions
1275 - From VMX root-mode to VMX non-root mode
- Named VM Enter (VMLAUNCH or VMRESUME instruction)
1279 - From VMX non-root mode to VMX root mode
1281 - Named VM Exit (event)
1283 - VM entries & VM exits use a new data structure
1285 - Virtual Machine Control Structure (VMCS) per VM CPU (vCPU)
- Referenced by its physical memory address
1287 - Format and layout hidden
1288 - New VT-x instructions to access a VMCS
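
The two transitions and the role of the VMCS can be illustrated with a toy model. This is only a sketch under simplifying assumptions: ``Vmcs`` and ``ToyCpu`` are hypothetical names, the register set is reduced to two entries, and the real VMCS is accessed through dedicated VT-x instructions rather than as a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class Vmcs:
    """Toy per-vCPU control structure; the real layout is hidden by the CPU."""
    guest_state: dict = field(default_factory=dict)
    host_state: dict = field(default_factory=dict)
    exit_reason: str = ""


class ToyCpu:
    def __init__(self):
        self.mode = "root"
        self.regs = {"rip": 0x1000, "cr3": 0xA000}   # VMM context

    def vm_enter(self, vmcs):
        """Root -> non-root: load the guest state from the VMCS."""
        self.regs = dict(vmcs.guest_state)
        self.mode = "non-root"

    def vm_exit(self, vmcs, reason):
        """Non-root -> root: save guest state, load host state, record the reason."""
        vmcs.guest_state = dict(self.regs)
        self.regs = dict(vmcs.host_state)
        vmcs.exit_reason = reason
        self.mode = "root"


cpu = ToyCpu()
vmcs = Vmcs(guest_state={"rip": 0x7C00, "cr3": 0xB000},
            host_state={"rip": 0x1000, "cr3": 0xA000})   # filled in by the VMM
cpu.vm_enter(vmcs)
cpu.regs["rip"] += 2                 # the guest executes something
cpu.vm_exit(vmcs, "io-instruction")
```

After the exit, the VMM runs again with its own registers, and the guest can later resume from the state saved in ``vmcs.guest_state``.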
1290 Intel VT-x CPU Virtualization (3)
1291 ---------------------------------
- Values of registers saved before being changed by
  VM Exits (e.g. Segment Registers, CR3, IDTR)
1298 - Hidden CPU state (e.g., CPU Interruptibility State)
1304 - Interrupt Virtualization
1305 - Exceptions bitmaps
1307 - Model Specific Register R/W bitmaps
1308 - Execution rights for CPU Privileged Instructions
- The host-state area is where the processor state of the VMM is
  stored. It is restored on VM exit.
1315 - Switching from root mode to non-root mode is called "VM entry", the
1316 switch back is "VM exit". The VMCS includes a guest and host state
1317 area which is saved/restored at VM entry and exit. Most importantly,
1318 the VMCS controls which guest operations will cause VM exits.
1320 The VMCS provides fairly fine-grained control over what the guests
1321 can and can't do. For example, a hypervisor can allow a guest to
1322 write certain bits in shadowed control registers, but not
1323 others. This enables efficient virtualization in cases where guests
1324 can be allowed to write control bits without disrupting the
1325 hypervisor, while preventing them from altering control bits over
1326 which the hypervisor needs to retain full control. The VMCS also
1327 provides control over interrupt delivery and exceptions.
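
The per-bit control over shadowed control registers described above can be sketched for CR0. The sketch assumes the usual VT-x scheme of a guest/host mask plus a read shadow; the function names are hypothetical, and only three CR0 bits are modelled.

```python
CR0_PE = 1 << 0    # protected mode enable
CR0_TS = 1 << 3    # task switched
CR0_PG = 1 << 31   # paging


def guest_cr0_read(real_cr0, mask, read_shadow):
    """Host-owned bits (set in mask) read from the shadow, the others from the real CR0."""
    return (read_shadow & mask) | (real_cr0 & ~mask)


def guest_cr0_write_exits(new_value, mask, read_shadow):
    """A write exits if it changes any host-owned bit relative to the read shadow."""
    return ((new_value ^ read_shadow) & mask) != 0


mask = CR0_PE | CR0_PG          # the hypervisor keeps control of PE and PG
shadow = CR0_PE | CR0_PG        # what the guest believes these bits are
real = CR0_PE | CR0_PG

toggle_ts = guest_cr0_write_exits(shadow | CR0_TS, mask, shadow)    # guest-owned bit
clear_pg = guest_cr0_write_exits(shadow & ~CR0_PG, mask, shadow)    # host-owned bit
```

Here the guest may flip ``TS`` without any exit, while an attempt to clear ``PG`` is handed to the hypervisor.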
1329 Whenever an instruction or event causes a VM exit, the VMCS contains
1330 information about the exit reason, often with accompanying
1331 detail. For example, if a write to the CR0 register causes an exit,
1332 the offending instruction is recorded, along with the fact that a
1333 write access to a control register caused the exit, and information
1334 about source and destination register. Thus the hypervisor can
1335 efficiently handle the condition without needing advanced techniques
1336 such as CSAM and PATM described above.
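
As an illustration of how the exit information is used, the sketch below decodes a control-register access. The exit reason number and the field positions follow my reading of the Intel SDM and should be treated as assumptions to verify against the manual; ``handle_exit`` is a hypothetical dispatcher, not an existing API.

```python
EXIT_REASON_CR_ACCESS = 28            # basic exit reason "control-register accesses"
ACCESS_TYPES = ("mov-to-cr", "mov-from-cr", "clts", "lmsw")
GPR_NAMES = ("rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi") + tuple(
    "r%d" % n for n in range(8, 16))


def decode_cr_access(qualification):
    """Split the exit qualification of a control-register access into fields."""
    return {
        "cr": qualification & 0xF,                            # bits 3:0
        "access": ACCESS_TYPES[(qualification >> 4) & 0x3],   # bits 5:4
        "gpr": GPR_NAMES[(qualification >> 8) & 0xF],         # bits 11:8
    }


def handle_exit(reason, qualification):
    """Hypothetical VMM dispatcher: decide what to emulate from the exit data."""
    if reason == EXIT_REASON_CR_ACCESS:
        info = decode_cr_access(qualification)
        return "emulate {access} cr{cr} using {gpr}".format(**info)
    return "unhandled exit reason %d" % reason


# MOV CR0, RBX: control register 0, access type 0, register index 3
result = handle_exit(EXIT_REASON_CR_ACCESS, (3 << 8) | (0 << 4) | 0)
```

The hypervisor never has to disassemble the guest instruction: everything it needs is in the exit reason and qualification.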
VT-x inherently avoids several of the problems which software
virtualization faces. The guest has its own completely separate
address space not shared with the hypervisor, which eliminates
potential clashes. Additionally, guest OS kernel code runs at
privilege ring 0 in VMX non-root mode, obviating the problems caused
by running ring 0 code at less privileged levels. For example, the
SYSENTER instruction can transition to ring 0 without causing
problems. Naturally, even at ring 0 in VMX non-root mode, any I/O
access by guest code still causes a VM exit, allowing for device
emulation.
- All visible processor state is saved into or restored from the VMCS
- all registers, including the control registers
- interruptibility state
- The VMCS defines what the VM is allowed to do
- I/O bitmaps = bit fields stating which I/O ports (in and out
  instructions) are allowed.
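
A minimal sketch of such an I/O bitmap follows. It assumes the VT-x polarity (a bit set to 1 means the access causes a VM exit, 0 means the port is accessed directly) and uses a single 64K-bit array instead of the two separate bitmaps of the real hardware; the helper names are hypothetical.

```python
def make_io_bitmap():
    """64K ports, one bit per port; 0 = access allowed, 1 = access causes a VM exit."""
    return bytearray(65536 // 8)


def intercept_port(bitmap, port):
    """VMM side: request a VM exit for every access to this port."""
    bitmap[port // 8] |= 1 << (port % 8)


def io_access_exits(bitmap, port, width=1):
    """A multi-byte access exits if any of the ports it touches is intercepted."""
    if port + width > 65536:
        return True              # wrap-around of the port space is treated as an exit here
    return any(bitmap[p // 8] & (1 << (p % 8)) for p in range(port, port + width))


bitmap = make_io_bitmap()
intercept_port(bitmap, 0x3F8)            # e.g. emulate the first serial port
checks = (io_access_exits(bitmap, 0x3F8),
          io_access_exits(bitmap, 0x80),
          io_access_exits(bitmap, 0x3F6, width=4))
```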
1360 Intel VT-x Interrupt Virtualization
1361 -----------------------------------
1363 - VMCS External Interrupt Exiting
1365 - All external interrupts cause VM Exit
- Interrupt Masking instructions executed by the Guest OS do not mask
  external interrupts
1369 - VMCS Interrupt Window Exiting
- VM Exit occurs whenever the Guest OS is ready to serve external interrupts
1373 - Used by VMM to control VM interrupts
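
The way a VMM might use Interrupt Window Exiting can be sketched as a small state machine. This is a simplified illustration: ``ToyVcpu`` and its methods are hypothetical, and details such as the interrupt shadow after STI are ignored.

```python
class ToyVcpu:
    def __init__(self):
        self.guest_if = False            # guest currently masks interrupts (IF = 0)
        self.window_exiting = False      # "interrupt-window exiting" control
        self.pending = []                # virtual interrupts queued by the VMM
        self.delivered = []

    def vmm_queue_interrupt(self, vector):
        """VMM side: inject now if the guest can take it, else ask for a window exit."""
        self.pending.append(vector)
        if self.guest_if:
            self._inject()
        else:
            self.window_exiting = True

    def guest_enable_interrupts(self):
        """Guest side: unmask interrupts. With the control set, this causes a VM exit."""
        self.guest_if = True
        if self.window_exiting:
            self.window_exiting = False
            self._inject()               # done by the VMM in its exit handler

    def _inject(self):
        self.delivered.extend(self.pending)
        self.pending.clear()


vcpu = ToyVcpu()
vcpu.vmm_queue_interrupt(0x21)
before = list(vcpu.delivered)
vcpu.guest_enable_interrupts()
after = list(vcpu.delivered)
```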
- The window makes it possible to delay the hardware interrupt (and
  therefore the VM exit) as long as the guest has not unmasked its
  interrupts.
1380 - VT-x also includes an interrupt-window exiting VM-execution
1381 control. When this control is set to 1, a VM exit occurs whenever
1382 guest software is ready to receive interrupts. A VMM can set this
1383 control when it has a virtual interrupt to deliver to a
1384 guest. Similarly, VT-i includes a PAL service that a VMM can use to
1385 register that it has a virtual interrupt pending. When guest
1386 software is ready to receive such an interrupt, the service
1387 transfers control to the VMM via the new virtual external interrupt
1390 Intel VT-x MMU Virtualization
1391 -----------------------------
1393 - Extended Page Tables (EPT)
1395 - Second level of Page Tables in MMU
1396 - Translate Guest OS Physical Address into Machine Physical Address
1399 - Virtual Processor IDentifier (VPID)
1401 - Used to tag TLB entries
- Avoids flushing the TLB upon VM switch
1404 Virtual Memory Virtualization
1405 -----------------------------
1407 .. figure:: vt-x-mem.svg
1410 Intel VT-x Extended Page Tables (1)
1411 -----------------------------------
1413 - VMM controls Extended Page Tables
1415 - EPT used in VMX non-root operation
1417 - Activated on VM Enter
- Deactivated on VM exit
1420 - EPTP register points to Extended Page Tables
- Instantiated by VMM
1424 - Loaded from VMCS on VM entry
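
The two translation stages can be sketched with flat dictionaries standing in for the multi-level tables. The page numbers below are made up for the example; only the chaining of the guest page table (owned by the Guest OS) and the EPT (owned by the VMM) matters.

```python
PAGE = 4096

guest_page_table = {0x400: 0x10}     # guest virtual page -> guest physical page
ept = {0x10: 0x9A}                   # guest physical page -> machine physical page


def translate(gva):
    """GVA -> GPA with the guest tables, then GPA -> MPA with the EPT."""
    vpn, offset = divmod(gva, PAGE)
    gpn = guest_page_table[vpn]      # a KeyError here models a guest page fault
    mpn = ept[gpn]                   # a KeyError here models an EPT violation (VM exit)
    return mpn * PAGE + offset


mpa = translate(0x400 * PAGE + 0x123)
```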
1427 Intel VT-x Extended Page Tables (2)
1428 -----------------------------------
1430 .. figure:: vt-x-mmu.svg
- The TLB caches both translations: VA->GPA and GPA->MPA
1437 - There is only one downside: nested paging or EPT makes the virtual
1438 to real physical address translation a lot more complex if the TLB
1439 does not have the right entry. For each step we take in the blue
1440 area, we need to do all the steps in the orange area. Thus, four
1441 table searches in the "native situation" have become 16 searches
1442 (for each of the four blue steps, four orange steps).
1444 http://www.anandtech.com/show/2480/10
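
The cost of a nested walk can be counted with a short sketch. It assumes four-level tables on both sides and no hit in the TLB or in any paging-structure cache. Besides the 16 nested lookups quoted above, it also counts the reads of the guest page-table entries themselves and the translation of the final guest physical address.

```python
def nested_walk_references(guest_levels=4, ept_levels=4):
    """Count memory references for one TLB miss with nested paging, no caching."""
    refs = 0
    for _ in range(guest_levels):
        refs += ept_levels       # translate the GPA of the guest table (EPT walk)
        refs += 1                # read the guest page-table entry itself
    refs += ept_levels           # translate the final guest physical address
    return refs


native = 4                                   # four-level walk without virtualization
nested_lookups = 4 * 4                       # the 16 nested steps quoted above
total = nested_walk_references()             # every memory reference of the 2D walk
```

Under these assumptions the worst case is 24 memory references instead of 4, which is why large TLBs and VPID tagging matter so much with EPT.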
1449 .. figure:: tlb-flush-issue.svg
- Two processes in different VMs can use the same virtual address
1457 Intel VT-x Virtual Processor Identifier
1458 ---------------------------------------
1460 - 16-bit VPID used to tag TLB entries
1462 - Enabled by VMM in VMCS
1463 - Unique VPID is assigned by VMM to each VM
1464 - VPID 0 reserved for VMM
1466 - Current VPID is 0x0000 when
1468 - Outside VMX operation
1469 - In VMX root mode operation
1470 - In VMX non-root mode if VPID disabled in VMCS
1472 - VPID loaded from VMCS on VM Enter
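
The effect of the tag can be shown with a toy TLB keyed by (VPID, virtual page). The ``TaggedTlb`` class is a hypothetical illustration; real TLBs are set-associative hardware caches.

```python
class TaggedTlb:
    """TLB keyed by (VPID, virtual page); VPID 0 is reserved for the VMM."""

    def __init__(self):
        self.entries = {}

    def fill(self, vpid, vpn, mpn):
        self.entries[(vpid, vpn)] = mpn

    def lookup(self, vpid, vpn):
        return self.entries.get((vpid, vpn))     # None models a TLB miss


tlb = TaggedTlb()
tlb.fill(vpid=1, vpn=0x400, mpn=0x9A)    # VM 1
tlb.fill(vpid=2, vpn=0x400, mpn=0x5C)    # VM 2 uses the same virtual page
hits = (tlb.lookup(1, 0x400), tlb.lookup(2, 0x400), tlb.lookup(3, 0x400))
```

Entries of both VMs coexist, so switching from one VM to the other does not require a flush.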
- Demo: run Windows in a KVM guest and browse the Device Manager to
  show that the devices are not at all those of my PC. It also makes
  a good transition to the next part.
1481 .. Intel Virtualization Technology for Directed I/O
1482 ================================================
1484 Intel VT-d Principles
1485 ---------------------
1487 - Enable Guest OS to directly manage physical I/O devices
1489 - Guest I/O operations bypass VMM
- In fully transparent mode
1493 - Use native device drivers of Guest OS
1494 - Guest OS unaware of underlying physical memory virtualization
1496 - Enforce isolation between Guest VMs
- Guest OS can only access I/O resources (ports, PCI devices) assigned to it
1499 - PCI I/O device can only perform DMA to machine physical pages assigned to
1500 Guest VM owning that device.
1505 .. figure:: dma-virt.svg
1514 DMA Virtualization Issue
1515 ------------------------
- Guest OS driver sets up the I/O registers of the device with the Guest
  Physical Address of the I/O buffers
1520 - Guest Physical Address must be translated into its corresponding
1521 Machine Physical Address when used for DMA operations by device
1523 - GPA -> MPA translation cannot be done by VMM
- VMM cannot catch device-specific driver operations to set up I/O
1528 - GPA -> MPA translation done by IOMMU on the Bus Controller
1530 Intel VT-d Protection Domains
1531 -----------------------------
1533 - Intel VT-d provides DMA Protection Domain
1535 - Extension of IOMMU translation mechanism
1536 - Isolated context of a subset of the Machine Physical Memory
1537 - Corresponds to the portion of Machine Physical Memory allocated to
1540 - I/O devices associated by VMM with a DMA Protection Domain
1542 - Achieves DMA isolation by restricting memory access of I/O devices
1543 through DMA address translation
1545 Intel VT-d DMA Translation
1546 --------------------------
1548 - VT-d hardware treats address specified in DMA request as DMA Virtual
1551 - DVA = GPA of the VM to which the I/O device is assigned
1553 - VT-d translates the DVA into its corresponding Machine Physical
1556 - Support of multiple Protection Domains
1558 - DVA to MPA translation table per Protection Domain
1559 - Must identify the device issuing a DMA request
1561 VT-d PCI Express North Bridge
1562 -----------------------------
1564 .. figure:: vt-d.svg
1567 PCI DMA Requester Identification
1568 --------------------------------
1570 - Mapping between PCI Device and Protection Domains
1571 - 16-bit PCI DMA Requester Identifier
1573 .. figure:: dma-req-id.svg
1576 - Assigned by PCI configuration software
1577 - Bus # indexes Bus Context Table in Root Context Table
- (Device #, Function #) indexes Device Protection Domain in Bus
  Context Table
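
The lookup described above can be sketched as follows. The 16-bit requester ID is split into bus (8 bits), device (5 bits) and function (3 bits); nested dictionaries stand in for the root and context tables, and the domain names are invented for the example.

```python
def decode_requester_id(rid):
    """Split a 16-bit PCI requester ID into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


def find_domain(root_table, rid):
    """Bus indexes the root table; (device, function) indexes the bus context table."""
    bus, dev, fn = decode_requester_id(rid)
    context_table = root_table.get(bus, {})
    return context_table.get((dev, fn))          # None: no domain for this device


# Hypothetical setup: device 01:00.0 assigned to VM "A", 01:00.1 to VM "B"
root_table = {0x01: {(0, 0): "domain-A", (0, 1): "domain-B"}}
rid = (0x01 << 8) | (0 << 3) | 1                 # bus 1, device 0, function 1
fields = decode_requester_id(rid)
domain = find_domain(root_table, rid)
```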
1581 Device / Protection Domain Mapping
1582 ----------------------------------
1584 .. figure:: device-domain-mapping.svg
1587 Virtual DMA Address Translation
1588 -------------------------------
1590 - DVA ↔ MPA Page Tables similar to IA-32 processor Page Tables
1592 - 4KB or larger page size granularity
1594 - Read/Write permissions
1596 - Protection Domains managed by VMM
1598 - Initialized at VM creation time
- With the same translations as the VM's Extended Page Tables
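
A per-domain translation table with page granularity and read/write permissions can be sketched like this. ``ProtectionDomain`` is a hypothetical class with a flat dictionary in place of multi-level page tables, and the page numbers are made up.

```python
PAGE = 4096


class ProtectionDomain:
    """Per-VM DMA translation table: DVA page -> (machine page, permissions)."""

    def __init__(self):
        self.pages = {}

    def map(self, dva_page, machine_page, perms):
        self.pages[dva_page] = (machine_page, perms)

    def translate(self, dva, access):
        """Return the machine address, or None if the DMA must be blocked."""
        page, offset = divmod(dva, PAGE)
        entry = self.pages.get(page)
        if entry is None or access not in entry[1]:
            return None
        return entry[0] * PAGE + offset


domain = ProtectionDomain()
domain.map(0x10, 0x9A, "rw")         # same GPA -> MPA mapping as in the VM's EPT
domain.map(0x11, 0x9B, "r")
results = (domain.translate(0x10 * PAGE + 8, "w"),
           domain.translate(0x11 * PAGE, "w"),
           domain.translate(0x99 * PAGE, "r"))
```

A DMA write to a read-only page or to an unmapped page is blocked, which is what isolates the VMs from each other's devices.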
1601 VMM and Directed I/O
1602 --------------------
- Unplugs assigned PCI device from VMM driver and resets it
1606 - Associates PCI device with VT-d Protection Domain of the Guest VM
1608 - Maps device memory BARs in Guest VM physical space
1610 - Arranges for OS of Guest VM to probe PCI device(s) assigned to it
- Handles device interrupts and redirects them to Guest VM
- Resets assigned PCI device upon Guest VM shutdown
1616 .. Device Virtualization
1617 =====================
1619 Device Virtualization Principles
1620 --------------------------------
1622 - Share I/O device among multiple Guest VMs
- With no performance loss
1625 - While enforcing VM isolation and protection
1627 - Move device virtualization from the VMM to the device itself
1631 - PF/VF requires support from the device
1633 Ethernet Device Virtual Functions
1634 ---------------------------------
1636 .. figure:: ethernet-dev-virt.svg
1638 Single Root I/O Virtualization
1639 ------------------------------
- SR-IOV capable PCI Device can be partitioned into multiple Virtual
  Functions
1644 - SR-IOV Device appears in PCI configuration space as multiple PCI
1647 - Virtual Functions are "lightweight" PCI functions including
1649 - PCI probing capabilities
1653 - Requires VT-d for DMA virtualization
1655 - Virtual Functions have no configuration resources
1657 SR-IOV Device Management
1658 ------------------------
1660 - VMM manages the physical PCI device
1662 - VMM creates a PCI Virtual Function for each VM
1664 - Includes it into VM PCI configuration space to be probed by OS kernel
1666 - Associates VF with VT-d Protection Domain of the Guest VM
- VMM programs the sharing of physical device resources between VFs
1670 - Virtual Functions managed by specific VF-aware drivers in kernel of
1671 Guest OS (kind of Para-Virtualization)
1673 Intel Niantic Virtual Functions (1)
1674 -----------------------------------
1676 .. figure:: eth-sr-iov.svg
1679 Intel Niantic Virtual Functions (2)
1680 -----------------------------------
- Virtual Devices on Intel Niantic (10 GbE) NICs
1684 - Layer-2 packet filtering based on destination MAC address
1686 - Filters multiple unicast MAC addresses / VLAN identifiers
1688 - Can duplicate Broadcast / Multicast packets for all VFs
1690 - Multiple RX queues per VF (RSS)
1692 - Load balancing between TX packets sent by VFs
1694 - Anti-Spoofing mechanism on transmission
1696 - Source MAC address
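
The filtering and anti-spoofing rules above can be modelled in a few lines. This is a behavioural sketch only: ``ToyNic`` is hypothetical, the MAC addresses are examples, and VLAN filtering, multicast and RSS queues are left out.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class ToyNic:
    def __init__(self, vf_macs):
        self.vf_macs = dict(vf_macs)             # VF index -> MAC assigned by the VMM

    def receive(self, dst_mac):
        """RX: return the VFs that get a copy of the frame."""
        if dst_mac == BROADCAST:
            return sorted(self.vf_macs)          # duplicated for all VFs
        return sorted(vf for vf, mac in self.vf_macs.items() if mac == dst_mac)

    def transmit_allowed(self, vf, src_mac):
        """TX anti-spoofing: a VF may only send with its own source MAC."""
        return self.vf_macs.get(vf) == src_mac


nic = ToyNic({0: "52:54:00:00:00:01", 1: "52:54:00:00:00:02"})
rx = (nic.receive("52:54:00:00:00:02"),
      nic.receive(BROADCAST),
      nic.receive("52:54:00:00:00:99"))
tx = (nic.transmit_allowed(0, "52:54:00:00:00:01"),
      nic.transmit_allowed(0, "52:54:00:00:00:02"))
```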
Pros/Cons of I/O hardware virtualization
----------------------------------------
- Improves I/O performance on physical devices directly managed by Guest VMs
1704 - Only useful in specific configurations
1706 - PCI device Virtual Functions intended to scale, but require locking
1707 of total VM physical memory into machine physical memory
1709 - Not compatible with transparent VM migration
1711 Conclusion / Evolution of Virtualization
1712 ========================================
- Emulation: slow, multi-arch, simulates ISA (full machine) or ABI
- Accelerated emulation: faster, code is executed natively, overhead
  for privileged actions
- Virtual servers: fast and scalable, but same OS and one kernel
- Paravirtualization: fast, needs a modified OS (or drivers)
- HW-assisted virtualization: solves most of the issues
- "needs a modified OS" does not hold for devices
1729 Evolutions of Virtualization
1730 ----------------------------
- Large amounts of data
1735 - Virtualization brings flexibility to data center
- Operating systems in browsers?
1739 - State of OS is stored remotely
1741 - Virtualization on desktops and small devices
- Security (isolates work and personal areas)