.. System Virtualization and OS Virtual Machines slides file, created by
   hieroglyph-quickstart on Mon Oct 28 09:39:30 2013.

=============================================
System Virtualization and OS Virtual Machines
=============================================

:Date: 2013-12-19
:Authors: Ivan Boule, Olivier Matz

Plan
====

Contents
--------

- History of Virtualization
- Virtualization Usage and Taxonomy
- Process Level Virtualization

  - ABI Emulation
  - Virtual Servers

- System Level Virtualization

  - Transparent Hardware Emulation
  - Transparent Hardware Virtualization
  - Paravirtualization
  - Hardware-Assisted Virtualization

- Conclusion

Who am I?
---------

- Olivier Matz
- Software engineer at 6WIND for 10 years
- 6WIND is a software company designing high-performance network software
- http://www.6wind.com
- I mainly develop low-level code: Linux kernel, drivers, and applications
  close to the operating system

History of Virtual Machines
===========================

Sixties: introduction of the IBM/370 series
-------------------------------------------

- Generalization of virtual memory
- Microprogramming of instructions on the small models
- CP/CMS hypervisor

.. figure:: ibm370.jpg
   :width: 60%

.. note::

   - IBM/370: generalization of virtual memory
   - IBM/370: microprogramming of some instructions on the small models
   - IBM/370: the CP/CMS hypervisor (Control Program/Conversational
     Monitoring System) managed virtual machines in which CMS, DOS, or OS
     could be run interchangeably. Offered to customers for the time needed
     to migrate from DOS to OS, it was often kept afterwards because CMS was
     so convenient as a time-sharing system. The VM/370 product, created by
     IBM in the 1970s, allowed several users to time-share a computer
     running the IBM DOS operating system; IBM DOS alone offered no
     time-sharing capability.
   - time sharing between VMs

Eighties: IBM AS/400
--------------------

- Many logical machines in one physical machine
- High-level (virtual) ISA including I/Os (TIMI)
- Takes advantage of advances in hardware without recompilation
- User binaries contain both TIMI instructions and machine instructions
- Easier transition to PowerPC

.. figure:: ibm-as400.jpg
   :width: 40%

.. note::

   - IBM AS/400: an IBM mini-computer from the late 1980s
   - IBM AS/400: several logical machines can be "carved out" of a single
     physical machine.
   - IBM AS/400: a program does not talk directly to the hardware; it uses a
     high-level instruction set (TIMI), which makes the program independent
     of the CPU it runs on. This eased the transition to PowerPC.
   - http://en.wikipedia.org/wiki/IBM_System_i
   - XXX IBM AS/400: why "co-designed VM"? XXX look this up on the Internet
   - emulation of "high-level" CPU instructions XXX check how this works: is
     it a hypervisor or an interpreter?

Nineties and later: application VMs
-----------------------------------

.. figure:: java.png
   :width: 15%

- Java

  - a Java program is compiled into a portable bytecode
  - the JVM is a fictive computer that is able to run this bytecode

- Microsoft Common Language Infrastructure (.Net)

.. note::

   - http://en.wikipedia.org/wiki/Java_virtual_machine

Now: OS virtual machines
------------------------

- Run an operating system virtualized on top of a virtual machine
- Examples:

  - VMware products (virtualized PC on x86)
  - KVM
  - Virtual PC (PC emulation on Mac OS/PowerPC)
  - Many others: Bochs, VirtualBox, QEMU, ...
Virtualization Usages
=====================

System Virtualization Principles
--------------------------------

- Run multiple OS's on the same machine
- By design, an OS assumes it has full control over all physical resources
  of the machine
- Manage sharing/partitioning of machine resources between Guest OS's

  - CPU
  - Physical memory & MMU
  - I/O devices

Goals of System Virtualization
------------------------------

- Reduction of Total Cost of Ownership (TCO)

  - Increase utilization of server resources
  - Spawn new servers "on demand" (e.g. Amazon EC2 and Elastic Load
    Balancer)

- Reduction of Total Cost of Functioning

  - Energy consumption
  - Cooling
  - Occupied space

- Isolation of OS's for security purposes (Qubes, Cells)

.. note::

   - TCO + TCF reduction: discuss the data center case. Live migration,
     elasticity, ... can be mentioned here.
   - Amazon EC2:

     - a customer can create virtual machines on demand
     - Elastic Load Balancer: ELBs distribute the load across EC2 instances
     - Autoscaling: automatically manages the elasticity of one or several
       groups of EC2 instances
     - CloudWatch: tracks and monitors metrics of EC2 instances to send
       notifications or take actions

   - "Qubes" (security) http://qubes-os.org/trac/wiki/QubesScreenshots

     - Based on a secure bare-metal hypervisor (Xen)
     - Networking code sand-boxed in an unprivileged VM (using IOMMU/VT-d)
     - USB stacks and drivers sand-boxed in an unprivileged VM (currently an
       experimental feature)
     - No networking code in the privileged domain (dom0)
     - All user applications run in "AppVMs", lightweight VMs based on Linux
     - Centralized updates of all AppVMs based on the same template
     - Qubes GUI virtualization presents applications as if they were
       running locally
     - Qubes GUI provides isolation between apps sharing the same desktop
     - Secure system boot (optional)

Virtualization in high-throughput network equipment
----------------------------------------------------
.. figure:: high-thput.svg
   :width: 100%

.. note::

   - Initially, the system runs on several legacy boards (plus the
     management board running Linux). When the hardware is upgraded, if the
     new board is powerful enough, the legacy boards can be virtualized
     without modifying the software: data plane + control plane merged onto
     one board.
   - Repeat what was said on the previous slide.

Usages of Virtual Machines
--------------------------

- Server virtualization

  - Web site hosting

- OS/kernel education & training
- OS fault recovery
- OS kernel development

  - Test machine = development host

- Keep backward compatibility of legacy software
- Run applications not supported by the host OS
- OS migration without reinstalling it on new hardware

.. note::

   - time sharing: we want to use several OS's on the same machine; analogy
     with several processes.
   - education & training: think of a lab session, as presented in the Linux
     Mag 140 article on libvirt: each student works on a preconfigured
     virtual machine XXX to proofread
   - backward compatibility: useful for instance when the hardware is no
     longer available.
   - run apps not supported by the host OS: Wine
   - Some services are only available at the OS level (routing, filtering,
     ...). Having several OS's allows duplicating them (e.g. TCP daisy chain
     with virtual routers).

Recovery Servers
----------------

- Another example: one backup server to replace any machine

.. figure:: recovery.png
   :width: 100%

.. note::

   - Virtualization provides high availability at low cost. Most of the
     time it is the software that crashes. A whole network architecture can
     be duplicated:

     - apache
     - mySQL
     - mail
     - etc...

   - A single backup server on the right for all the other servers: no need
     for 8 machines. If one of the 4 servers goes down, the one on the right
     takes over.
   - Indeed, each machine has its own system/network/filtering
     configuration...
     It is not necessarily easy to put the 4 services on a single machine
     without virtualization.

Multi-Core CPU Issues (1)
-------------------------

- CPU power gain

  - No longer achieved through frequency/speed increase
  - But obtained with higher density & multi-core chips

- Many RTOS's designed with a mono-processor assumption

  - Adding multi-processor support is complex & costly
  - Scaling requires time, at best...

- Legacy RT applications also designed for mono-processor

  - Adaptation to multi-processor even more difficult than for the RTOS

.. note::

   - case of applications that are multi-threaded but designed with a
     single-core machine in mind: system virtualization makes it possible
     to parallelize these applications on multi-core physical machines
     (each VM being mono-core), as explained on the next slide.
   - Many applications are still mono-processor. This drastically
     simplifies the way they are coded: no race conditions, no need for
     locks/mutexes. XXX
   - this problem matters less on a standard modern system than on legacy
     or real-time systems: modern systems support multi-core very well, and
     it would be enough to launch several applications simultaneously. XXX
   - some multi-threaded RT applications rely on the fact that there is
     only one CPU, so that 2 threads are never executed truly concurrently

Multi-Core CPU Issues (2)
-------------------------

- OS virtualization allows running multiple instances of mono-processor
  OS's simultaneously on a multi-core CPU

  - Each OS instance runs in a mono-processor Virtual Machine assigned to a
    single CPU core

- No need to change legacy software

  - Scalability managed at the virtualization level

.. note::

   - System virtualization makes it possible to run several instances of a
     non-SMP operating system on a multi-core processor.
   - This can avoid rewriting software designed for a mono-core machine.
     The software in question here is typically RT software or a kernel: for
     a standard application, the problem does not arise.

Virtualization Taxonomy
=======================

.. note::

   taxonomy = inventory

Machine Interfaces
------------------

.. figure:: isa-abi.svg
   :width: 70%

- ISA = Instruction Set Architecture

  - System level interface
  - All CPU instructions, memory architecture, I/O

- ABI = Application Binary Interface

  - Process level interface
  - User-level non-privileged ISA instructions + OS system calls

.. note::

   - ISA: Instruction Set Architecture. The CPU instructions (give
     examples, such as MOV, or CLI/STI to mask interrupts), the devices,
     the MMU (how it must be configured), ... It is the interface used by
     the operating system.
   - ABI: Application Binary Interface. It is the interface that lets a
     process communicate with the outside. It mostly consists of system
     calls (read, write, gettimeofday, execve, sleep). The ABI contains the
     non-privileged instructions + the OS API. Other instructions such as
     cli/sti are not part of the ABI.
   - example of the compatibility layer for a 32-bit application running on
     a 64-bit kernel.

Virtualization Taxonomy
-----------------------

- Virtualization at process level (ABI)

  - Emulation of an Operating System ABI
  - Virtual Servers

- Virtualization at system level (ISA)

  - Standalone vs Hosted Virtualization
  - Machine Emulation vs Machine Virtualization

.. note::

   - a process already runs in a virtual machine provided by the OS, but
     not at the same level. Historically, the goal of a multitasking
     operating system is to provide virtual machines to applications (hence
     to users). Each application "thinks" it is alone on the processor.
     Each application can access resources through system calls, as if it
     were the only one talking to the devices.
     It is up to the operating system to schedule the processes and their
     requests.
   - system virtualization works on the same principle but at a different
     level. The next slides present the different types of virtualization
     (standalone vs hosted, and emulation vs virtualization).

Hosted versus Standalone Virtualization
---------------------------------------

- Hosted Virtualization

  - Hosted VM Monitor (VMM) runs on top of a native OS
  - VMware Workstation, Microsoft VirtualPC, QEMU/KVM, UML

- Standalone Virtualization

  - VMM directly runs on bare hardware
  - VMware ESX, IBM/VM, Xen

- An OS run in a VM is named a Guest OS

.. note::

   - in general, a "hosted" VM does not really access the hardware but
     rather emulated devices
   - the KVM case is ambiguous: the kernel running in root mode really
     executes on the bare hardware.

Hosted Virtualization
---------------------

.. figure:: hosted.svg
   :width: 100%

Example: VMware Workstation (1)
-------------------------------

.. figure:: vmware-wks.svg
   :width: 100%

Example: VMware Workstation (2)
-------------------------------

- Hosted VM
- Unmodified OSes
- Specific device drivers
- x86 only
- Guest OS executed in user mode

Standalone Virtualization
-------------------------

.. figure:: standalone.svg
   :width: 100%

Example: VMware ESX (1)
-----------------------
.. figure:: vmware-esx.svg
   :width: 100%

Example: VMware ESX (2)
-----------------------

- Standalone VMM
- Supports unmodified OS binaries
- Configuration with appropriate device drivers
- x86 only
- No Intel-VT

  - Guest OS runs in user mode

Process Level Virtualization: ABI Emulation
===========================================

Process Level ABI Emulation
---------------------------

- Goal: execute binary applications of a given system **X** on the ABI of
  another system *Y*

  - Emulate system **X** ABI on top of system *Y* ABI
  - Emulation done by application-level code
  - System *Y* must provide services equivalent to those of system **X**
    (file system, sockets, etc...)

- Example: **X** = Windows and *Y* = Linux

.. note::

   - example: Windows' CreateFile() would be emulated by an open() on a
     Unix system

Process Level (ABI) Emulators
-----------------------------

- Wine: run Windows applications on POSIX-compliant operating systems

  - Windows API in userland
  - Adobe Photoshop, Google Picasa, ...

- Cygwin: recompile POSIX applications so they can run under Windows

  - Unix emulation on Windows
  - POSIX library
  - Bash shell + many Unix commands
  - GNU development tool chain (gcc, gdb)
  - X Window, GNOME, Apache, sshd, ...

.. note::

   - **DEMO**: launch a .exe with wine64
   - the ABI depends on the operating system but also on the architecture.

     - system calls are different between Linux and Windows
     - and system calls are not invoked the same way on 2 different
       architectures.
       For instance, on x86, an INT 0x80 is used (actually SYSENTER
       nowadays), and the arguments are placed in specific registers.

   - Google Picasa for Linux includes an embedded version of Wine

Process Level Cross-architecture Emulators
------------------------------------------

- Emulate the Operating System ABI
- Emulated OS and native OS are the same (e.g. both are Linux)
- Emulated architecture is different from the native architecture (e.g.
  x86 and PowerPC)
- Note: we define what "emulation" is later in the presentation
- Example: qemu-user

.. code-block:: sh

   $ gcc hello.c
   $ ./a.out
   hello
   $ powerpc-linux-gnu-gcc -static hello.c
   $ ./a.out
   bash: ./a.out: cannot execute binary file
   $ qemu-ppc ./a.out
   hello

.. note::

   - for instance, you get hold of a Freebox or a MIPS- or ARM-based
     router, and you want to run and debug an application on it.

Process Level Virtualization: Virtual Servers
=============================================

Virtual Servers (1)
-------------------

- Single OS kernel / multiple resource instances

  - can run several Linux distributions on the same kernel

- Isolated kernel execution environments

  - Root file system
  - Network: routing table, IP tables, interfaces...
  - Process signals

- Solaris 10 Containers
- LXC, Linux-VServer, OpenVZ: namespaces and cgroups
- FreeBSD Jail

.. note::

   - all processes are seen by the kernel
   - the processes have different views of the operating system and are
     isolated from each other. They are not aware of the adjacent domains
     and have different views of the system (FS, network, ...).
   - Linux namespaces are a good example (LXC, OpenVZ).
   - XXX think about a demo... ?
   - explain how this can be implemented in the kernel: an extra parameter
     for each system call
   - say that security-wise, it is not quite there yet for isolation.
   - see the drawing on the next slide
   - signal -> table of processes?

Virtual Servers (2)
-------------------
.. figure:: virtual-servers.svg
   :width: 100%

Virtual Servers (3)
-------------------

- Pros

  - CPU independent
  - Lightweight

    - Low memory footprint
    - Low CPU overhead

  - Scalable

- Cons

  - No OS heterogeneity
  - Single OS binary instance (common point of failure)

System Level Virtualization: Transparent Hardware Emulation
===========================================================

Transparent Hardware Emulation (1)
----------------------------------

- Run unmodified OS binaries

  - Includes emulation of physical devices

- Cross-ISA Emulation

  - qemu-system

- Same-ISA Emulation

  - VirtualBox (Intel x86)

Transparent Hardware Emulation (2)
----------------------------------

- Emulate machine **X** on top of machine *Y*
- Interpretation: read, decode, execute

  - 1 instruction of **X** executed by N instructions of *Y*
  - Huge slow-down method

- Dynamic Binary Translation

  - Convert blocks of **X** instructions into *Y* instructions
  - Conversion is done once per basic block
  - Advanced: dynamic optimization of 'hot' blocks

- The emulator is usually a standard application running on a native OS

.. note::

   - Explain how an emulator can be implemented: it is a big switch/case;
     each instruction must be parsed and its behaviour must be emulated.
     The emulator must keep the state of the registers in variables.
   - That is why block-level code translation was introduced. Note that
     dynamic translation is only done on the fly; it is much harder to take
     a binary, convert it, and then execute it (static translation).
   - https://en.wikipedia.org/wiki/Binary_translation
   - Dynamic binary translation looks at a short sequence of code
     (typically a single basic block), then translates it and caches the
     resulting sequence.
   - Code is only translated as it is discovered and when possible, and
     branch instructions are made to point to already translated and saved
     code (memoization).
   - Apple Computer implemented a dynamic translating emulator for M68K
     code in its PowerPC line of Macintoshes, which achieved a very high
     level of reliability, performance and compatibility
   - Intel: IA-32 over Itanium

QEMU: Hosted Hardware Emulator
------------------------------

- Cross-ISA Emulation

  - Emulate machine **X** on top of machine *Y*
  - Interpretation + translation

- Intel x86, PowerPC, ARM, MIPS, SPARC, ...

  - Emulation of SMP architectures

- Emulates physical I/O devices

  - Hard disk drives, CD-ROM, network controllers, USB controllers, ...
  - Synchronous emulation of device I/O operations

.. note::

   - **DEMO**: launch Kid Icarus with mednafen

     - ``mednafen -vdriver sdl -nes.xscale 4 -nes.yscale 4 ~/cours_ivan/cours_virt/Kid\ Icarus\ \(Europe\)\ \(Rev\ A\).zip``
     - see /usr/share/doc/mednafen/mednafen.html
     - http://idoc64.free.fr/ASM/instruction.htm
     - QSDZ = directions, ret = start, tab = select, OP = buttons
     - Alt-D shows the debugger
     - address A6 decreases when losing lives
     - shift-W: write breakpoint, R to run
     - we can try poking a big value: Poke A6 30 1

       - does not work, because it saturates

     - breakpoint at A6
     - shift-P (poke in ROM): ED45 60 1 (we put an RTS); that is the place
       where it saturates
     - Poke A6 30 1
     - address DB6C is where A6 is stored after being hit by a monster::

          LDA A6   ; load the value
          SEC      ; set carry
          SBC      ; subtract with carry
          BCS      ; branch on carry set (so if it would go < 0, it is set to 0)

     - 7E42 0 1 -> sets the monsters' decrement to 0

System Level Virtualization: Transparent Hardware Virtualization
================================================================

Transparent Hardware Virtualization
-----------------------------------

- Guest and host architectures are the same
- Execute native/unmodified OS binary images
- Provide in each VM a complete simulation of the hardware

  - Full CPU instruction set
  - Interrupts, exceptions
  - Memory access and MMU
  - I/O devices

- Share machine resources among multiple VMs
.. note::

   - this slide describes the problem, which is the same as for emulation
   - maybe also give examples such as kqemu or VirtualBox (acceleration
     modules). Say also that this still does not involve Intel VT, and that
     it is faster than emulation.
   - share machine resources: example of copy-on-write memory pages

Full CPU Virtualization (1)
---------------------------

- Present the same functional CPU to all Guest OSes
- VMM manages a CPU context for each vCPU of each VM

  - saved copy of CPU registers
  - representation of the software-emulated CPU context

- VMM shares physical CPUs among all vCPUs of the VMs
- VMM includes a VM scheduler

  - round-robin
  - priority-based

.. note::

   - representation of the software-emulated CPU context: for example,
     knowing whether interrupts are masked or not.

Full CPU Virtualization (2)
---------------------------

- Relationships between a VMM and VMs are similar to relationships between
  a native OS and its applications

  - Guarantee mutual isolation between all VMs
  - Protect the VMM from all VMs

- Directly execute native binary images of Guest OS's in non-privileged
  mode

  - VMM emulates accesses to protected resources performed by Guest OSs

CPU Virtualization
------------------

- Run each Guest OS in non-privileged mode

.. figure:: cpu-virt.svg
   :width: 100%

"Hardware-Sensitive" Instructions
---------------------------------

- Interact with protected hardware resources

  - Privileged Instructions (cannot be executed in user mode)
  - Critical Instructions (can be, but should not be, executed by a Guest
    OS)

- Must be detected and faked by the VMM

  - Dynamic Binary Translation of kernel code
  - Done once, saved in a Translation Cache
  - Example: VMware

.. note::

   - privileged instructions: e.g. interrupt masking
   - critical instructions: e.g. reads of status flags, of CR3, ...
Privileged Instructions Virtualization
--------------------------------------

- Only allowed in supervisor mode

  - Ex: **cli/sti** to mask/unmask interrupts on Intel x86

- When executed in non-privileged mode

  - CPU automatically detects a privilege violation
  - Triggers a "privilege-violation" exception
  - Caught by the VMM, which fakes the expected effect of the privileged
    instruction
  - Ex: **cli/sti**

    - VMM does not mask/unmask CPU interrupts
    - records the "interrupt mask status" in the context of the VM

Critical Instructions Virtualization (1)
----------------------------------------

- Hardware-sensitive instructions

  - Ex: Intel IA-32 pushf/popf::

       pushf   /* save EFLAGS reg. to stack */
       cli     /* mask interrupts => clear EFLAGS.IF */
       ...
       popf    /* restore EFLAGS reg. => unmask interrupts */

- When executed in non-privileged mode

  - The cli instruction triggers an exception caught by the VMM
    => the VMM records that interrupts are masked for the current VM
  - But there is no exception for popf
    => the VMM is not aware of the Guest OS action (unmasking interrupts)

.. note::

   - first problem: pushf is allowed, and always pushes flags onto the
     stack saying that interrupts are enabled
   - popf must also be intercepted, because the interrupt status must be
     updated

Critical Instructions Virtualization (2)
----------------------------------------

- Must be detected and emulated by the VMM

  - VMM dynamically analyses Guest OS binary code to find critical
    instructions
  - VMM replaces critical instructions with a "trap" instruction to enter
    the VMM
  - VMM emulates the expected effect of the critical instruction, if any.

.. note::

   - **PAUSE**
   - XXX must the translation be done only on code intended to run in
     ring 0?
Full Memory Virtualization
--------------------------

- CPUs include a Memory Management Unit (MMU)

  - Isolated memory addressing spaces
  - Independent of the underlying physical memory layout
  - Run mutually protected applications in parallel

- Virtual Memory managed by the OS kernel

  - Provides a virtual address space to each process

    - 4 GB on most 32-bit architectures (Intel x86, PowerPC)

  - Manages virtual page → physical page mappings
  - Manages "swap" space to extend physical memory

.. note::

   - the MMU is a hardware component

Reminder about MMU (1)
----------------------

- Here is a minimal code example:

.. code-block:: nasm

   ; a program that takes x and y in memory, and computes the sum
   mov eax, [0x200000]   ; load x into eax
   mov ebx, [0x200004]   ; load y into ebx
   add eax, ebx          ; compute x+y in eax
   mov [0x200008], eax   ; store the result in memory

- This program can run on one CPU
- If the addresses are physical, it is not possible to run multiple
  instances of this program, as they would modify the same memory

.. note::

   - a bad solution would be to patch the binary at each execution

Reminder about MMU (2)
----------------------

.. figure:: mmu-slide1.svg
   :width: 70%

Reminder about MMU (3)
----------------------

.. figure:: mmu-slide2.svg
   :width: 95%

Reminder about MMU (4): Intel x86 MMU
-------------------------------------

.. figure:: mmu2.svg
   :width: 100%

Memory Virtualization (1)
-------------------------
.. figure:: mmu-slide3.svg
   :width: 70%

Memory Virtualization (2)
-------------------------

- Machine Physical Memory

  - Physical memory available on the machine

- Guest OS Physical Memory

  - Part of the machine memory assigned to a VM by the VMM
  - ∑ Guest Physical Memory can be > Machine Memory

    - VMM uses "swap" space

- Guest OS Virtual Memory

  - Guest OS manages the virtual address spaces of its processes

Memory Virtualization (3)
-------------------------

- Guest OS manages Guest Physical Pages

  - Manages the MMU with its own page entries
  - Translates Virtual Addresses into Guest Physical Addresses (GPA)

- VMM transparently manages Machine Physical Pages

  - Guest Physical Address ≠ Machine Physical Address
  - VMM dynamically translates Guest Physical Pages into Machine Physical
    Pages

Memory Virtualization (4)
-------------------------

.. figure:: mmu-slide4.svg
   :width: 95%

.. note::

   - switch to a dynamic view, explain how the translations are done, talk
     about the TLB; show the chronological order of events
   - zoom out once, statically: virtual memory vs VM memory vs host
     physical memory; no MMU in this case
   - show dynamically, with a single MMU, how the hypervisor configures the
     MMU; use the same colors for the types of memory; we divert the MMU to
     perform the translation we need

     - put a number in CR3
     - put vertical bars in the page tables
     - draw the TLB as an array with empty lines
     - find -> get
     - wider MMU
     - zoom in on the PTEs on the right
     - show the addresses
     - dissociate the values of virtual and physical addresses, use
       different colors for these addresses
     - see whether we can show only the 20 significant bits and not the 12
       offset bits when talking about addresses

   - When the guest accesses CR3 (or a PTE), this generates a fault,
     handled by the VMM.
     The VMM translates the address given by the VM's OS and fills the CR3
     register with the physical address corresponding to the area used by
     the VM to store its page tables. Every access to this page table must
     generate a fault so that the VMM is notified of any change and can
     configure the real MMU accordingly (performing the address
     translation). (slide 47)
   - Reading CR3 does not generate a trap, so code translation is needed,
     as for pushf and popf.

Memory Virtualization (5)
-------------------------

- VMM maintains Shadow Page Tables

  - Copies of the Guest OS translation tables

- VMM catches update operations on translation tables performed by a Guest
  OS

  - RW-protect all Guest OS page tables
  - Emulates the operation in the shadow page table
  - Updates the effective MMU page table entry, if needed

Memory Virtualization (6)
-------------------------

- PTE entries can be tagged with a context ID

  - Avoids flushing the TLB when switching the current address space upon
    scheduling of a new process
  - usually PTE tag = OS process identifier

- Processes of different Guest OSes can be assigned the same Process ID

  - VMM must flush the TLB when switching VMs

Memory Virtualization (7)
-------------------------

- VMM must respect Guest OS virtual page faults

  - Do not map virtual pages unmapped by the Guest OS
  - When the Guest OS unmaps a virtual page, the VMM must delete the
    associated real-page/physical-page mapping, if any.

- Conversely, the VMM can transparently:

  - Introduce & resolve real-page faults for Guest OSes
  - Share physical pages between Guest OS's

    - Pages with the same contents (e.g.
      zeroed pages)

Memory Virtualization (8)
-------------------------

- VMM can swap out the real pages of a VM

  - onto a "swap" space managed by the VMM

- VMM can dynamically distribute physical memory among VMs

  - Needs specific support in the Guest OS (Linux module)
  - VMM asks a Guest OS to release memory
  - Guest OS self-allocates real pages

    - no longer available for the normal kernel allocation service

  - VMM assigns the same amount of physical pages to other VMs

.. note::

   - ballooning: a kernel module lives in the guests and communicates with
     the VMM. If the VMM needs physical memory for another VM, it can ask
     the module to allocate memory, which is then lost to the other
     services. This memory is "given back" to the VMM.
   - more details and sources needed on this

System Level Virtualization: Paravirtualization
===============================================

CPU Paravirtualization
----------------------

- Still run each Guest OS in non-privileged mode, but with minimal
  virtualization overhead
- OS adapted to avoid binary translation overhead

  - Remove Hardware-Sensitive Instructions, use fast VMM system calls
  - Minimize/avoid usage of Privileged Instructions

- Only affects the machine/CPU-dependent part of the OS

  - OS port to a new architecture with the same CPU, without the system
    ISA

- Examples: Xen legacy, User Mode Linux (UML), coLinux

I/O Paravirtualization
----------------------

- Multiplexing VMM physical devices among VMs

  - Front-end driver in the Guest OS
  - Back-end driver in the VMM
  - Virtual Ethernet, virtual disks

- Fast virtual devices for VM-to-VM communications

  - Example: vmxnet3
  - Data transfer through syscalls, shmem, rings, ...

- Pros: scalability, VM migration

Virtual I/O Devices
-------------------

.. figure:: virt-devices.svg
   :width: 100%

Paravirtualization Example: Xen Legacy
--------------------------------------

- Objectives

  - Scalable
  - Share the resources of server machines

- Intel IA-32, x86-64, ARM, ...
- Special first Guest OS called Domain 0

  - Runs in privileged mode
  - Has access to (and manages) all physical devices
  - Modified version of Linux, FreeBSD

.. note::

   XXX check the Domain 0 part

System Level Virtualization: Hardware-Assisted Virtualization
=============================================================

Hardware Assisted Virtualization (1)
------------------------------------

- Support of virtualization in hardware

  - Run unmodified OS binaries
  - With minimal virtualization overhead
  - Simplify VMM development

- Examples

  - KVM
  - VMware

Hardware Assisted Virtualization (2)
------------------------------------

- CPU virtualization

  - AMD-V
  - Intel VT-x (x86), Intel VT-i (Itanium) architectures
  - ARM Cortex-A15

- MMU virtualization

  - Intel Extended Page Tables (EPT)
  - AMD Nested Page Tables (NPT)

Hardware Assisted Virtualization (3)
------------------------------------

- Directed I/O virtualization

  - IO-MMU (Intel VT-d)

- I/O device virtualization

  - Self-virtualizing devices
  - Single Root I/O Virtualization and Sharing specification (SR-IOV)

    - Extensions to the PCIe (PCI Express) bus standard

Intel VT-x Architecture
-----------------------

- Support unmodified Guest OS's with no need for paravirtualization and/or
  binary code translation
- Simplify VMM tasks & improve VMM performance

  - Minimize VMM memory footprint
  - Suppress shadowing of Guest OS page tables

- Enable a Guest OS to directly manage I/O devices

  - Without performance loss
  - While enforcing VM isolation and mutual protection

Intel VT-x Architecture Overview
--------------------------------
figure:: vt-x.svg
   :width: 100%

Intel VT-x CPU Virtualization (1)
---------------------------------

- Virtual Machine eXtension (VMX)
- Two new meta-modes of CPU operation

  - VMX root mode

    - Behaviour similar to IA-32 without VT
    - Intended for VMM execution

  - VMX non-root mode

    - Alternative IA-32 execution environment
    - Controlled by a VMM
    - Designed to run an unchanged Guest OS in a VM

- Both modes support the ring 0-3 privilege levels

  - Allows the VMM to use several privilege levels

Intel VT-x CPU Virtualization (2)
---------------------------------

- Two additional CPU mode transitions

  - From VMX root mode to VMX non-root mode

    - Named VM Enter (VMLAUNCH instruction)

  - From VMX non-root mode to VMX root mode

    - Named VM Exit (event)

- VM entries & VM exits use a new data structure

  - One Virtual Machine Control Structure (VMCS) per VM CPU (vCPU)
  - Referenced with a physical memory address
  - Format and layout hidden
  - New VT-x instructions to access a VMCS

Intel VT-x CPU Virtualization (3)
---------------------------------

- Guest State Area

  - Saved values of registers before being changed by VM Exits
    (e.g. Segment Registers, CR3, IDTR)
  - Hidden CPU state (e.g., CPU Interruptibility State)

- Host State Area
- VM Control Fields

  - Interrupt virtualization
  - Exception bitmaps
  - I/O bitmaps
  - Model Specific Register R/W bitmaps
  - Execution rights for CPU Privileged Instructions

.. note::

   - The Host State Area is where the CPU state of the VMM is stored. It is
     restored on VM Exit.
   - Switching from root mode to non-root mode is called "VM entry", the
     switch back is "VM exit". The VMCS includes a guest and host state area
     which is saved/restored at VM entry and exit. Most importantly, the
     VMCS controls which guest operations will cause VM exits. The VMCS
     provides fairly fine-grained control over what the guests can and can't
     do. For example, a hypervisor can allow a guest to write certain bits
     in shadowed control registers, but not others. This enables efficient
     virtualization in cases where guests can be allowed to write control
     bits without disrupting the hypervisor, while preventing them from
     altering control bits over which the hypervisor needs to retain full
     control. The VMCS also provides control over interrupt delivery and
     exceptions. Whenever an instruction or event causes a VM exit, the VMCS
     contains information about the exit reason, often with accompanying
     detail. For example, if a write to the CR0 register causes an exit, the
     offending instruction is recorded, along with the fact that a write
     access to a control register caused the exit, and information about the
     source and destination registers. Thus the hypervisor can efficiently
     handle the condition without needing advanced software techniques such
     as VirtualBox's CSAM and PATM. VT-x inherently avoids several of the
     problems which software virtualization faces. The guest has its own
     completely separate address space not shared with the hypervisor, which
     eliminates potential clashes. Additionally, guest OS kernel code runs
     at privilege ring 0 in VMX non-root mode, obviating the problems caused
     by running ring 0 code at less privileged levels. For example the
     SYSENTER instruction can transition to ring 0 without causing problems.
     Naturally, even at ring 0 in VMX non-root mode, any I/O access by guest
     code still causes a VM exit, allowing for device emulation.
   - All of the visible processor state is saved to / restored from memory:

     - all registers, even the control registers
     - the interruptibility state

   - The VMCS describes what the VM is allowed to do
   - I/O bitmaps = bitfields that say which I/O ports (``in`` and ``out``
     instructions) are allowed
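The notes above mention the VMCS I/O bitmaps: one bit per 16-bit I/O port, telling the CPU whether an ``in``/``out`` instruction executed in VMX non-root mode must trap to the VMM. The following is a minimal Python model of that lookup, just to make the mechanism concrete; the function names are invented for this sketch, and the real bitmaps are two 4 KiB pages referenced from the VMCS, consulted in hardware.

```python
# Illustrative model of the VMCS I/O bitmap lookup (not real VMX code).
# A set bit means "IN/OUT on this port causes a VM exit".

IO_BITMAP_SIZE = 0x10000 // 8  # one bit per 16-bit I/O port

def make_io_bitmap(exiting_ports):
    """Build a bitmap where a set bit marks a port that traps to the VMM."""
    bitmap = bytearray(IO_BITMAP_SIZE)
    for port in exiting_ports:
        bitmap[port // 8] |= 1 << (port % 8)
    return bitmap

def io_causes_vm_exit(bitmap, port):
    """Model of the check the CPU performs on a guest IN/OUT instruction."""
    return bool(bitmap[port // 8] & (1 << (port % 8)))

# Hypothetical policy: trap every port, then let the guest drive the
# serial port (0x3F8-0x3FF) directly while still intercepting the
# keyboard controller data port (0x60).
bitmap = make_io_bitmap(range(0x10000))
for port in range(0x3F8, 0x400):
    bitmap[port // 8] &= ~(1 << (port % 8))

print(io_causes_vm_exit(bitmap, 0x60))   # True  -> VM exit, VMM emulates
print(io_causes_vm_exit(bitmap, 0x3F8))  # False -> direct hardware access
```

The same pattern (a control structure consulted in hardware on each sensitive operation) applies to the exception bitmap and the MSR read/write bitmaps listed above.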
Intel VT-x Interrupt Virtualization
-----------------------------------

- VMCS External Interrupt Exiting

  - All external interrupts cause a VM Exit
  - Guest OS cannot mask external interrupts when executing interrupt
    masking instructions

- VMCS Interrupt Window Exiting

  - A VM Exit occurs whenever the Guest OS is ready to serve external
    interrupts
  - Used by the VMM to control VM interrupts

.. note::

   - The interrupt window makes it possible to delay the delivery of a
     hardware interrupt (and thus the VM exit) as long as the guest has not
     unmasked its interrupts.
   - VT-x also includes an interrupt-window exiting VM-execution control.
     When this control is set to 1, a VM exit occurs whenever guest software
     is ready to receive interrupts. A VMM can set this control when it has
     a virtual interrupt to deliver to a guest. Similarly, VT-i includes a
     PAL service that a VMM can use to register that it has a virtual
     interrupt pending. When guest software is ready to receive such an
     interrupt, the service transfers control to the VMM via the new virtual
     external interrupt vector.

Intel VT-x MMU Virtualization
-----------------------------

- Extended Page Tables (EPT)

  - Second level of page tables in the MMU
  - Translates a Guest OS Physical Address into a Machine Physical Address
  - Controlled by the VMM

- Virtual Processor IDentifier (VPID)

  - Used to tag TLB entries
  - Avoids flushing the TLB upon a VM switch

Virtual Memory Virtualization
-----------------------------

.. figure:: vt-x-mem.svg
   :width: 100%

Intel VT-x Extended Page Tables (1)
-----------------------------------

- VMM controls the Extended Page Tables
- EPT used in VMX non-root operation

  - Activated on VM Enter
  - Deactivated on VM Exit

- The EPTP register points to the Extended Page Tables

  - Instantiated by the VMM
  - Saved in the VMCS
  - Loaded from the VMCS on VM Entry

Intel VT-x Extended Page Tables (2)
-----------------------------------

.. figure:: vt-x-mmu.svg
   :width: 100%

..
note::

   - the TLB caches both translations, VA -> GPA and GPA -> MPA
   - There is only one downside: nested paging or EPT makes the virtual to
     real physical address translation a lot more complex if the TLB does
     not have the right entry. For each step we take in the blue area, we
     need to do all the steps in the orange area. Thus, four table searches
     in the "native situation" have become 16 searches (for each of the
     four blue steps, four orange steps).
     http://www.anandtech.com/show/2480/10

TLB Flush Issue
---------------

.. figure:: tlb-flush-issue.svg
   :width: 100%

.. note::

   - two processes in different VMs can use the same virtual address

Intel VT-x Virtual Processor Identifier
---------------------------------------

- 16-bit VPID used to tag TLB entries

  - Enabled by the VMM in the VMCS

- A unique VPID is assigned by the VMM to each VM

  - VPID 0 is reserved for the VMM

- Current VPID is 0x0000 when

  - Outside VMX operation
  - In VMX root mode operation
  - In VMX non-root mode if VPID is disabled in the VMCS

- VPID loaded from the VMCS on VM Enter

.. note::

   - demo: run Windows under KVM and browse the device manager to see that
     the devices are not at all the ones of the physical PC. It also makes
     a good transition to DMA virtualization.

..

Intel Virtualization Technology for Directed I/O
================================================

Intel VT-d Principles
---------------------

- Enables a Guest OS to directly manage physical I/O devices

  - Guest I/O operations bypass the VMM
  - In fully transparent mode

    - Uses the native device drivers of the Guest OS
    - Guest OS unaware of the underlying physical memory virtualization

- Enforces isolation between Guest VMs

  - A Guest OS can only access the I/O resources (ports, PCI devices)
    assigned to it
  - A PCI I/O device can only perform DMA to the machine physical pages
    assigned to the Guest VM owning that device

Intel Directed IO
-----------------

.. figure:: dma-virt.svg
   :width: 100%

DMA Principles
--------------

..
figure:: dma.svg
   :width: 100%

DMA Virtualization Issue
------------------------

- The Guest OS driver sets up the I/O registers of the device with Guest
  Physical Addresses of I/O buffers
- A Guest Physical Address must be translated into its corresponding
  Machine Physical Address when used for DMA operations by the device
- The GPA -> MPA translation cannot be done by the VMM

  - The VMM cannot catch the device-specific driver operations that set up
    I/O buffer addresses

- The GPA -> MPA translation is done by an IOMMU on the bus controller

Intel VT-d Protection Domains
-----------------------------

- Intel VT-d provides DMA Protection Domains

  - Extension of the IOMMU translation mechanism
  - Isolated context of a subset of the Machine Physical Memory
  - Corresponds to the portion of Machine Physical Memory allocated to a VM

- I/O devices are associated by the VMM with a DMA Protection Domain
- Achieves DMA isolation by restricting the memory accesses of I/O devices
  through DMA address translation

Intel VT-d DMA Translation
--------------------------

- VT-d hardware treats the address specified in a DMA request as a DMA
  Virtual Address (DVA)

  - DVA = GPA of the VM to which the I/O device is assigned

- VT-d translates the DVA into its corresponding Machine Physical Address
- Support of multiple Protection Domains

  - One DVA to MPA translation table per Protection Domain
  - Must identify the device issuing a DMA request

VT-d PCI Express North Bridge
-----------------------------

.. figure:: vt-d.svg
   :width: 100%

PCI DMA Requester Identification
--------------------------------

- Mapping between PCI devices and Protection Domains
- 16-bit PCI DMA Requester Identifier

.. figure:: dma-req-id.svg
   :width: 80%

- Assigned by the PCI configuration software
- Bus # indexes the Bus Context Table in the Root Context Table
- (Device #, Function #) indexes the Device Protection Domain in the Bus
  Context Table

Device / Protection Domain Mapping
----------------------------------

..
figure:: device-domain-mapping.svg
   :width: 100%

Virtual DMA Address Translation
-------------------------------

- DVA ↔ MPA Page Tables similar to the IA-32 processor Page Tables

  - 4KB or larger page size granularity
  - Read/Write permissions

- Protection Domains managed by the VMM

  - Initialized at VM creation time
  - With the same translations as the VM's Extended Page Tables

VMM and Directed I/O
--------------------

- Unplugs the assigned PCI device from its VMM driver and resets it
- Associates the PCI device with the VT-d Protection Domain of the Guest VM
- Maps the device memory BARs into the Guest VM physical space
- Arranges for the OS of the Guest VM to probe the PCI device(s) assigned
  to it
- Handles device interrupts and redirects them to the Guest VM
- Resets the assigned PCI device upon Guest VM shutdown

..

Device Virtualization
=====================

Device Virtualization Principles
--------------------------------

- Share an I/O device among multiple Guest VMs

  - With no performance loss
  - While enforcing VM isolation and protection

- Move device virtualization from the VMM to the device itself

  - PCIe extension
  - PF/VF requires support from the device

Ethernet Device Virtual Functions
---------------------------------

..
figure:: ethernet-dev-virt.svg

Single Root I/O Virtualization
------------------------------

- An SR-IOV capable PCI device can be partitioned into multiple Virtual
  Functions

  - The SR-IOV device appears in the PCI configuration space as multiple
    PCI Virtual Functions

- Virtual Functions are "lightweight" PCI functions including

  - PCI probing capabilities
  - DMA streams
  - Interrupts

- Requires VT-d for DMA virtualization
- Virtual Functions have no configuration resources

SR-IOV Device Management
------------------------

- The VMM manages the physical PCI device
- The VMM creates a PCI Virtual Function for each VM

  - Includes it in the VM PCI configuration space, to be probed by the OS
    kernel of the Guest VM
  - Associates the VF with the VT-d Protection Domain of the Guest VM

- The VMM programs the sharing of physical device resources between VFs
- Virtual Functions are managed by specific VF-aware drivers in the kernel
  of the Guest OS (a kind of paravirtualization)

Intel Niantic Virtual Functions (1)
-----------------------------------

.. figure:: eth-sr-iov.svg
   :width: 80%

Intel Niantic Virtual Functions (2)
-----------------------------------

- Virtual devices on Intel Niantic (10 Gb/s) NICs

  - Layer-2 packet filtering based on the destination MAC address
  - Filters multiple unicast MAC addresses / VLAN identifiers
  - Can duplicate broadcast / multicast packets for all VFs
  - Multiple RX queues per VF (RSS)
  - Load balancing between TX packets sent by VFs
  - Anti-spoofing mechanism on transmission

    - Source MAC address
    - VLAN identifier

Pros/Cons of I/O Hardware Virtualization
----------------------------------------

- Improves I/O performance on physical devices directly managed by Guest
  VMs
- Only useful in specific configurations
- PCI device Virtual Functions are intended to scale, but require locking
  the total VM physical memory into machine physical memory
- Not compatible with transparent VM migration

Conclusion / Evolution of Virtualization
========================================

Conclusion
----------

- Emulation: slow, multi-arch, simulates an ISA (full machine) or an ABI
  (process level)
- Accelerated emulation: faster, code is executed natively, overhead for
  privileged actions
- Virtual servers: fast and scalable, but same OS and one kernel
- Paravirtualization: fast, needs a modified OS (or drivers)
- HW-assisted virtualization: solves most of the issues

.. note::

   - "needs a modified OS" is not true for paravirtualized devices alone:
     there, only the drivers are modified

Evolutions of Virtualization
----------------------------

- Cloud computing

  - Large amounts of data
  - Virtualization brings flexibility to the data center

- Operating systems in browsers?

  - State of the OS is stored remotely

- Virtualization on desktops and small devices

  - Security (isolates work and personal areas)

Thanks
------

- Any questions?