Abstract: A system and method to enable PCIe device migration during Virtual Machine (VM) live migration is disclosed. The system includes a management server, a source computing machine, a destination computing machine, a Peripheral Component Interconnect express (PCIe) switch and a PCIe endpoint device. The source computing machine and the destination computing machine each include a Virtual Machine Monitor (VMM) to perform VM live migration. A VM running on the source computing machine is migrated to the destination computing machine along with the PCIe endpoint device. To this end, an emulated PCIe device is presented to the destination computing machine, on behalf of the PCIe endpoint device, by a PCIe switch module running on the management server using connection relationship data. After creating the emulated PCIe device, the connection relationship data is modified with respect to the emulated PCIe device and the destination computing machine.
Claims:
1. A processor-implemented method for migrating a Peripheral Component Interconnect express (PCIe) endpoint device connected to a downstream port of a PCIe switch during Virtual Machine (VM) live migration from a source computing machine to a destination computing machine, the PCIe endpoint device connected to the source computing machine via the PCIe switch, wherein a guest OS running on the source computing machine uses the PCIe endpoint device, and wherein the PCIe switch is connected to a management server having a PCIe switch module running thereon,
wherein the management server exposes a first emulated PCIe device to the source computing machine, a set of PCIe configuration registers and Base Address Register (BAR) registers of the first emulated PCIe device comprises at least a subset of the PCIe configuration registers and BAR registers of the PCIe endpoint device, and
wherein the management server configures a connection relationship, within the PCIe switch, between the first emulated PCIe device and the PCIe endpoint device by using address trap tables, ID routing tables and interrupt redirection tables associated with the PCIe switch, the method comprising:
receiving a VM migration request at a VMM running on the source computing machine;
notifying, by the VMM, receipt of the VM migration request to the PCIe switch module, and suspending the guest OS;
requesting, by the PCIe switch module to the PCIe endpoint device, to stop a Direct Memory Access (DMA) operation of the PCIe endpoint device;
capturing and saving, by the VMM, an execution-related data of the guest OS and a connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine, wherein the connection relationship data between the first emulated PCIe device and the guest OS comprises BAR mapping information, DMA mapping information, and an Interrupt mapping information associated with the source computing machine;
sending the execution-related data and the connection relationship data to a VMM running on the destination computing machine;
emulating and exposing, by the PCIe switch module, a second emulated PCIe device to the destination computing machine, the second emulated PCIe device having a similar configuration as that of the first emulated PCIe device;
initializing the BAR register of the second emulated PCIe device, by the destination computing machine;
creating, by the switch module, a connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch;
creating a DMA remapping in an IO Memory Management Unit (IOMMU) of the destination computing machine with requester ID of the second emulated PCIe device and guest physical address domain of the migrated guest OS;
requesting, by the PCIe switch module running on the management server to the VMM running on the destination computing machine, to reinitialize a plurality of interrupt vectors associated with interrupt redirection table of the second emulated PCIe device exposed to the destination computing machine;
creating, by using the connection relationship data captured from the source computing machine, entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device, the entries in the interrupt redirection table comprising the plurality of interrupt vectors used by the guest OS running on the source computing machine;
creating, by the PCIe switch module, entries in an interrupt redirection table of the PCIe switch to establish a connection relationship between interrupt message addresses used by the second emulated PCIe device and the Interrupt message addresses used by the PCIe endpoint device;
establishing, by the VMM running on the destination computing machine, mapping between the second emulated PCIe device and the PCIe endpoint device configurations exposed to the guest OS by the VMM and
notifying, by the PCIe switch module running on the management server, the PCIe endpoint device to re-enable the DMA operation, and notifying completion of the VM migration to the VMM running on the destination computing machine.
2. The method as claimed in claim 1, wherein the VM migration comprises migration of the PCIe endpoint device from the source computing machine to the destination computing machine, and migration of the guest OS running on the source computing machine to the destination computing machine.
3. The method as claimed in claim 1, further comprising continuing access of the guest OS, running on the source computing machine prior to the VM migration, to the PCIe endpoint device on the destination computing machine after the completion of the VM migration.
4. The method as claimed in claim 1, wherein the address trap tables are used to trap and redirect the memory mapped/IO mapped BAR access request received at one of the first emulated PCIe device and the second emulated PCIe device to address regions of the PCIe endpoint device specified by the PCIe switch module.
5. The method as claimed in claim 1, further comprising establishing mapping of a PCI requestor ID of the PCIe endpoint device and the PCIe requestor ID of the second emulated PCIe device, after the second emulated PCIe device is enumerated by the PCIe subsystem within the destination computing machine, wherein these mappings are used to provide the upstream routes from the PCIe endpoint device to the destination computing machine during the DMA operation.
6. The method as claimed in claim 5, wherein establishing the mapping comprises modifying a PCI requestor ID within a PCI Transaction Layer Packets (TLP) while forwarding the TLPs between the downstream port and upstream ports of the PCIe switch by using the ID routing tables.
7. The method as claimed in claim 1, further comprising initializing, by the PCIe switch module, the message address register and the message data register residing in the MSI capability structure of the PCIe endpoint device within an address range, wherein the PCIe switch intercepts memory write TLPs targeting said address range.
8. The method as claimed in claim 1, further comprising initializing, by the PCIe switch module, the message address register and message data register available in the MSIX table of the PCIe endpoint device within an address range, wherein the PCIe switch intercepts memory write TLPs targeting said address range.
9. The method as claimed in claim 1, further comprising, on generating each one of the MSI and MSIX based interrupts by the PCIe endpoint device, modifying and forwarding, by the PCIe switch, the MSI/MSIX based interrupt TLPs towards the destination computing machine on behalf of the second emulated PCIe device, using the interrupt redirection table present within the memory space of the PCIe switch.
10. The method as claimed in claim 8, wherein each entry in the interrupt redirection table comprises PCIe requester ID of the PCIe endpoint device and the PCIe requester ID of respective emulated PCIe device active for access of a corresponding computing machine, MSI/MSIX address register value of the respective emulated PCIe device corresponding to each interrupt vector used by the corresponding computing machine, and MSI/MSIX data register value of the respective emulated PCIe device corresponding to each interrupt vector used by the corresponding computing machine, and wherein the respective emulated PCIe device for access to the source computing machine comprises the first emulated PCIe device, and the respective emulated PCIe device for access to the destination computing machine comprises the second emulated PCIe device.
11. A system for migrating a PCIe endpoint device connected to a downstream port of a Peripheral Component Interconnect express (PCIe) switch during Virtual Machine (VM) live migration from a source computing machine to a destination computing machine, comprising:
the source computing machine;
the destination computing machine;
the PCIe switch;
a management server; and
the PCIe endpoint device, the PCIe endpoint device connected to the source computing machine and the destination computing machine via the PCIe switch, wherein a guest OS running on the source or destination computing machine uses the PCIe endpoint device, and wherein the PCIe switch is connected to the management server having a PCIe switch module running thereon,
wherein the management server exposes a first emulated PCIe device to the source computing machine, a set of PCIe configuration registers and Base Address Register (BAR) registers of the first emulated PCIe device comprises at least a subset of the PCIe configuration registers and BAR registers of the PCIe endpoint device, and
wherein the management server configures a connection relationship within the PCIe switch between the first emulated PCIe device and the PCIe endpoint device by using address trap tables, ID routing tables and interrupt redirection tables associated with the PCIe switch, and
wherein to migrate the PCIe endpoint device from the source computing machine to the destination computing machine, the system is caused to:
receive a VM migration request at a VMM running on the source computing machine;
notify, by the VMM, receipt of the VM migration request to the PCIe switch module, and suspend the guest OS;
request, by the PCIe switch module to the PCIe endpoint device, to stop a Direct Memory Access (DMA) operation of the PCIe endpoint device;
capture and save, by the VMM, a connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine, wherein the connection relationship data between the first emulated PCIe device and the guest OS comprises BAR mapping information, DMA mapping information, and an Interrupt mapping information;
send an execution-related data and the connection relationship data to a VMM running on the destination computing machine;
emulate and expose, by the PCIe switch module, a second emulated PCIe device to the destination computing machine, the second emulated PCIe device having a similar configuration as that of the first emulated PCIe device;
initialize the BAR register of the second emulated PCIe device by the destination computing machine;
create, by the switch module, the connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch;
create a DMA remapping in an IO memory management unit (IOMMU) of the destination computing machine with requester ID of the second emulated PCIe device and guest physical address domain of the migrated guest OS;
request, by the PCIe switch module running on the management server to the VMM running on the destination computing machine, to reinitialize a plurality of interrupt vectors associated with the interrupt redirection table of the second emulated PCIe device exposed to the destination computing machine;
create entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device using the connection relationship data captured from the source computing machine, the entries in the interrupt redirection table comprising the plurality of interrupt vectors used by the guest OS running on the source computing machine;
create, by the PCIe switch module, entries in an interrupt redirection table of the PCIe switch to establish a connection relationship between interrupt message addresses used by the second emulated PCIe device and the Interrupt message addresses used by the PCIe endpoint device;
establish, by the VMM running on the destination computing machine, mapping between the second emulated PCIe device and the PCIe endpoint device configurations exposed to the guest OS by the VMM and
notify, by the PCIe switch module running on the management server, the PCIe endpoint device to re-enable the DMA operation, and notify completion of the VM migration to the VMM running on the destination computing machine.
12. The system as claimed in claim 11, wherein the VM migration comprises migration of the PCIe endpoint device from the source computing machine to the destination computing machine, and migration of the guest OS running on the source computing machine to the destination computing machine.
13. The system as claimed in claim 11, wherein the guest OS running on the source computing machine to access the PCIe endpoint device prior to the VM migration continues to access the PCIe endpoint device from the destination computing machine after the completion of the VM migration.
14. The system as claimed in claim 11, wherein the address trap tables are used to trap and redirect the memory mapped/IO mapped BAR access request received at one of the first emulated PCIe device and the second emulated PCIe device to the address regions of the PCIe endpoint device specified by the PCIe switch module.
15. The system as claimed in claim 11, wherein the PCIe switch module establishes mapping of a PCI requestor ID of the PCIe endpoint device and the PCIe requestor ID of the second emulated PCIe device, after the second emulated PCIe device is enumerated by the PCIe subsystem within the destination computing machine, wherein these mappings are used to provide the upstream routes from the PCIe endpoint device to the destination computing machine during the DMA operation.
16. The system as claimed in claim 15, wherein to establish the mapping, the PCIe switch module is configured to modify a PCI requestor ID within a PCI transaction layer packets (TLP) while forwarding the TLPs between the downstream port and upstream ports of the PCIe switch by using the ID routing tables.
17. The system as claimed in claim 11, wherein the PCIe switch module is further configured to initialize the message address register and the message data register residing in the MSI capability structure of the PCIe endpoint device within an address range, and wherein the PCIe switch intercepts memory write TLPs targeting said address range.
18. The system as claimed in claim 11, wherein the PCIe switch module initializes the message address register and message data register available in the MSIX table of the PCIe endpoint device within an address range, and wherein the PCIe switch intercepts memory write TLPs targeting said address range.
19. The system as claimed in claim 11, wherein on generation of each one of the MSI and MSIX based interrupts by the PCIe endpoint device, the PCIe switch is configured to modify and forward the MSI/MSIX based interrupt TLPs towards the destination computing machine on behalf of the second emulated PCIe device, using the interrupt redirection table present within the memory space of the PCIe switch.
20. The system as claimed in claim 18, wherein each entry in the interrupt redirection table comprises PCIe requester ID of the PCIe endpoint device and the PCIe requester ID of the corresponding emulated PCIe device active for access of a corresponding computing machine, MSI/MSIX address register value of the emulated PCIe device corresponding to each interrupt vector used by the corresponding computing machine, and MSI/MSIX data register value of the emulated PCIe device corresponding to each interrupt vector used by the corresponding computing machine, wherein the emulated PCIe device for access to the source computing machine comprises the first emulated PCIe device, and the emulated PCIe device for access to the destination computing machine comprises the second emulated PCIe device.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
Title of invention:
SYSTEM AND METHOD TO ENABLE PCIE DEVICE MIGRATION DURING VM LIVE MIGRATION
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The present subject matter relates, in general, to resource sharing and, in particular, to a system and method to enable PCIe device migration during Virtual Machine (VM) live migration.
BACKGROUND
[002] In computing technology, platform virtualization refers to sharing a platform among two or more Operating Systems (OSs) for efficient utilization of resources, for example hardware resources. Examples of the hardware resources associated with platform virtualization may include a processor, storage, networking devices, a video adapter, a serial port, and so on. Platform virtualization is enabled by a Virtual Machine Monitor (VMM), also referred to as a hypervisor. The VMM is configured to run multiple Virtual Machines (VMs). During virtualization, the VMs may be migrated from one computing machine to another. Said migration of VMs amongst computing machines, for example from a source computing machine to a destination computing machine, is referred to as VM migration.
[003] Conventionally, hardware resources such as the CPU, storage, and networking devices can be conveniently virtualized as compared to the virtualization of video adapters and serial ports. However, hardware such as video adapters and serial ports can be virtualized by utilizing Peripheral Component Interconnect (PCI) pass-through technology. PCI pass-through technology allows passing control of hardware devices to VMs. Typically, PCI pass-through devices can provide near-native performance in the VM. However, VM migration with PCI pass-through devices is challenging, as the register state of the PCI pass-through device cannot be replicated in the destination computing machine with a different device.
[004] The virtual machine live migration can be performed for various purposes, for instance, server load balancing, server maintenance, server disaster recovery, server backup, and the like. In virtual machine live migration technology, an execution status and connection relationship data of a VM associated with the source computing machine are captured and saved. After saving the execution status and the connection relationship data, the virtual machine can be quickly migrated to the destination computing machine. The migrated VM resumes operation upon completion of the migration.
[005] Typically, the virtual machine live migration is associated with performance characteristics, such as a migration time and a downtime. The migration time refers to the time duration between initiating the virtual machine live migration and deactivating the VM on the source computing machine before migrating the VM to the destination computing machine. The downtime refers to the time duration between suspending the execution of the VM on the source computing machine and resuming the suspended VM on the destination computing machine. Minimizing the migration time and the downtime during virtual machine live migration may result in an efficient migration.
SUMMARY
[006] The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below. In view of the foregoing, an embodiment herein provides a method and system to enable PCIe device migration during virtual machine live migration.
[007] In one aspect, a processor-implemented method for migrating a Peripheral Component Interconnect express (PCIe) endpoint device connected to a downstream port of a PCIe switch during Virtual Machine (VM) live migration from a source computing machine to a destination computing machine is provided. The PCIe endpoint device is connected to the source computing machine via the PCIe switch, wherein a guest OS running on the source computing machine uses the PCIe endpoint device, and the PCIe switch is connected to a management server having a PCIe switch module running thereon. Herein, the management server exposes a first emulated PCIe device to the source computing machine, and a set of PCIe configuration registers and Base Address Register (BAR) registers of the first emulated PCIe device comprises at least a subset of the PCIe configuration registers and BAR registers of the PCIe endpoint device. Herein, the management server configures a connection relationship, within the PCIe switch, between the first emulated PCIe device and the PCIe endpoint device by using address trap tables, ID routing tables and interrupt redirection tables associated with the PCIe switch. The method includes receiving a VM migration request at the VMM running on the source computing machine. Further, the method includes notifying, by the VMM, receipt of the VM migration request to the PCIe switch module, and suspending the guest OS. Furthermore, the method includes requesting, by the PCIe switch module to the PCIe endpoint device, to stop a Direct Memory Access (DMA) operation of the PCIe endpoint device.
Also, the method includes capturing and saving, by the VMM, execution-related data of the guest OS and connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine, wherein the connection relationship data between the first emulated PCIe device and the guest OS comprises BAR mapping information, DMA mapping information, and interrupt mapping information associated with the source computing machine. Also, the method sends the execution-related data and the connection relationship data to the VMM running on the destination computing machine. Moreover, the method emulates and exposes, by the PCIe switch module, a second emulated PCIe device to the destination computing machine, the second emulated PCIe device having a similar configuration as that of the first emulated PCIe device. Further, the method initializes the BAR register of the second emulated PCIe device, using the destination computing machine. Furthermore, the method creates the connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch. Also, the method creates a DMA remapping in an IO Memory Management Unit (IOMMU) of the destination computing machine with the requester ID of the second emulated PCIe device and the guest physical address domain of the migrated guest OS. Furthermore, the method requests, by the PCIe switch module running on the management server to the VMM running on the destination computing machine, to reinitialize a plurality of interrupt vectors of the second emulated PCIe device.
Furthermore, the method creates, by using the connection relationship data captured from the source computing machine, entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device, the entries in the interrupt redirection table comprising the plurality of interrupt vectors used by the guest OS running on the source computing machine. Also, the method creates, by the PCIe switch module, entries in an interrupt redirection table of the PCIe switch to establish a connection relationship between interrupt message addresses used by the second emulated PCIe device and the interrupt message addresses used by the PCIe endpoint device. Further, the method establishes, by the VMM running on the destination computing machine, mapping between the second emulated PCIe device and the PCIe endpoint device configurations exposed to the guest OS by the VMM, and notifies, by the PCIe switch module running on the management server, the PCIe endpoint device to re-enable the DMA operation, and notifies completion of the VM migration to the VMM running on the destination computing machine.
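The table-driven mechanism summarized above can be illustrated with a minimal sketch. This is not the disclosed implementation: the class and function names (ConnectionRelationship, SwitchTables, migrate_endpoint) and the table layouts are hypothetical, chosen only to model how the connection relationship data captured on the source machine could be replayed into the switch's address trap, ID routing, and interrupt redirection tables on behalf of the second emulated PCIe device:

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionRelationship:
    """Connection relationship data captured from the source machine."""
    bar_mappings: dict       # emulated BAR address -> endpoint BAR address
    dma_mappings: dict       # guest physical page -> host physical page
    interrupt_vectors: list  # interrupt vectors used by the guest OS


@dataclass
class SwitchTables:
    """Routing state assumed to reside inside the PCIe switch."""
    address_trap: dict = field(default_factory=dict)  # traps BAR accesses
    id_routing: dict = field(default_factory=dict)    # requester ID rewrite
    irq_redirect: dict = field(default_factory=dict)  # interrupt redirection


def migrate_endpoint(rel, endpoint_rid, dest_rid, tables):
    """Re-create, for the destination side, the connection relationship
    that previously bound the first emulated device to the endpoint."""
    # Address trap: redirect BAR accesses received at the second emulated
    # device to the address regions of the physical endpoint.
    for emulated_bar, ep_bar in rel.bar_mappings.items():
        tables.address_trap[emulated_bar] = ep_bar
    # ID routing: upstream DMA TLPs from the endpoint are forwarded with
    # the requester ID of the second emulated device.
    tables.id_routing[endpoint_rid] = dest_rid
    # Interrupt redirection: one entry per vector used by the migrated guest.
    for vec in rel.interrupt_vectors:
        tables.irq_redirect[(endpoint_rid, vec)] = (dest_rid, vec)
    return tables
```

In this model, the requester IDs and BAR addresses are placeholder integers; a real switch would program hardware tables rather than Python dictionaries, but the mapping relationships are the same ones the method recites.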
[008] In another aspect, a system for migrating a PCIe endpoint device connected to a downstream port of a Peripheral component Interconnect express (PCIe) switch during Virtual Machine (VM) live migration from a source computing machine to a destination computing machine is provided. The system includes the source computing machine; the destination computing machine; the PCIe switch; a management server; and the PCIe endpoint device. The PCIe endpoint device is connected to the source computing machine and the destination computing machine via the PCIe switch. A guest OS running on the source or destination computing machine uses the PCIe endpoint device. Further the PCIe switch is connected to the management server having a PCIe switch module running thereon. The management server exposes a first emulated PCIe device to the source computing machine. A set of PCIe configuration registers and Base Address Register (BAR) registers of the first emulated PCIe device includes at least a subset of the PCIe configuration registers and BAR registers of the PCIe endpoint device. Herein, the management server configures a connection relationship within the PCIe switch between the first emulated PCIe device and the PCIe endpoint device by using address trap tables, ID routing tables and interrupt redirection tables associated with the PCIe switch. To migrate the PCIe endpoint device from the source computing machine to the destination computing machine, the system is caused to receive a VM migration request at a VMM running on the source computing machine. The VMM notifies receipt of the VM migration request to the PCIe switch module, and suspends the guest OS. In response, the PCIe switch module requests the PCIe endpoint device to stop a Direct Memory Access (DMA) operation of the PCIe endpoint device. 
The VMM captures and saves connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine, wherein the connection relationship data between the first emulated PCIe device and the guest OS comprises BAR mapping information, DMA mapping information, and interrupt mapping information. The VMM running on the source computing machine sends the execution-related data and the connection relationship data to a VMM running on the destination computing machine. The PCIe switch module emulates and exposes a second emulated PCIe device to the destination computing machine, where the second emulated PCIe device has a similar configuration as that of the first emulated PCIe device. The destination computing machine initializes the BAR register of the second emulated PCIe device. The switch module creates the connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch. A DMA remapping is created in an IO memory management unit (IOMMU) of the destination computing machine with the requester ID of the second emulated PCIe device and the guest physical address domain of the migrated guest OS. The PCIe switch module running on the management server requests the VMM running on the destination computing machine to reinitialize a plurality of interrupt vectors associated with the interrupt redirection table of the second emulated PCIe device exposed to the destination computing machine. The PCIe switch module creates entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device using the connection relationship data captured from the source computing machine, where the entries in the interrupt redirection table comprise the plurality of interrupt vectors used by the guest OS running on the source computing machine.
The PCIe switch module creates entries in an interrupt redirection table of the PCIe switch to establish a connection relationship between interrupt message addresses used by the second emulated PCIe device and the interrupt message addresses used by the PCIe endpoint device. The VMM running on the destination computing machine establishes mapping between the second emulated PCIe device and the PCIe endpoint device configurations exposed to the guest OS by the VMM. The PCIe switch module running on the management server notifies the PCIe endpoint device to re-enable the DMA operation, and notifies completion of the VM migration to the VMM running on the destination computing machine.
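The interrupt redirection step can be sketched as a pure function over TLP fields. Again a hedged illustration rather than the actual switch logic: the dictionary-based TLP and table layouts are assumptions made for clarity. An upstream MSI/MSIX memory-write TLP from the endpoint is matched against the redirection table and rewritten so that it appears to originate from the emulated device active for the destination machine, carrying the message address and data that the destination VMM programmed for the corresponding vector:

```python
def redirect_msi_tlp(tlp, irq_table):
    """Rewrite an upstream MSI/MSIX memory-write TLP on behalf of the
    second emulated PCIe device.

    irq_table maps (endpoint requester ID, endpoint MSI address) to the
    emulated device's requester ID and the MSI address/data values that
    the destination machine programmed for the interrupt vector.
    """
    key = (tlp["requester_id"], tlp["address"])
    entry = irq_table.get(key)
    if entry is None:
        # Not a trapped interrupt address: forward the TLP unmodified.
        return dict(tlp)
    return {
        "requester_id": entry["emulated_rid"],  # second emulated device's ID
        "address": entry["msi_address"],        # destination MSI address
        "data": entry["msi_data"],              # destination vector data
    }
```

The design choice modeled here is that the switch, not the endpoint, owns the translation: the endpoint keeps writing to the address range the switch module initialized in its MSI capability structure or MSIX table, and only TLPs targeting that trapped range are rewritten.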
[009] In yet another aspect, a non-transitory computer-readable medium having embodied thereon a computer program for executing a method for migrating a Peripheral Component Interconnect express (PCIe) endpoint device connected to a downstream port of a PCIe switch during Virtual Machine (VM) live migration from a source computing machine to a destination computing machine is provided. The PCIe endpoint device is connected to the source computing machine via the PCIe switch, wherein a guest OS running on the source computing machine uses the PCIe endpoint device, and the PCIe switch is connected to a management server having a PCIe switch module running thereon. Herein, the management server exposes a first emulated PCIe device to the source computing machine, and a set of PCIe configuration registers and Base Address Register (BAR) registers of the first emulated PCIe device comprises at least a subset of the PCIe configuration registers and BAR registers of the PCIe endpoint device. Herein, the management server configures a connection relationship, within the PCIe switch, between the first emulated PCIe device and the PCIe endpoint device by using address trap tables, ID routing tables and interrupt redirection tables associated with the PCIe switch. The method includes receiving a VM migration request at the VMM running on the source computing machine. Further, the method includes notifying, by the VMM, receipt of the VM migration request to the PCIe switch module, and suspending the guest OS. Furthermore, the method includes requesting, by the PCIe switch module to the PCIe endpoint device, to stop a Direct Memory Access (DMA) operation of the PCIe endpoint device.
Also, the method includes capturing and saving, by the VMM, execution-related data of the guest OS and connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine, wherein the connection relationship data between the first emulated PCIe device and the guest OS comprises BAR mapping information, DMA mapping information, and interrupt mapping information associated with the source computing machine. Also, the method sends the execution-related data and the connection relationship data to the VMM running on the destination computing machine. Moreover, the method emulates and exposes, by the PCIe switch module, a second emulated PCIe device to the destination computing machine, the second emulated PCIe device having a similar configuration as that of the first emulated PCIe device. Further, the method initializes the BAR register of the second emulated PCIe device, using the destination computing machine. Furthermore, the method creates the connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch. Also, the method creates a DMA remapping in an IO Memory Management Unit (IOMMU) of the destination computing machine with the requester ID of the second emulated PCIe device and the guest physical address domain of the migrated guest OS. Furthermore, the method requests, by the PCIe switch module running on the management server to the VMM running on the destination computing machine, to reinitialize a plurality of interrupt vectors of the second emulated PCIe device.
Furthermore, the method creates, by using the connection relationship data captured from the source computing machine, entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device, the entries in the interrupt redirection table comprising the plurality of interrupt vectors used by the guest OS running on the source computing machine. Also, the method creates, by the PCIe switch module, entries in an interrupt redirection table of the PCIe switch to establish a connection relationship between the interrupt message addresses used by the second emulated PCIe device and the interrupt message addresses used by the PCIe endpoint device. Further, the method establishes, by the VMM running on the destination computing machine, a mapping between the second emulated PCIe device and the PCIe endpoint device configurations exposed to the guest OS by the VMM. Finally, the method notifies, by the PCIe switch module running on the management server to the PCIe endpoint device, to re-enable the DMA operation, and notifies completion of VM migration to the VMM running on the destination computing machine.
BRIEF DESCRIPTION OF THE FIGURES
[0010] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and modules.
[0011] FIG. 1A illustrates a block diagram of a system prior to PCIe device migration, in accordance with an example embodiment of the present disclosure;
[0012] FIG. 1B illustrates a block diagram of the system of FIG. 1A showing PCIe device migration during Virtual Machine (VM) live migration, in accordance with an example embodiment of the present disclosure;
[0013] FIG. 1C illustrates a block diagram of the system of FIGS. 1A and 1B after the PCIe device migration during Virtual Machine (VM) live migration, in accordance with an example embodiment of the present disclosure;
[0014] FIG. 2A illustrates an example representation of address trapping to enable PCIe device migration during Virtual Machine (VM) live migration, in accordance with an example embodiment of the present disclosure;
[0015] FIG. 2B illustrates an example representation of interrupt redirection to enable PCIe device migration during Virtual Machine (VM) live migration, in accordance with an example embodiment of the present disclosure;
[0016] FIG. 3A and FIG. 3B illustrate an example flow diagram to enable PCIe device migration during virtual machine live migration, in accordance with an example embodiment of the present disclosure;
[0017] FIG. 4 illustrates an example flow diagram for emulating a PCIe device inside the PCIe switch during virtual machine live migration, in accordance with an example embodiment of the present disclosure; and
[0018] FIG. 5 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
[0019] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0020] Virtual Machine live migration refers to migrating a running VM between different physical computing machines without disconnecting a client application running on the VM. Typically, while migrating a running VM between different physical computing machines, the memory (for example, RAM), the storage (for example, hard disk), and the connection relationship data associated with one physical computing machine are migrated to another physical computing machine. Herein, the connection relationship data refers to a plurality of redirection data associated with migrating a running VM between different physical machines. Typically, during the migration process, the VM has to be suspended for a fraction of time before resuming on the other computing machine. Suspending the VM results in degradation of performance parameters such as migration time, downtime of the VM undergoing migration, and so on.
[0021] The present subject matter provides various embodiments that facilitate overcoming the performance degradation of the VM undergoing migration, reflected in the migration time and the downtime. In various embodiments, the same PCIe endpoint device is shared by different computing machines by migrating the PCIe endpoint device without disconnecting an application running on the guest OS. An implementation of the virtual machine live migration is described further in detail with reference to FIGS. 1A through 5.
[0022] A glossary of technical terms used for defining the embodiments is provided below.
[0023] PCI device: A PCI device is a hardware device plugged into a PCI slot of a motherboard through a PCI bus. Examples of a PCI device include, but are not limited to, audio, video and networking devices, and the like.
[0024] PCI express: PCI express (PCIe) is a high-speed serial computer expansion bus standard designed to replace the PCI bus standard. The PCIe provides low latency and high data rate transfers. The devices connected to the motherboard through a PCIe link have a dedicated point-to-point link. A plurality of peripheral devices connected through PCIe include graphics cards, Network Interface Cards (NIC) and the like.
[0025] Requester: A function that first introduces a transaction sequence into the PCIe domain.
[0026] Requester ID: A combination of a requester’s Bus number, Device number and Function number that uniquely identifies the requester.
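For illustration, the packing of the Bus, Device and Function numbers into the 16-bit requester ID can be sketched as follows; the helper names are illustrative and not part of any PCIe software interface:

```python
def encode_requester_id(bus: int, device: int, function: int) -> int:
    """Pack the 8-bit Bus, 5-bit Device and 3-bit Function numbers
    into the 16-bit requester ID carried in PCIe TLP headers."""
    assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
    return (bus << 8) | (device << 3) | function

def decode_requester_id(rid: int):
    """Recover the Bus/Device/Function (BDF) numbers from a requester ID."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7
```

For example, bus 3, device 0, function 1 encodes to the requester ID 0x0301, which uniquely identifies that function within the PCIe hierarchy.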
[0027] Configuration space: The PCI device includes a set of registers referred to as the configuration space, and the configuration space registers are mapped to a plurality of memory locations. The PCIe extends the configuration space for devices, and the devices are associated with device drivers. A plurality of device drivers and diagnostic software can access the configuration space, and Operating Systems can access the configuration space using an Application Programming Interface (API).
[0028] MSIX table: Each entry of the MSIX table includes a message address register, a message data register and a mask associated with an interrupt vector of a PCIe device. The MSIX table is stored in a memory mapped Base Address Register (BAR) space or in an IO mapped BAR space.
[0029] PCIe hierarchy: A simple PCIe hierarchy is composed of a Root Complex (RC), a plurality of end points (I/O devices), PCIe switches, PCI/PCI-X bridges and a plurality of PCIe links. The PCIe bridges provide an interface to other PCI or PCIe links. For brevity of description, hereafter, the PCIe link can be alternatively called as PCIe bus. Also, for brevity of description, hereafter, the PCIe devices can be alternatively called as PCIe endpoint devices.
[0030] Root Complex (RC): An interface between a Central Processing Unit (CPU) and the PCIe buses. The RC can support a plurality of PCIe ports and every port defines a separate hierarchy domain. Moreover, the RC aggregates a plurality of PCIe hierarchy domains into a single PCIe hierarchy.
[0031] PCIe switch: The PCIe switch has a plurality of upstream and downstream PCIe ports and allows connecting a plurality of devices to the plurality of PCIe ports. Additionally, the PCIe switch routes the connections between different physical computing machines.
[0032] Enumeration: An enumeration process enables configuration software to discover a PCIe device and assign Bus Device Function (BDF) identifier in the PCIe hierarchy. Additionally, the required device drivers are initialized to enable functioning of the identified PCIe devices.
[0033] Management server: A management server is a physical computing machine configured to manage a plurality of connection relationship data associated with the virtual machine live migration using a PCIe switch module. Additionally, the management server is configured to run the PCIe switch module to manage the PCIe switch and the plurality of connection relationship data.
[0034] Virtual Machine (VM): A Virtual Machine (VM) is configured to run an Operating System (OS) and a plurality of application programs that are configured to access a plurality of hardware resources.
[0035] Virtual Machine Monitor (VMM): The VMM is computer software or firmware used to create and run a plurality of VMs.
[0036] Referring collectively to FIGS. 1A-1C, the process of Virtual Machine live migration to migrate a VM from a source computing machine to a destination computing machine, is illustrated. Particularly, FIGS. 1A-1C illustrates different states of a system 100 during Virtual machine live migration. For example, FIG. 1A illustrates the state of the system 100 before Virtual machine live migration, FIG. 1B illustrates the state of the system 100 during Virtual machine live migration and FIG. 1C illustrates the state of the system 100 after Virtual machine live migration is complete.
[0037] The system 100 includes a PCIe switch 108, a management server 102, a source computing machine 104, a destination computing machine 106 and a PCIe endpoint device 114. The management server 102 is connected to the management port of the PCIe switch 108 through a PCIe bus 120A. The source computing machine 104 and the destination computing machine 106 are connected to corresponding host ports of the PCIe switch 108 through the PCIe buses 120B and 120C, respectively. The PCIe endpoint device 114 to be migrated is connected to a downstream port of the PCIe switch 108 through the PCIe bus 120D, and both the PCIe switch 108 and the PCIe endpoint device 114 are enumerated by the management server 102. The source computing machine 104 and the destination computing machine 106 are configured to operate the Virtual Machine Monitor (VMM) and the plurality of Virtual Machines (VM). For brevity of description, hereafter the VM can be alternatively called the guest OS. Also, for brevity of description, the source computing machine and the destination computing machine running the guest OS can be alternatively called host machines. Also, for brevity of description, a first emulated PCIe device and a second emulated PCIe device can be alternatively and collectively referred to as emulated PCIe devices.
[0038] In an embodiment, the VM live migration includes migration of the PCIe endpoint device 114 from the source computing machine 104 to the destination computing machine 106, and migration of the guest OS 124 running on the source computing machine 104 to the destination computing machine 106.
[0039] In an embodiment, the PCIe endpoint device 114 and the host computing machine communicate through a plurality of Transaction Layer Packets (TLPs). The plurality of TLPs includes memory TLPs, Input / Output (IO) TLPs, configuration TLPs and message TLPs. Memory TLPs are utilized to transfer data to or from a location in the system memory map. IO TLPs are used to transfer data to and from a location in the system IO map. The configuration TLPs are used to transfer data to or from a location in the configuration space of a PCIe device. The message TLPs are used for event reporting.
[0040] Referring now to FIG. 1A, the state of the system 100 before initiating Virtual machine live migration is illustrated. Here, the VMM 126 runs the guest OS 124 on the source computing machine 104. The management server 102 creates the first emulated PCIe device 110 by capturing and responding to the configuration TLPs originating from the source computing machine 104 towards the host port of the PCIe switch 108. The configuration space registers present with the first emulated PCIe device 110 are created and manipulated by the management software, and the capabilities present with the first emulated PCIe device 110 depend on the capabilities of the PCIe endpoint device 114. The first emulated PCIe device 110 is associated with a set of configurations that are at least a subset of the configurations of the PCIe endpoint device 114.
[0041] The VMM 126 running on the source computing machine 104 exposes the first emulated PCIe device 110 to the guest OS 124 through a PCIe direct device assignment method. The PCIe direct device assignment allows the guest OS 124 to directly control the PCIe device with minimal intervention of the VMM 126. While performing the direct PCIe device assignment, the VMM 126 creates a set of configurations for each PCIe endpoint device and exposes this configuration as a PCIe endpoint device to the guest OS 124. The set of configurations includes the BAR mapping information, the DMA mapping information and the interrupt mapping information associated with the PCIe endpoint device 114 when it is directly exposed to the guest OS 124.
[0042] The host computing machines access the PCIe endpoint device 114 through virtualization. Here, the management server 102 creates a virtual connection between the host machine and the PCIe endpoint device 114 by creating the first emulated PCIe device 110 in the PCIe switch 108. The first emulated PCIe device 110 is associated with a set of configurations that are at least a subset of the configurations of the PCIe endpoint device 114. For example, if the PCIe endpoint device 114 has both the MSIX capability and the power management capability, then the PCIe switch module, for example, a PCIe switch module 132, may select a subset of the capabilities of the PCIe endpoint device 114 and expose them to the host computing machines through the emulated PCIe device. In one implementation, the management server 102 can opt to include only the MSIX capability with the emulated PCIe device.
[0043] The source computing machine 104 communicates with the PCIe endpoint device 114 and vice-versa by utilizing a plurality of redirection tables 112 associated with the PCIe switch 108. The plurality of redirection tables 112 includes an address trap table, ID routing table and an interrupt redirection table.
[0044] Address traps can appear in the ingress of host ports and downstream ports of the PCIe switch 108, and can be used to either redirect or trap the memory access requests coming from the host ports and downstream ports. The entries in the address trap table are as follows: a base address (the address for which the address trap has to be configured), a base address mask (defines the size of the address trap aperture; the aperture starts from the base address), a remap address (the address to which the base address has to be remapped while forwarding the TLPs), a PCIe requester ID (the requester ID to be used for forwarding the TLPs) and a port ID (specifies the target port ID while forwarding the TLPs). Here, the PCIe requester ID is the BDF number of the PCIe device requesting a transaction. An additional flag (for example, a trap flag) is associated with each entry in the address trap table to set a trapping behavior by the PCIe switch module 132. With the trapping behavior set, whenever a TLP hits the address trap, the PCIe switch module 132 gets notified through an interrupt.
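The trap-and-redirect behavior of the address trap table can be sketched in a minimal model; the field names and the contiguous-aperture match rule below are illustrative assumptions rather than the switch's actual hardware interface:

```python
from dataclasses import dataclass

@dataclass
class AddressTrapEntry:
    base: int           # base address of the trap aperture
    size: int           # aperture size, derived from the base address mask
    remap: int          # remap address used while forwarding TLPs
    requester_id: int   # PCIe requester ID stamped on forwarded TLPs
    port_id: int        # target port ID while forwarding TLPs
    trap: bool = False  # trap flag: notify the switch module instead

def apply_address_trap(table: list, addr: int):
    """Match a memory TLP address against the address trap table.

    Returns ("forward", remapped_addr, entry) on a redirect hit,
    ("trap", addr, entry) when the trap flag is set (the switch module
    would be notified through an interrupt), or ("miss", addr, None)."""
    for entry in table:
        if entry.base <= addr < entry.base + entry.size:
            if entry.trap:
                return ("trap", addr, entry)
            return ("forward", entry.remap + (addr - entry.base), entry)
    return ("miss", addr, None)
```

A TLP that lands inside a redirect aperture is forwarded with its offset preserved relative to the remap address, mirroring how a BAR region of the emulated device is redirected to the endpoint's BAR region.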
[0045] The ID routing table associated with the downstream port of the PCIe switch 108 can be used to provide upstream routes from the PCIe endpoint device 114 to the source or destination computing machine. The ID routing table includes the BDF number associated with the PCIe endpoint device 114 and the BDF number associated with the emulated PCIe device.
[0046] The interrupt redirection table present in the PCIe switch 108 can be used to route the MSI/MSIX based interrupts generated from the PCIe endpoint device 114 to the source or destination computing machine on behalf of the emulated PCIe device. The interrupt redirection table includes the BDF number of the emulated PCIe device, MSIX address register of the emulated PCIe device and MSIX data register of the emulated PCIe device.
[0047] Referring now to FIG. 1B, the state of the system 100 during Virtual machine live migration is illustrated. As an example, the VMM 126 running on the source computing machine 104 of the system 100 receives a request for VM migration. On receiving the request for VM migration, the VMM 126 running on the source computing machine 104 notifies the initiation of migration process to the PCIe switch module 132 and suspends the guest OS 124 associated with the source computing machine 104. A request is sent to the PCIe endpoint device 114 by the PCIe switch module 132 to stop a Direct Memory Access (DMA) operation associated with the PCIe endpoint device 114. The DMA allows an Input / Output device to send or receive data directly to or from a physical address space of the computing machine by bypassing a Central Processing Unit (CPU), thereby speeding up memory operations.
[0048] In an embodiment, the guest OS 124 associated with the source computing machine 104 includes execution-related data. As an example, the execution-related data includes the CPU state, registers, physical memory pages used by the guest OS, and the like. Additionally, the guest OS 124 maintains a connection relationship with the emulated PCIe device. Hereinafter, the connection relationship between the emulated PCIe device and the guest OS 124 may be referred to as the first connection relationship, and the connection relationship between the emulated PCIe device and the PCIe endpoint device 114 may be referred to as the second connection relationship. A first set of connection relationship data associated with the first connection relationship includes Base Address Register (BAR) mapping information, DMA mapping information, and interrupt mapping information associated with the source computing machine 104.
[0049] The physical memory pages used by the guest OS 124 are transferred iteratively from the source computing machine 104 to the destination computing machine 106 before suspending the guest OS 124 running on the source computing machine 104. Additionally, the first set of connection relationship data and the remaining set of execution-related data are captured and stored in the source computing machine 104 by the VMM 126 associated with the source computing machine 104. The saved first set of connection relationship data and the remaining set of execution-related data are sent to the VMM 128 running on the destination computing machine 106 after suspending the guest OS 124.
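The capture-and-transfer of the first set of connection relationship data can be sketched as follows; the dictionary layout and the JSON encoding are purely illustrative assumptions about how a VMM implementation might package the data for the destination:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ConnectionRelationshipData:
    """First set of connection relationship data saved by the source VMM."""
    bar_mapping: dict        # BAR mapping information
    dma_mapping: dict        # DMA mapping information
    interrupt_mapping: dict  # interrupt mapping information

def package_for_destination(execution_data: dict,
                            relationship: ConnectionRelationshipData) -> bytes:
    """Serialize the execution-related data and the connection
    relationship data for transfer to the destination VMM."""
    payload = {"execution": execution_data,
               "relationship": asdict(relationship)}
    return json.dumps(payload).encode()

def unpack_at_destination(blob: bytes):
    """Recover both data sets on the destination computing machine."""
    payload = json.loads(blob.decode())
    return payload["execution"], ConnectionRelationshipData(**payload["relationship"])
```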
[0050] Before resuming the guest OS on the destination computing machine 106, a second PCIe device 130 is emulated and exposed to the destination computing machine 106 by the PCIe switch module 132. In an embodiment, the configuration of the second emulated PCIe device 130 is the same as that of the first emulated PCIe device 110.
[0051] In an embodiment, the second emulated PCIe device 130 is exposed to the destination computing machine 106 on behalf of the PCIe endpoint device 114. Also, the second emulated PCIe device 130 is enumerated in the PCIe hierarchy of the destination computing machine 106 with new Bus/Device/Function (BDF) identifiers and BAR register values.
[0052] During emulation and enumeration of the second emulated PCIe device 130, the destination computing machine 106 initializes the BAR address registers associated with the second emulated PCIe device 130. The VMM 128 running on the destination computing machine 106 creates the mapping between the new BAR address register values of the second emulated PCIe device 130 and the device configurations exposed by the VMM 128 to the guest OS 124 when directly assigning a PCIe device to the guest OS 124. The device configurations exposed by the VMM 128 to the guest OS 124 include the BAR register values. The migration of the guest OS 124 from the source computing machine 104 to the destination computing machine 106 happens in parallel with this process. The system 100 after migration of the guest OS 124 is illustrated in FIG. 1C. Here, the redirection tables 112 are updated with respect to the second emulated PCIe device 130 and the destination computing machine 106. It will be noted herein that the guest OS 124 running on the source computing machine 104 (to access the PCIe endpoint device 114 prior to the VM migration) continues to access the PCIe endpoint device 114 from the destination computing machine 106 after the completion of the VM migration.
[0053] In an embodiment, the second emulated PCIe device 130 is exposed to the destination computing machine 106. After the second emulated PCIe device 130 is exposed to the destination computing machine 106, a new mapping entry is established (or created) in the remapping hardware between the BDF (Bus/Device/Function) number of the second emulated PCIe device 130 and the guest physical address paging structures of the migrated guest OS 124. The guest physical address paging structures are used by the IOMMU to convert guest physical addresses to host physical addresses during DMA operations. The guest physical paging structures associated with each guest OS 124 contain the address mappings between the guest physical addresses and the host physical addresses. As an example, the remapping hardware refers to an IO Memory Management Unit (IOMMU) associated with the destination computing machine 106.
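A minimal sketch of this BDF-keyed DMA remapping follows; the page size, the dictionary-based paging structures and the class names are illustrative assumptions, not the IOMMU's actual data structures:

```python
PAGE_SIZE = 4096

class IommuModel:
    """Toy IOMMU: the requester ID (BDF number) of the second emulated
    PCIe device selects the guest physical address paging structures
    used to translate DMA addresses to host physical addresses."""

    def __init__(self):
        self.context_table = {}  # requester ID -> {guest frame: host frame}

    def map_device(self, requester_id: int, paging: dict) -> None:
        """Create the mapping entry between a BDF number and the
        guest physical address paging structures."""
        self.context_table[requester_id] = paging

    def translate(self, requester_id: int, guest_addr: int) -> int:
        """Translate a DMA guest physical address to a host physical
        address; a KeyError here models an IOMMU fault."""
        paging = self.context_table[requester_id]
        host_frame = paging[guest_addr // PAGE_SIZE]
        return host_frame * PAGE_SIZE + guest_addr % PAGE_SIZE
```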
[0054] In an embodiment, when the second emulated PCIe device 130 is enumerated in the destination computing machine 106, the PCIe switch module 132 establishes the second connection relationship between the second emulated PCIe device 130 and the PCIe endpoint device 114 by using the second set of connection relationship data associated with the first emulated PCIe device 110. The second set of connection relationship data includes the set of redirection tables 112. The set of redirection tables 112 includes the address trap table, the ID routing table and the interrupt redirection table. The address trap tables are utilized by the PCIe switch module 132 to trap and redirect the memory mapped / IO mapped BAR access requests targeting one of the emulated PCIe devices to the BAR address regions of the PCIe endpoint device 114 specified by the PCIe switch module 132. While establishing an address trap setting between the BAR address regions of the emulated PCIe device and the BAR regions of the PCIe endpoint device 114, the MSIX table region of the emulated PCIe device is excluded, and a separate address trap setting is created by the management server 102 in order to trap and capture the read/write requests that target the MSIX table region of the emulated PCIe device. The address trapping is further explained with reference to FIG. 2A.
[0055] Referring now to FIG. 2A, the address trap settings include a PCIe configuration space 204 of the second emulated PCIe device 130 and a PCIe configuration space 206 of the PCIe endpoint device 114. For the purpose of explanation and clarity of description, references will be made to FIGS. 1A-1C while explaining the address trap settings in FIG. 2A. The PCIe configuration space of the second emulated PCIe device 130 includes a BAR address space 208, and the PCIe configuration space of the PCIe endpoint device 114 includes a BAR address space 210. The BAR address spaces include a plurality of device specific registers, a MSI/MSIX table and a PBA table. Every entry in the MSIX table 202 includes a message address register, a message data register and a mask register. The address traps are performed by the PCIe switch 108 in the following scenarios: (i) to redirect a plurality of memory TLPs coming from the host computing machine towards the emulated PCIe device to the corresponding BAR regions in the PCIe endpoint device 114, wherein the MSIX tables are excluded from this mapping 212; (ii) to intercept and redirect the memory accesses coming towards the MSIX table region of the emulated PCIe device to the management server 102; and (iii) to redirect the emulated PCIe device PBA access by the host machine towards the PCIe endpoint device PBA table region 216.
[0056] The ID routing table includes the BDF number associated with the emulated PCIe device and the BDF number associated with the PCIe endpoint device 114. The BDF mapping is established between the BDF identifiers of the second emulated PCIe device 130 (refer, FIG. 1C) and the BDF identifiers of the PCIe endpoint device 114 (refer, FIG. 1C). For brevity of description, the BDF mapping tables hereinafter can be alternatively called as ID routing tables. As an example, the DMA traffic generated by the PCIe endpoint device 114 is redirected to the destination computing machine 106 by modifying the BDF identifier in the TLPs. Here, the BDF identifier of the PCIe endpoint device 114 is replaced with the BDF identifier of the second emulated PCIe device 130. Said redirection and modification of the BDF identifiers are carried out using the ID routing table associated with the PCIe switch 108.
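The requester ID replacement performed through the ID routing table can be sketched as follows; representing a TLP as a dictionary is an illustrative assumption:

```python
def route_upstream_tlp(tlp: dict, id_routing_table: dict) -> dict:
    """Rewrite an upstream DMA TLP: the BDF identifier of the PCIe
    endpoint device is replaced with the BDF identifier of the
    emulated PCIe device, per the ID routing table."""
    emulated_bdf = id_routing_table.get(tlp["requester_id"])
    if emulated_bdf is None:
        return tlp  # no routing entry: forward the TLP unchanged
    return {**tlp, "requester_id": emulated_bdf}
```

For instance, with an ID routing table of {0x0400: 0x0301}, a DMA TLP issued by endpoint BDF 0x0400 leaves the switch carrying the emulated device's BDF 0x0301, so the host sees the traffic as originating from the emulated PCIe device.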
[0057] In an embodiment, after establishing the second connection relationship, the first connection relationship associated with the migrated guest OS 124 and the second emulated PCIe device 130 is established. The first set of connection relationship data includes the DMA remapping and Interrupt remapping. The DMA remapping is performed using a remapping hardware associated with a hardware platform of the destination computing machine 106. As an example, the remapping hardware refers to an IO Memory Management Unit (IOMMU) associated with the destination computing machine 106. The remapping hardware translates the DMA requests received from the PCIe endpoint devices with the Guest Physical address ranges to the corresponding host physical addresses ranges. Here, the DMA remapping is performed using a BDF number of the second emulated PCIe device 130 and the guest physical address domain of the migrated guest OS 124.
[0058] In an embodiment, the physical memory pages that are utilized by the guest OS 124 can be modified by the DMA operation of the PCIe endpoint device 114. The usage of the guest physical address ranges for the DMA operation can be tracked using an emulated virtual IOMMU that is present on the guest OS. When the device driver running on the guest OS requests a DMA memory region, the request is forwarded to the virtual IOMMU driver and the corresponding mapping is created in the actual IOMMU hardware present in the destination computing machine 106. The IOMMU utilizes an address remapping table in order to translate the guest physical address to the host physical address during the DMA operation. A plurality of dirty bits present in the paging structures of the IOMMU on the host machine can be used by the VMM for tracking the modified memory pages during the DMA operation.
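The dirty-bit based tracking can be sketched as follows; the page-table-entry dictionary layout is an illustrative assumption:

```python
def collect_dirty_pages(paging_structures: dict) -> list:
    """Scan the dirty bits in the IOMMU paging structures to find the
    guest pages modified by DMA, then clear the bits so the next
    pre-copy iteration only reports newly modified pages."""
    dirty = [frame for frame, pte in paging_structures.items() if pte["dirty"]]
    for frame in dirty:
        paging_structures[frame]["dirty"] = False
    return dirty
```

The VMM would re-send only the returned pages in each iteration of the pre-copy transfer described in paragraph [0049].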
[0059] In an embodiment, the DMA remapping is followed by the interrupt remapping associated with the first set of connection relationship data. Here, the switch module 132 requests the VMM 128 of the destination computing machine 106 to reinitialize the interrupt vectors associated with the second emulated PCIe device 130. Here, the reinitialized interrupt vectors are the same as the interrupt vectors used by the guest OS 124 running on the source computing machine 104 before live migration.
[0060] In an embodiment, entries in the interrupt redirection table associated with the second set of connection relationship data are created by the switch module 132. The interrupt redirection table associated with the PCIe switch can be used to map the interrupt vectors used by the PCIe endpoint device with the corresponding interrupt vectors used by the second emulated PCIe device 130 on the destination computing machine 106. The entries in the interrupt redirection table include the PCIe requester ID of the second emulated PCIe device 130, the interrupt message addresses used by the second emulated PCIe device 130 and the interrupt message data used by the second emulated PCIe device 130. An interrupt redirection using the interrupt redirection table is further explained with reference to FIG. 2B.
[0061] Now, referring to FIG. 2B, the interrupt redirection involves the PCIe configuration space 204 of the second emulated PCIe device 130, the PCIe configuration space 206 of the PCIe endpoint device 114 and an interrupt redirection table 226. The PCIe configuration space 204 of the second emulated PCIe device 130 is associated with the MSIX capability structure of the second emulated PCIe device 130. Also, the PCIe configuration space 206 of the PCIe endpoint device 114 is associated with the MSIX capability structure of the PCIe endpoint device 114. The MSIX capability structure of the second emulated PCIe device 130 includes the address of the MSIX table and the Pending Bit Array (PBA) table associated with the second emulated PCIe device 130. The MSIX capability structure of the PCIe endpoint device 114 includes the address of the MSIX table and the Pending Bit Array (PBA) table associated with the PCIe endpoint device 114. Every entry in the MSIX table 202 of the second emulated PCIe device 130 includes an address, a data and a mask for each interrupt vector associated with the second emulated PCIe device 130. Each entry in the MSIX table 218 of the PCIe endpoint device 114 includes the address, the data and the mask for each interrupt vector associated with the PCIe endpoint device 114. The interrupt redirection table 226 includes the BDF number of the emulated PCIe device 220, the MSIX address register of the emulated PCIe device 222 and the MSIX data register of the emulated PCIe device 224.
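The interrupt redirection table 226 and its use can be modeled minimally as follows; the field and function names are illustrative, not the switch's actual interface:

```python
from dataclasses import dataclass

@dataclass
class InterruptRedirectEntry:
    emulated_bdf: int   # BDF number of the emulated PCIe device
    msix_address: int   # MSIX address register value of the emulated device
    msix_data: int      # MSIX data register value of the emulated device

def redirect_interrupt(table: list, index: int) -> dict:
    """Rebuild the memory-write TLP the host expects: the endpoint's
    captured interrupt write is reissued on behalf of the emulated
    PCIe device, using the indexed redirection table entry."""
    entry = table[index]
    return {"requester_id": entry.emulated_bdf,
            "address": entry.msix_address,
            "data": entry.msix_data}
```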
[0062] In an embodiment, the PBA table is used in the context where the interrupt vectors associated with the emulated PCIe device and the PCIe endpoint device 114 are masked by the corresponding VM. The PCIe endpoint device 114 can set bits in the PBA table indicating a requirement for the interrupts. During virtual machine live migration, the PBA table associated with the PCIe endpoint device 114 is mapped 216 with the PBA table associated with the emulated PCIe device.
[0063] In an embodiment, the MSIX table entries of the first emulated PCIe device 110 are initialized by the guest OS 124 running on the source computing machine 104. The PCIe switch module 132 initializes the MSIX table entries of the PCIe endpoint device 114. The PCIe switch module 132 creates the mapping between the MSIX table entries of the PCIe endpoint device 114 and the MSIX table entries of the first emulated PCIe device 110 using the interrupt redirection table present in the PCIe switch.
[0064] In an embodiment, the MSIX table entries of the second emulated PCIe device 130 are re-initialized by the VMM 128 running on the destination computing machine 106 during the migration time. The PCIe switch module 132 creates the mapping between the MSIX table entries of the PCIe endpoint device 114 and the MSIX table entries of the second emulated PCIe device 130 using the interrupt redirection table present in the PCIe switch. The MSIX address and data field entries of the PCIe endpoint device 114 are initialized to a special address range within the PCIe switch such that the memory write TLPs originating from the PCIe endpoint device targeting the special address range can be captured and redirected to the destination computing machine by utilizing the interrupt redirection table present in the PCIe switch 108. After the initialization of each entry in the MSIX table of the emulated PCIe device, the PCIe switch module 132 creates a mapping between the entries of the MSIX table of the PCIe endpoint device 114 and the matching vector entries of the MSIX table of the emulated PCIe device using the interrupt redirection table. When an entry in the MSIX table of the second emulated PCIe device 130 is re-initialized, the PCIe switch module 132 identifies the corresponding interrupt vector entry of the PCIe endpoint device 114 to which the re-initialized entry has to be mapped, and fills the entries in the interrupt redirection table. The address field associated with the MSIX table of the PCIe endpoint device 114 has a plurality of unused LSB bits depending on the address region to which these address field registers are initialized. In one embodiment, the unused LSB bits of the address field can be initialized in such a way that the LSB bits can index the interrupt redirection table. In another embodiment, the data field associated with the MSIX table can be initialized such that the data field can index 228 the interrupt redirection table 226.
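The data-field indexing variant of this embodiment can be sketched as follows; the special address value, the write layout and the function names are illustrative assumptions:

```python
def program_endpoint_msix(special_base: int, num_vectors: int) -> list:
    """Initialize the endpoint's MSIX entries so that every interrupt
    memory write targets the special address range inside the PCIe
    switch, with the data field carrying the redirection-table index."""
    return [{"address": special_base, "data": vector, "masked": False}
            for vector in range(num_vectors)]

def resolve_endpoint_interrupt(redirection_table: list, write: dict) -> dict:
    """Look up the redirection entry indexed by a captured write's data
    field and return the emulated device's MSIX address and data, which
    are then forwarded towards the destination computing machine."""
    entry = redirection_table[write["data"]]
    return {"address": entry["msix_address"], "data": entry["msix_data"]}
```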
[0065] In an embodiment, the interrupt vectors associated with the second emulated PCIe device 130 are re-initialized by the VMM 128 running on the destination computing machine 106. Here, the destination computing machine 106 assigns the same interrupt vectors to the guest OS as were used by the guest OS running on the source computing machine, by using the interrupt redirection table or interrupt posting table present in the IOMMU of the destination computing machine 106.
[0066] In an embodiment, the MSI interrupt redirection is the same as the MSIX interrupt redirection. The MSI address register and MSI data register associated with the MSI capability structure of the first emulated PCIe device 110 are initialized by the guest OS 124 running on the source computing machine 104. Also, the PCIe switch module 132 initializes the MSI address register and MSI data register in the MSI capability structure of the PCIe endpoint device 114. Here, the VMM 128 running on the destination computing machine 106 re-initializes the address and data fields in the MSI capability structure of the second emulated PCIe device 130 during the emulation of the second emulated PCIe device 130. The MSI address and data field entries of the PCIe endpoint device 114 are initialized to the special address range within the PCIe switch 108 such that memory write TLPs originating from the PCIe endpoint device 114 and targeting the special address region can be captured and redirected to the destination computing machine 106 with the help of the interrupt redirection table 226. The address field associated with the MSI capability structure has a plurality of unused LSB bits. In one embodiment, the unused LSB bits of the address field can be initialized in such a way that the LSB bits index the interrupt redirection table 226. In another embodiment, the data field associated with the MSI capability structure can be initialized such that the data field indexes the interrupt redirection table 226.
[0067] After the initialization of the interrupt redirection table 226 associated with the PCIe switch, the interrupt TLPs generated by the PCIe endpoint device 114 are redirected to the host machine on behalf of the emulated PCIe device.
[0068] After the re-initialization of the MSI capability structure associated with the second emulated PCIe device 130 for the destination computing machine 106, a new MSI/MSIX mapping is established using the interrupt redirection table 226. Here, the MSI/MSIX redirection is established between the MSI/MSIX address registers of the PCIe endpoint device 114 and the MSI/MSIX address registers of the second emulated PCIe device 130. The interrupts generated by the PCIe endpoint device 114 are redirected to the destination computing machine 106 on behalf of the second emulated PCIe device 130 by using the MSI/MSIX redirection table 226.
[0069] In an embodiment, the PCIe switch module 132 running on the management server 102 captures and responds to the configuration read and configuration write TLPs targeting the MSI/MSIX capability register space of the emulated PCIe device. In the case of MSI based interrupts, the second emulated PCIe device 130 acts as an intermediary between the PCIe endpoint device 114 and the host computing machine where the VM is running. Here, the PCIe switch module 132 captures the MSI based interrupts generated by the PCIe endpoint device 114 in the form of memory write TLPs targeting the special address range of the PCIe switch 108. The interrupt redirection table 226 is utilized by the PCIe switch 108 to map the MSI address and MSI data range of the PCIe endpoint device 114 to the MSI address and MSI data range of the emulated PCIe device.
[0070] In an embodiment, when the PCIe endpoint device 114 creates an MSIX interrupt by generating a memory write TLP targeting the address specified in its MSIX address register, the PCIe switch 108 intercepts the memory write TLP and replaces the TLP’s target address/data field values and the BDF register values with the corresponding field values in the interrupt redirection table 226 entry. The modified TLP is forwarded to the host machine on behalf of the emulated PCIe device.
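The intercept-and-rewrite step in this paragraph can be sketched as follows. The TLP is modeled here as a simple record with hypothetical field names; real memory write TLP headers carry many more fields.

```python
def rewrite_msix_tlp(tlp, redirection_table, idx_mask=0xF):
    """Rewrite an intercepted memory-write TLP so that it carries the
    emulated device's address, data and BDF (requester ID) values.

    `tlp` is modeled as a dict with 'addr', 'data' and 'bdf' keys, and
    each redirection-table entry as a dict of replacement values; both
    are illustrative simplifications of real TLP headers."""
    # the unused LSB bits of the trapped address select the table entry
    entry = redirection_table[tlp["addr"] & idx_mask]
    return {"addr": entry["addr"], "data": entry["data"], "bdf": entry["bdf"]}
```

The rewritten record would then be forwarded toward the host on behalf of the emulated device.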
[0071] In an embodiment, after establishing the second set of connection relationships, the VMM 128 running on the destination computing machine 106 performs a mapping between the second emulated PCIe device 130 and the device configurations exposed to the guest OS by the VMM 128. The device configurations exposed to the guest OS 124 by the VMM 128 include BAR address values, interrupt vectors and DMA mapping information. While establishing the first set of connection relationships between the second emulated PCIe device 130 and the PCIe endpoint device 114, these configurations are exposed to the guest OS by the VMM 128. The VMM 128 maps the new PCIe configuration register values of the second emulated PCIe device 130 to the old configurations exposed to the guest OS 124 for the first emulated PCIe device 110.
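A minimal sketch of pairing the old guest-visible configuration with the new register values of the second emulated device, assuming both configurations are represented as flat register-name-to-value maps (the register names are hypothetical):

```python
def build_config_remap(old_config, new_config):
    """Pair each register value exposed to the guest before migration with
    the corresponding value of the second emulated device, so that guest
    accesses can be translated after migration."""
    # the guest continues to see its old values; the VMM keeps the pairing
    return {reg: (old_config[reg], new_config[reg]) for reg in old_config}
```

For instance, an old guest-visible BAR value can be paired with the BAR value newly assigned to the second emulated device.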
[0072] In an embodiment, after establishing the first and the second connection relationships, the PCIe switch module 132 notifies the PCIe endpoint device 114 to re-enable the suspended DMA operation.
[0073] In an embodiment, after live migration of the guest OS 124 to the destination computing machine 106, the second emulated PCIe device 130 acts as a data movement engine between the PCIe endpoint device 114 and the destination computing machine 106. The PCIe switch module 132 can selectively redirect the configuration TLPs targeting the emulated PCIe device to the PCIe endpoint device 114. Also, the memory TLPs targeting the emulated PCIe device, originating from the host machine to which the emulated PCIe device is exposed, are selectively redirected to the PCIe endpoint device 114 according to the BAR mapping tables specified inside the PCIe switch 108.
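The BAR-based selective redirection of host-originated memory TLPs can be sketched as a window lookup. The table layout below is an assumption; the specification only states that BAR mapping tables inside the switch govern the redirection.

```python
def route_memory_tlp(addr, bar_map):
    """Selectively redirect a host-originated memory TLP that falls inside
    one of the emulated device's BAR windows to the corresponding offset
    in the real endpoint's BAR.

    `bar_map` entries are (emulated_base, size, endpoint_base) tuples,
    a hypothetical layout for the switch's BAR mapping table."""
    for emu_base, size, ep_base in bar_map:
        if emu_base <= addr < emu_base + size:
            return ep_base + (addr - emu_base)
    return None  # outside every emulated BAR window: not redirected
```

A TLP landing inside the emulated window is translated to the same offset within the endpoint's window; any other TLP passes through unmodified.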
[0074] FIG. 3A and FIG. 3B illustrate an example flow diagram 300 to enable PCIe device migration during virtual machine live migration, in accordance with an example embodiment of the present disclosure. The method 300 depicted in the flow chart may be executed by a system, for example, the system 100 of FIG. 1. Operations of the flowchart, and combinations of operations in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions, which embody the procedures described in various embodiments, may be stored by at least one memory device of a system and executed by one or more hardware processors in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embodies means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 300 are described with the help of the system 100. However, the operations of the method 300 can be described and/or practiced by using any other system.
[0075] At 302, the method 300 includes receiving a VM migration request at the VMM running on the source computing machine. At 304, the method 300 includes suspending the guest OS running on the source computing machine and notifying the receipt of the VM migration request to the PCIe switch module. At 306, the method 300 includes requesting the PCIe endpoint device to stop a Direct Memory Access (DMA) operation of the PCIe device. At 308, the method 300 includes capturing and saving execution-related data of the guest OS and connection relationship data between the first emulated PCIe device and the guest OS of the source computing machine. At 310, the method 300 includes sending the execution-related data and the connection relationship data to the VMM running on the destination computing machine. At 312, the method 300 includes emulating and exposing, using the PCIe switch module, the second emulated PCIe device to the destination computing machine by using the saved register values associated with the first emulated PCIe device. Here, the second emulated PCIe device has a similar configuration as that of the first emulated PCIe device. At 314, the method 300 includes initializing the BAR registers of the second emulated PCIe device to create a connection relationship between the second emulated PCIe device and the PCIe endpoint device. At 316, the method 300 includes creating the connection relationship between the second emulated PCIe device and the PCIe endpoint device using the address trap tables and the ID routing tables associated with the PCIe switch. At 318, the method 300 includes creating a DMA remapping in an IO Memory Management Unit (IOMMU) of the destination computing machine with a requester ID of the second emulated PCIe device and the guest physical address domain of the migrated guest OS.
At 320, the method 300 includes requesting, by the switch module running on the management server, the VMM running on the destination computing machine to reinitialize all the interrupt vectors of the second emulated PCIe device. At 322, the method 300 includes creating entries in the interrupt redirection table in the destination computing machine for the second emulated PCIe device by using the connection relationship data captured from the source computing machine. Here, the entries in the interrupt redirection table comprise the interrupt vectors used by the guest OS running on the source computing machine before live migration. At 324, the method 300 includes creating entries in an interrupt redirection table of the PCIe switch by the PCIe switch module to establish a connection relationship between the interrupt message addresses used by the second emulated PCIe device and the interrupt message addresses used by the PCIe endpoint device. At 326, the method 300 includes establishing a connection relationship between the second emulated PCIe device and the PCIe device configuration exposed to the VM by the VMM. At 328, the method 300 includes notifying, using the switch module, the PCIe endpoint device to re-enable the DMA operation, and completing the VM live migration.
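The ordered control flow of steps 302 through 328 can be summarized in a narrative sketch, with each string standing in for the real operation performed by the VMMs, the PCIe switch module and the endpoint device:

```python
def migration_steps():
    """Ordered sketch of steps 302-328 of the method 300; the step names
    are illustrative labels, not API calls of any real system."""
    return [
        "302_receive_vm_migration_request",
        "304_suspend_guest_and_notify_switch_module",
        "306_stop_endpoint_dma",
        "308_save_execution_and_connection_data",
        "310_send_data_to_destination_vmm",
        "312_emulate_second_pcie_device",
        "314_initialize_bar_registers",
        "316_program_address_trap_and_id_routing_tables",
        "318_create_iommu_dma_remapping",
        "320_reinitialize_interrupt_vectors",
        "322_fill_destination_interrupt_redirection_table",
        "324_fill_switch_interrupt_redirection_table",
        "326_map_device_configuration_to_guest",
        "328_reenable_dma_and_complete_migration",
    ]
```

The ordering matters: DMA is stopped before state capture (306 before 308) and re-enabled only after both connection relationships are in place (328 last).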
[0076] FIG. 4 illustrates an example flow diagram 400 for emulating a PCIe device inside the PCIe switch, in accordance with an example embodiment of the present disclosure. At 402, the method 400 includes synthesizing an emulated PCIe device with a configuration space and a BAR address space on behalf of the PCIe endpoint device using the PCIe switch module. At 404, the method 400 includes presenting the emulated PCIe device to the host computing machine by the management server by capturing and responding to egress configuration TLPs coming from the host computing machine towards the PCIe switch. At 406, the method 400 includes creating a connection relationship within the PCIe switch between the emulated PCIe device and the PCIe endpoint device using address trap tables, ID routing tables and interrupt redirection tables.
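The capture-and-respond behavior of step 404 can be sketched as a configuration-space handler. The synthesized configuration space is modeled as a plain offset-to-value map and the TLP as a small record; both are illustrative assumptions.

```python
def handle_config_tlp(config_space, tlp):
    """Capture and respond to a configuration TLP targeting the emulated
    device. `config_space` maps register offset -> value (a synthesized
    subset of the endpoint's registers); the TLP record is hypothetical."""
    if tlp["type"] == "cfg_read":
        # respond with the synthesized register value (0 if unimplemented)
        return config_space.get(tlp["offset"], 0)
    if tlp["type"] == "cfg_write":
        config_space[tlp["offset"]] = tlp["value"]
    return None
```

In this sketch the host never reaches the real endpoint's configuration space; the switch module answers on its behalf.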
[0077] Various embodiments disclose methods and systems for enabling PCIe device migration during VM live migration. In an embodiment, an application running on a guest OS is directly exposed to a PCIe endpoint device, and hence the performance and the efficiency of the PCIe device migration are enhanced. Additionally, in an embodiment, the process of virtual machine live migration is controlled and coordinated via a PCIe switch module embodied in a management server. Moreover, the various embodiments disclose utilizing various redirection tables that may include data for enabling efficient redirection of data between the different computing machines. For instance, an interrupt redirection table is built in an efficient manner so that the redirection of data between the different computing machines can be performed faster.
[0078] FIG. 5 is a block diagram of an exemplary computer system 501 for implementing embodiments consistent with the present disclosure. The computer system 501 may be implemented alone or in combination with components of the system 100 (refer, FIGS. 1A, 1B and 1C). Variations of computer system 501 may be used for implementing the devices included in this disclosure. Computer system 501 may comprise a central processing unit (“CPU” or “processor”) 502. Processor 502 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM’s application, embedded or secure processors, IBM PowerPC™, Intel’s Core™, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 502 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
[0079] Processor 502 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 503. The I/O interface 503 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
[0080] Using the I/O interface 503, the computer system 501 may communicate with one or more I/O devices. For example, the input device 504 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
[0081] Output device 505 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 506 may be disposed in connection with the processor 502. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
[0082] In some embodiments, the processor 502 may be disposed in communication with a communication network 508 via a network interface 507. The network interface 507 may communicate with the communication network 508. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 508 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 507 and the communication network 508, the computer system 501 may communicate with devices 509 and 510. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 501 may itself embody one or more of these devices.
[0083] In some embodiments, the processor 502 may be disposed in communication with one or more memory devices (e.g., RAM 513, ROM 514, etc.) via a storage interface 512. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
[0084] The memory devices may store a collection of program or database components, including, without limitation, an operating system 516, user interface application 517, user/application data 518 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 516 may facilitate resource management and operation of the computer system 501. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 517 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 501, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
[0085] In some embodiments, computer system 501 may store user/application data 518, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
[0086] Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
[0087] The specification has described systems and methods for enabling migration of PCIe devices during VM live migration. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0088] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0089] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.