CLAIMS
WE CLAIM:
1. A system comprising:
a last level cache (LLC) having a plurality of ways, each way allocated to one of multiple priorities, each priority associated with a class of service (CLOS) register specifying minimum and maximum number of ways to occupy;
a cache control circuit (CCC) to:
when an invalid cache line (CL) exists in the LLC, store an incoming CL having a requestor priority being one of the multiple priorities to the invalid CL;
when the requestor priority is a lowest of the multiple priorities and has an occupancy of one or more, or when the occupancy is at a maximum for the requestor priority, store the incoming CL in place of a least recently used (LRU) CL of the requestor priority;
when the occupancy is between a minimum and the maximum for the requestor priority, store the incoming CL in place of a LRU CL of the requestor or a lower priority;
when the occupancy is less than the minimum and a CL having the lower priority exists, store the incoming CL in place of a LRU CL having the lower priority; and
when no invalid CL or CL with the requestor or lower priority exists, store the incoming CL in place of a LRU CL of a higher priority.
2. The system of claim 1, wherein the LLC comprises multiple sets of ways, wherein the plurality of ways are part of one of the multiple sets, and wherein the CCC, before determining where to store the incoming CL, is to determine, based on a hashing algorithm performed on a logical address of the incoming CL, which of the multiple sets includes the incoming CL.
3. The system of claim 1, further comprising a cache monitor circuit to maintain heuristics related to LLC cache evictions, wherein, when greater than a threshold percentage of cache lines having a high priority are evicted to make room to fill an incoming CL with a lower priority, the CLOS register for the high priority is to be updated to increase the minimum and maximum ways to occupy.
4. The system of claim 1, wherein each of the plurality of ways comprises N CLs, wherein N is a positive integer being greater than or equal to one.
5. The system of claim 1, wherein the CCC, when storing the incoming CL in place of the LRU CL, if any, having the lower priority, is to cause flushing of other CLs, if any, in a way containing the LRU CL.
6. The system of claim 1, further comprising a processor incorporating the LLC and the CCC and comprising one or more cores, each of the one or more cores to implement a virtual machine, and wherein the CCC comprises a hypervisor.
7. The system of claim 6, wherein the processor is one of a plurality of processors in a data center of a cloud service provider.
8. A method performed by a cache control circuit (CCC) in a system comprising a last level cache (LLC) having a plurality of ways, each way allocated to one of multiple priorities, each priority associated with a class of service (CLOS) register specifying minimum and maximum number of ways to occupy, the method comprising:
receiving a request to store an incoming cache line (CL) having a requestor priority of the multiple priorities into the LLC;
when an invalid CL exists in the LLC, storing the incoming CL to the invalid CL;
when the requestor priority is a lowest of the multiple priorities and has an occupancy of one or more, or when the occupancy is at a maximum for the requestor priority, storing the incoming CL in place of a least recently used (LRU) CL of the requestor priority;
when the occupancy is between a minimum and the maximum for the requestor priority, storing the incoming CL in place of a LRU CL of the requestor or a lower priority;
when the occupancy is less than the minimum and a CL having the lower priority exists, storing the incoming CL in place of a LRU CL having the lower priority; and
when no invalid CL or CL with the requestor or lower priority exists, storing the incoming CL in place of a LRU CL of a higher priority.
9. The method of claim 8, wherein the LLC comprises multiple sets of ways, wherein the plurality of ways are part of one of the multiple sets, and wherein the CCC, before determining where to store the incoming CL, is to determine, based on a hashing algorithm performed on a logical address of the incoming CL, which of the multiple sets includes the incoming CL.
10. The method of claim 8, further comprising maintaining, using a LLC cache monitor circuit, heuristics related to LLC cache evictions, and, when greater than a threshold percentage of cache lines having a high priority are evicted to make room to fill an incoming CL with a lower priority, updating a CLOS register for the high priority to increase the minimum and maximum ways to occupy.
11. The method of claim 8, wherein each of the plurality of ways comprises N CLs, wherein N is a positive integer being greater than or equal to one.
12. The method of claim 8, wherein the CCC, when storing the incoming CL in place of the LRU CL, if any, having the lower priority, is to cause flushing of other CLs, if any, in a way containing the LRU CL.
13. The method of claim 8, wherein the system further comprises a processor incorporating the LLC and the CCC and having one or more cores, each of the one or more cores implementing a virtual machine, and wherein the CCC comprises a hypervisor.
14. The method of claim 13, wherein the processor is one of a plurality of processors in a data center of a cloud service provider.
15. A non-transitory computer-readable medium containing instructions to which a cache control circuit (CCC) in a system comprising a last level cache (LLC) having a plurality of ways, each way allocated to one of multiple priorities, each priority associated with a class of service (CLOS) register specifying minimum and maximum number of ways to occupy, is to respond by:
receiving a request to store an incoming cache line (CL) having a requestor priority of the multiple priorities into the LLC;
when an invalid CL exists in the LLC, storing the incoming CL to the invalid CL;
when the requestor priority is a lowest of the multiple priorities and has an occupancy of one or more, or when the occupancy is at a maximum for the requestor priority, storing the incoming CL in place of a least recently used (LRU) CL of the requestor priority;
when the occupancy is between a minimum and the maximum for the requestor priority, storing the incoming CL in place of a LRU CL of the requestor or a lower priority;
when the occupancy is less than the minimum and a CL having the lower priority exists, storing the incoming CL in place of a LRU CL having the lower priority; and
when no invalid CL or CL with the requestor or lower priority exists, storing the incoming CL in place of a LRU CL of a higher priority.
16. The non-transitory computer-readable medium of claim 15, wherein the LLC comprises multiple sets of ways, wherein the plurality of ways are part of one of the multiple sets, and wherein the CCC, in further response to the instructions, is to, before determining where to store the incoming CL, determine, based on a hashing algorithm performed on a logical address of the incoming CL, which of the multiple sets includes the incoming CL.
17. The non-transitory computer-readable medium of claim 15, wherein the system further comprises a processor incorporating the LLC and the CCC, and wherein the processor is one of a plurality of processors in a data center of a cloud service provider.
18. The non-transitory computer-readable medium of claim 15, wherein each of the plurality of ways comprises N CLs, wherein N is a positive integer being greater than or equal to one.
19. The non-transitory computer-readable medium of claim 15, wherein the CCC, when storing the incoming CL in place of the LRU CL, if any, having the lower priority, is to cause flushing of other CLs, if any, in a way containing the LRU CL.
20. The non-transitory computer-readable medium of claim 15, wherein the system further comprises a processor incorporating the LLC and the CCC, and comprising one or more cores, the one or more cores each implementing a virtual machine, and wherein the CCC comprises a hypervisor.
Dated this 11th day of September 2020
(Narendra Reddy Thappeta)
Registration Number: IN/PA-392
Of Law Firm of Naren Thappeta
Agent for Applicant
RELATED APPLICATION
[0001] The present application claims priority to U.S. Non-Provisional Patent Application No. 16/685,224 filed November 15, 2019 and titled “PARALLEL DECOMPRESSION MECHANISM” the entire disclosure of which is hereby incorporated by reference.
FIELD OF THE INVENTION
[0002] The field of invention relates generally to computer processor architecture, and, more specifically, to an improved flexible cache allocation technology (Flex-CAT) priority-based eviction algorithm for cache partitioning.
BACKGROUND
[0003] Multi-tenancy is a proven solution to achieve high system utilization and cost savings by space sharing. In cloud environments, multi-tenancy can be achieved by means of virtualization, where each core hosts a virtual machine (VM) executing user applications. Emerging computing paradigms such as Function as a Service (FaaS) employ containerization to execute multiple independent lightweight functions in containers. In a typical multitenant environment, high priority (HP) jobs coexist with low priority (LP) jobs on the same computing resource, such as a multicore processor or a core. HP jobs are latency-sensitive jobs, while LP jobs tend to have soft deadlines. Some HP jobs demand performance determinism in addition to low latency. Users submitting jobs enter into quality of service (QoS) service-level agreements (SLAs) with cloud service providers (CSPs) and depend on them to meet latency or performance determinism guarantees. CSPs need to meet SLAs by limiting performance variation, or even degradation of QoS, of HP jobs caused by other co-located LP jobs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
[0005] Figure 1 is a block diagram illustrating processing components for executing instructions, according to some embodiments;
[0006] Figure 2 is a block diagram illustrating a system including a multi-core system to execute virtual machines, according to some embodiments;
[0007] Figure 3 is an exemplary cache partitioning scheme, according to some embodiments;
[0008] Figure 4 is a block diagram illustrating cache line evictions, according to some embodiments;
[0009] Figure 5 is a block flow diagram illustrating a process performed by a cache control circuit in response to a cache fill request, according to some embodiments;
[00010] Figure 6 is a flow diagram illustrating a cache control circuit processing a cache fill request, according to some embodiments;
[0010] Figures 7A-7B are block diagrams illustrating a generic vector friendly instruction format and instruction templates thereof according to some embodiments of the invention;
[0011] Figure 7A is a block diagram illustrating a generic vector friendly instruction format and class A instruction templates thereof according to some embodiments of the invention;
[0012] Figure 7B is a block diagram illustrating the generic vector friendly instruction format and class B instruction templates thereof according to some embodiments of the invention;
[0013] Figure 8A is a block diagram illustrating an exemplary specific vector friendly instruction format according to some embodiments of the invention;
[0014] Figure 8B is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the full opcode field according to one embodiment;
[0015] Figure 8C is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the register index field according to one embodiment;
[0016] Figure 8D is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the augmentation operation field according to one embodiment;
[0017] Figure 9 is a block diagram of a register architecture according to one embodiment;
[0018] Figure 10A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to some embodiments;
[0019] Figure 10B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to some embodiments;
[0020] Figures 11A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip;
[0021] Figure 11A is a block diagram of a single processor core, along with its connection to the on-die interconnect network and with its local subset of the Level 2 (L2) cache, according to some embodiments;
[0022] Figure 11B is an expanded view of part of the processor core in Figure 11A according to some embodiments;
[0023] Figure 12 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to some embodiments;
[0024] Figures 13-16 are block diagrams of exemplary computer architectures;
[0025] Figure 13 shows a block diagram of a system in accordance with some embodiments;
[0026] Figure 14 is a block diagram of a first more specific exemplary system in accordance with some embodiments;
[0027] Figure 15 is a block diagram of a second more specific exemplary system in accordance with some embodiments;
[0028] Figure 16 is a block diagram of a System-on-a-Chip (SoC) in accordance with some embodiments; and
[0029] Figure 17 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to some embodiments.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0030] In the following description, numerous specific details are set forth. However, it is understood that some embodiments may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
[0031] References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a feature, structure, or characteristic, but every embodiment may not necessarily include the feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.
[0032] As mentioned above, cloud service providers (CSPs) need to meet service-level agreements (SLAs) by limiting performance variation and degradation of quality of service (QoS) of high priority (HP) jobs caused by co-located low priority (LP) jobs. In particular, disclosed embodiments describe Flexible Cache Allocation Technology (Flex-CAT), an architectural solution that limits HP last level cache (LLC) evictions caused by LP jobs. The Flex-CAT approach dynamically determines the optimal number of ways per set depending on the utilization of cache lines (CLs) in each cache set by various priorities. The bounds (minimum and maximum number of ways) are specified in model-specific registers (MSRs) that provide hints for selecting a victim for eviction at the time of LLC cache-fill.
[0033] Advantageously, Flex-CAT provides an easy-to-configure, yet flexible interface to specify cache partitions. Flex-CAT supports a dynamic cache partitioning scheme that makes priority-driven LLC eviction decisions based on real-time data to manage partitions at fine granularity. Flex-CAT helps to satisfy cloud service providers’ demands for architectural features to satisfy QoS guarantees such as performance determinism and isolation between high priority and low priority jobs. As used herein, low priority jobs are sometimes referred to as “noisy neighbors.”
[0034] Alternative, inferior approaches have attempted to address the disparity between HP and LP jobs sharing resources by assigning disjoint sets of ways to cores, and limiting HP and LP jobs to specific cores. But such approaches suffer from several disadvantages. For example, there is no notion of priority in such mechanisms. Some such approaches isolate HP jobs from LP jobs by allocating a dedicated set of ways across all cache sets and to HP cores, but such dedicated resources, when not working on HP workloads, cannot be utilized by LP workloads. Furthermore, some such approaches saturate certain limited cache sets (say, x sets) more than others (e.g., N-x sets, where N = total number of cache sets). Uniform way allocation to HP jobs driven by the saturation of these x sets can lead to overprovisioning across, and underutilization of, the remaining N-x sets. Also, assigning fewer than the maximum ways to cores reduces associativity, which causes higher conflict misses and leads to performance degradation. Approaches that use static allocation of fixed cache ways to cores leave no room for flexibility at the time of cache evictions and fills.
[0035] Disclosed embodiments, on the other hand, provide a flexible interface to dynamically specify priorities and cache partitions. Priorities are enumerated in ascending order. Flexible cache partitions can be specified in terms of the maximum and the minimum number of ways per priority. Unlike some other approaches, Flex-CAT does not require software to specify the exact cache ways allocated to each partition.
[0036] Supporting such dynamic priority and cache partition specifications are class of service (CLOS) registers that hold the following values per CLOS:
• CLOS Priority P: Pn bits
• Maximum number of ways occupied by priority P: mxwn bits
• Minimum number of ways occupied by priority P: mnwn bits
[0037] For example, let the maximum number of priorities be 4; therefore, Pn = log2(4) = 2 bits. Let the maximum number of ways be 16; therefore, mxwn = mnwn = log2(16) = 4 bits.
Let CLOS[0]: (0,12,4) and CLOS[1]: (1,4,1)
CLOS Priority MaxW (mxwn bits) MinW (mnwn bits)
0 0 12 4
1 1 4 1
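The register contents above can be modeled in software for illustration. The following is a minimal sketch only, not part of the disclosed hardware: the `ClosRegister` name and its Python representation are assumptions, mirroring the example values CLOS[0]: (0, 12, 4) and CLOS[1]: (1, 4, 1).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClosRegister:
    """Illustrative model of one CLOS register (hypothetical name)."""
    priority: int   # P, encoded in Pn bits (2 bits for 4 priorities)
    max_ways: int   # mxw, encoded in mxwn bits (4 bits for up to 16 ways)
    min_ways: int   # mnw, encoded in mnwn bits

# The two example registers: CLOS[0]: (0, 12, 4) and CLOS[1]: (1, 4, 1)
CLOS = [ClosRegister(priority=0, max_ways=12, min_ways=4),
        ClosRegister(priority=1, max_ways=4, min_ways=1)]
```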
[0038] According to embodiments disclosed herein, the requestor is the owner of the CL that is expected to be filled into the LLC. Let the requestor’s priority be PF. PL is the lowest priority and PH is the highest priority in the system. Let loc be the final location for the requestor’s CL as determined by Flex-CAT. The requestor’s occupancy O[PF] is the number of CLs occupied by the requestor in the indexed cache set.
[0039] Flex-CAT is a new eviction algorithm that performs priority-driven cache partitioning at cache-set granularity. It prioritizes LP CLs for eviction as long as the requestor’s occupancy is under its maximum way allocation (mxw). When the requestor’s occupancy reaches its maximum allocation, Flex-CAT prioritizes self-evictions over evictions of other priorities to stay within the partition bounds. In the limited scenarios when no victim can be found in the earlier two steps, Flex-CAT picks an HP CL for eviction to find a home for the incoming cache fill. This foundational idea of Flex-CAT is depicted in Figure 4.
[0040] The detailed algorithm is depicted in the flowcharts of Figures 4-6, and described below. At the time of LLC fill, disclosed embodiments rely on a conventional hashing algorithm to determine the cache set index for the requestor’s incoming cache line.
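As one illustration of this indexing step, a simple power-of-two modulo scheme is sketched below. The actual conventional hashing algorithm is not specified in this document; the function name and the 64-byte line size are assumptions for illustration only.

```python
def set_index(address: int, num_sets: int, line_bytes: int = 64) -> int:
    """Hypothetical set-index computation: drop the within-line offset
    bits, then fold the line address into the available sets."""
    return (address // line_bytes) % num_sets
```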
[0041] After indexing into the appropriate cache set, Flex-CAT first searches for an invalid LLC entry in the indexed cache set. If the cache set is full and no invalid location is found, Flex-CAT determines a victim CL that needs to be evicted out of the LLC. This ensures that Flex-CAT is only enabled for cache sets that are saturated and that no workloads are unnecessarily penalized when there is no contention. Flex-CAT scans the entire cache set and determines the LRU CL’s index, its age, and the occupancy for each priority in the system.
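The scan and eviction ladder described above (and enumerated in claim 1) can be sketched in software as follows. This is an illustrative model only, not the cache control circuit itself; the function and sentinel names are assumptions, and tie-breaking choices the text leaves open (e.g., preferring the lowest-priority victim present) are picked here for concreteness.

```python
INVALID = None  # sentinel for an empty (invalid) way

def select_victim(cache_set, requestor_prio, clos):
    """Return the way index to fill for a request of priority requestor_prio.

    cache_set: list of (priority, age) tuples, or INVALID for empty ways;
               a larger age means less recently used.
    clos: dict mapping priority -> (min_ways, max_ways) for that priority.
    """
    # Step 1: an invalid (empty) way is always used first -- no eviction.
    for i, cl in enumerate(cache_set):
        if cl is INVALID:
            return i

    # One scan: per-priority occupancy and LRU (largest-age) way index.
    occupancy, lru = {}, {}
    for i, (prio, age) in enumerate(cache_set):
        occupancy[prio] = occupancy.get(prio, 0) + 1
        if prio not in lru or age > cache_set[lru[prio]][1]:
            lru[prio] = i

    mnw, mxw = clos[requestor_prio]
    occ = occupancy.get(requestor_prio, 0)
    lowest = min(clos)  # priorities are enumerated in ascending order

    # Step 2: lowest priority with any occupancy, or occupancy at its
    # maximum -> self-evict the requestor's own LRU line.
    if (requestor_prio == lowest and occ >= 1) or occ >= mxw:
        return lru[requestor_prio]

    lower = [p for p in lru if p < requestor_prio]

    # Step 3: occupancy between min and max -> the requestor's or a lower
    # priority's LRU line (a lower-priority victim is preferred here).
    if mnw <= occ < mxw:
        return lru[min(lower)] if lower else lru[requestor_prio]

    # Step 4: under the minimum -> take a lower-priority victim if any exists.
    if lower:
        return lru[min(lower)]

    # Step 5: no invalid, requestor-priority, or lower-priority line ->
    # evict a higher-priority LRU line as a last resort.
    higher = [p for p in lru if p > requestor_prio]
    return lru[min(higher)]
```

For example, with CLOS as in paragraph [0037] (priority 0: min 4/max 12; priority 1: min 1/max 4), a priority-1 fill into a full set holding two priority-0 and two priority-1 lines evicts the priority-0 LRU line, since the requestor's occupancy is between its bounds.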
[0042] When requestor’s occupancy is under minimum allocation (O[PF]