
Method For Multiple Store Buffer Forwarding In A System With A Restrictive Memory Model And System And Apparatus Therefor

A system and method for multiple store buffer forwarding in a system with a restrictive memory model is disclosed. The system comprises: a processor [310(1)-310(n)]; a system memory [340] coupled to the processor; and a non-volatile memory [370] coupled to the processor in which is stored an article of manufacture comprising instructions adapted to be executed by the processor, the instructions which, when executed, encode instructions in an instruction set to enable multiple store buffer forwarding in a system with a restrictive memory model. The method comprises: executing a plurality of store instructions; executing a load instruction; determining that a memory region addressed by the load instruction matches a cacheline address in a memory; determining that data stored by the plurality of store instructions completely covers the memory region addressed by the load instruction; and transmitting a store forward is OK signal.


Patent Information

Application #:
Filing Date: 08 May 2003
Publication Number: 22/2006
Publication Type:
Invention Field: ELECTRONICS
Status:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2009-07-01
Renewal Date:

Applicants

INTEL CORPORATION
2200 MISSION COLLEGE BOULEVARD, SANTA CLARA, CA

Inventors

1. THATCHER LARRY E
11507 DK RANCH ROAD, AUSTIN, TX 78759
2. BOATRIGHT BRYAN D
12828 WITHERS WAY, AUSTIN, TX 78727
3. PATEL RAJESH
9313 SILK OAK COVE, AUSTIN, TX 78748

Specification

METHOD FOR MULTIPLE STORE BUFFER FORWARDING IN A SYSTEM
WITH A RESTRICTIVE MEMORY MODEL AND SYSTEM
AND APPARATUS THEREFOR
FIELD OF INVENTION
The present invention relates to a method for multiple store buffer forwarding in a system
with a restrictive memory model, and to a system and apparatus therefor.
Background
Many modern microprocessors implement store buffer forwarding, a mechanism that
improves microprocessor performance by completing a younger dependent load operation with data
from an older, completely overlapping store operation. This forwarding can occur while the store operation
is speculative or has passed the point of speculation and is part of the committed machine state. In either
case, the load operation's execution is delayed minimally when it can read its data directly from the buffer
without waiting for that data to become globally observed (GO). For multiple store operations, prior
processors, such as the Intel® Pentium® III, and the related instruction set architectures (ISAs) that run on
these processors have stalled the execution of the load operation until the older multiple store operations
become globally observed. The Intel® Pentium® III is manufactured by Intel Corporation of Santa Clara,
California.
Because store buffer forwarding has implications on the order in which all processes in a multi-
threaded or multi-processor system observe store operations from the other processes, a processor
architecture must carefully specify the rules under which store buffer forwarding may occur. For example,
the Intel® Architecture 32-bit ISA (IA-32) product family has essentially implemented the Scalable
Processor Architecture (SPARC®) total store order (TSO) memory model from SPARC International Inc.™
of Santa Clara, California.
The TSO memory model has two restrictions related to store buffer forwarding:
1. A younger load operation may only receive forwarded data
from a single older store buffer entry; and
2. The older store buffer entry must completely cover the region
of memory being read by the younger load operation.
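As an illustration, the two restrictions amount to the following predicate over a load and the store buffer. This is a hedged sketch: representing entries as (start, length) byte ranges, and the function name itself, are assumptions for exposition, not part of the ISA definition:

```python
def tso_forward_source(load_start, load_len, store_entries):
    """Return the single store buffer entry that may forward data to the
    load under the TSO rules, or None if forwarding is not permitted.

    store_entries: list of (start, length) byte ranges, oldest first.
    """
    load_end = load_start + load_len
    for start, length in reversed(store_entries):  # youngest matching store wins
        # Restriction 2: a candidate entry must completely cover the load.
        if start <= load_start and load_end <= start + length:
            return (start, length)
    # Restriction 1: data may not be combined from multiple entries, so if
    # no single entry covers the load, forwarding is disallowed.
    return None
```

Under this predicate, two adjacent 2-byte stores never satisfy a 4-byte load, even though together they supply every byte it needs.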
Many existing IA-32 code sequences produce situations in which these two TSO restrictions
considerably degrade the performance of the processor. When a typical IA-32 processor executes a load
operation that encounters one of the conditions listed above, the processor stalls the load operation's
execution until the offending condition clears. While waiting for the contents of the store buffer entry to
become GO, the load operation and all instructions that are dependent on the load operation are stalled, thus
reducing processor performance.
Accordingly, the present invention provides a method for multiple store buffer forwarding in
a system with a restrictive memory model, the method comprising: executing a plurality of store
instructions; executing a load instruction; determining that a memory region addressed by the load
instruction matches a cacheline address in a memory; determining that data stored by the plurality
of store instructions completely covers the memory region addressed by the load instruction; and
transmitting a store forward is OK signal.
The present invention also provides a processor system, comprising: a processor; a
system memory coupled to the processor; and a non-volatile memory coupled to the processor in
which is stored an article of manufacture comprising instructions adapted to be executed by the
processor, the instructions which, when executed, encode instructions in an instruction set to
enable multiple store buffer forwarding in a system with a restrictive memory model, the article of
manufacture comprising instructions to: execute a plurality of store instructions; execute a load
instruction; determine that a memory region addressed by the load instruction matches a cacheline
address in a memory; determine that data stored by the plurality of store instructions completely
covers the memory location in the memory specified by the load instruction; and transmit a store
forward is OK signal.
The present invention further provides a multiple store buffer forwarding apparatus,
comprising: a processor having a write combining buffer, and a non-volatile memory coupled to the
processor, said non-volatile memory storing instructions which when executed by the processor
cause the processor to: execute a plurality of store instructions referencing a first memory region;
execute a load instruction referencing a second memory region; determine that the second
memory region matches a cacheline address; determine that the first memory region completely
covers the second memory region; and transmit a store forward is OK signal.

The present invention still further provides a multiple store buffer forwarding apparatus,
comprising: a memory; a processor coupled to said memory and having a write combining buffer,
said processor to execute a plurality of store instructions referencing a first memory region of said
memory; execute a load instruction referencing a second memory region of said memory;
determine that the second memory region matches a cacheline address; determine that the first
memory region completely covers the second memory region; and transmit a signal indicating that
store buffer forwarding is authorized.
The present invention still further provides a machine-readable medium having stored
thereon a plurality of executable instructions for multiple store buffer forwarding in a system with a
restrictive memory model, the plurality of instructions adapted to cause a processor to perform
instructions to: execute a plurality of store instructions; execute a load instruction; determine that a
memory region addressed by the load instruction matches a cacheline address in a memory;
determine that data stored by the plurality of store instructions completely covers the memory
location in the memory specified by the load instruction; and transmit a store forward is OK signal.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
FIG. 1 is a flow diagram of a method for providing multiple store buffer forwarding, in accordance
with an embodiment of the present invention.
FIG. 2 is a schematic block diagram of a write combining buffer configuration in which multiple
store buffer forwarding can be implemented, in accordance with an embodiment of the present invention.
FIG. 3 is a block diagram of a computer system in which multiple store buffer forwarding can be
implemented, in accordance with an embodiment of the present invention.
Detailed Description
In accordance with an embodiment of the present invention, the system and method provide a novel
mechanism to allow load operations that are completely covered by two or more store operations to receive
data via store buffer forwarding in such a manner as to retain the side effects of the two TSO restrictions
thereby increasing processor performance without violating the restrictive memory model. For greater
clarity, the problem can be described using some sample program executions.
Archetypical Store Buffer Forwarding Example
The following sample code segment demonstrates the type of store buffer forwarding that the TSO
memory model specifically allows:
MOV EBX, 0x1000       ; set up the base address
MOV [EBX], 0x00001234 ; store 4 bytes (0x00001234) to location 0x1000
...                   ; zero or more other instructions
MOV EAX, [EBX]        ; load 4 bytes from location 0x1000
In this example, store buffer forwarding hardware is permitted to bypass the store data value (0x00001234)
directly to the younger load operation even before other processors observe the store data value. Note that
the above single store operation's data completely overlaps, that is, covers, the memory region that the load
operation references. In other words, in this example, the same four bytes in the memory that are used by
the store operation to store the data value are used by the load operation.
Disallowed Store Buffer Forwarding Example
This sample code segment demonstrates a type of store buffer forwarding that is not allowed in the
IA-32 restrictive memory model (TSO):
MOV EBX, 0x1000       ; set up the base address
MOV [EBX], 0x0000     ; store 2 bytes (0x0000) to location 0x1000
MOV [EBX+2], 0x1234   ; store 2 bytes (0x1234) to location 0x1002
...                   ; zero or more other instructions
MOV EAX, [EBX]        ; load 4 bytes from location 0x1000
In this example, it would be desirable to forward store data from the two store operations to the single load
operation, since each store operation provides half of the data required by the load operation. Unfortunately,
this multiple store buffer forwarding is specifically disallowed in the TSO memory model, and, therefore,
the load operation must wait to complete until the two store operations become globally observed (GO).
Multiple store buffer forwarding is disallowed by the TSO memory model to ensure a reliable and consistent
order in which store operations can become globally observed by other processors in the system.
The problem that store forwarding from multiple store buffer entries can cause is illustrated with
the following code sequences, which are executed on two separate processors in a multi-processor system:





As shown above, load operation P0_5 now sees a result consistent with the TSO memory model.
Embodiments of the present invention, therefore, provide a mechanism to forward data from multiple store
buffer entries in such a way that the invention guarantees that those store buffer entries update system
memory (that is, become globally observed) at a single point in time.
In accordance with an embodiment of the present invention, the system and method can make use
of a write combining buffer (WCB). A WCB is a structure that usually contains buffering for several
cachelines of data and control logic that allows individual store operations to "combine" into a larger unit of
data, which is generally equal in size to the size of the cachelines. When this unit of data is written back to
system memory or to a processor cache, it is done so atomically and with a single transaction. This improves
processor performance by conserving bus transaction and cache bandwidth, because, without a WCB, each
individual store operation would have to access the system bus or cache port separately. This would use
system resources less efficiently than if a number of store operations could be pre-grouped into a single
transaction.
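A toy model may clarify the combining behavior just described. This is an illustrative sketch, not the patent's hardware: the class name is invented, and a 64-byte cacheline is assumed. Individual stores merge into one line-sized entry, and the whole line is written back in a single atomic transaction:

```python
CACHELINE = 64  # bytes per WCB entry; an assumed cacheline size

class WCBEntry:
    """Toy model of one write combining buffer entry (illustrative only)."""

    def __init__(self, line_addr):
        self.line_addr = line_addr        # cacheline-aligned base address
        self.data = bytearray(CACHELINE)  # combined store data
        self.valid = [False] * CACHELINE  # per-byte "written by a store" bits

    def store(self, addr, value_bytes):
        """Combine one individual store operation into this entry."""
        off = addr - self.line_addr
        for i, b in enumerate(value_bytes):
            self.data[off + i] = b
            self.valid[off + i] = True

    def writeback(self):
        """The whole line becomes globally observed in one atomic transaction."""
        return self.line_addr, bytes(self.data)
```

In this model, the two 2-byte stores of the earlier example land in one entry, so they would leave the buffer, and become globally observed, together rather than as two separate bus transactions.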
FIG. 1 is a flow diagram of a method for providing multiple store buffer forwarding, in accordance
with an embodiment of the present invention. In FIG. 1, multiple store instructions can be executed 110,
generally, to combine data values from the multiple store instructions into a single value. In accordance with
an embodiment of the present invention, this single value can be combined into an entry in a processor
memory, such as an entry in a WCB, which can remain invisible to all other processors in the system until
all of the multiple store instructions have completed executing. A load instruction can be executed 120 to
load data from a memory coupled to the system. A check can be performed to determine whether a memory
region addressed by the load instruction matches any cacheline addresses stored in the WCB 130. If a match
is found, then a check can be performed to determine whether all of the memory region addressed by the
load instruction is covered by data that has been stored by the multiple store instructions 140. If all of the
memory region is covered by the data from the multiple store instructions, then a store forward is "OK"
signal can be generated 150. At this point the load instruction can complete executing by reading the data
from the WCB entry.
If a match is not found, then the method is not needed and can end. If all of the
memory region is not covered by data from the multiple store instructions, then the method can return to
execute 120 the load instruction again. In general, the previous load instruction execution 120 and all
subsequent dependent instructions are flushed from the system before the load instruction can be executed
again.
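The flow of FIG. 1 can be sketched as a single lookup function. The representation of the WCB as a dictionary, the 64-byte line size, and all names here are illustrative assumptions, not the patent's implementation:

```python
def try_multi_store_forward(load_addr, load_len, wcb):
    """Sketch of the FIG. 1 flow: return the forwarded bytes, or None if the
    load must instead be handled by the ordinary (non-forwarding) path.

    wcb: dict mapping cacheline-aligned address -> (data bytes, valid bit list).
    """
    line_addr = load_addr & ~63                # 64-byte cacheline assumed
    if line_addr not in wcb:                   # step 130: no cacheline match
        return None                            # method not needed; end
    data, valid = wcb[line_addr]
    off = load_addr - line_addr
    if not all(valid[off:off + load_len]):     # step 140: region not fully covered
        return None                            # flush and re-execute the load
    return data[off:off + load_len]            # step 150: store forward is OK
```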
In accordance with an embodiment of the present invention, new hardware can be added to the
WCB to implement the functionality required to allow store forwarding from multiple store operations
without violating the spirit and intent of the IA-32 TSO memory model. In general, the two requirements
that are necessary to achieve this functionality include:
1. The store operation(s) must completely cover the memory region addressed by
the younger load operation; and
2. The store operation(s) that forward data to the load operation must become
globally observed at the same time (that is, atomically).
In accordance with an embodiment of the present invention, the system can achieve the first
requirement by adding hardware to the WCB buffer that compares, on a byte-by-byte basis, the region of
memory addressed by the load operation to the available store data in the WCB. If it is acceptable to
forward the data, the data is read to satisfy the load operation. In accordance with other embodiments of the
present invention, the data can be read from any memory containing the data, for example, the WCB, a store
buffer, another buffer, a register, a memory, or a cache memory. FIG. 2 is a schematic block diagram of a
write combining buffer configuration in which multiple store buffer forwarding can be implemented, in
accordance with an embodiment of the present invention.
In FIG. 2, a cacheline address comparison component 205 can be coupled to a WCB Address and
Data Buffer 210. The cacheline address comparison component 205 can be configured to receive an
incoming load operation 202 and compare a cacheline address in the incoming load operation 202 with the
cacheline addresses of existing entries in the WCB Address and Data Buffer 210. The WCB Address and
Data Buffer 210 can be coupled to a WCB data valid bit vector buffer 220, which indicates the data bytes in
the WCB Address and Data Buffer 210 that have been written by the store operations. In an embodiment of
the present invention, the data entries in both the WCB Address and Data Buffer 210 and the WCB data
valid bit vector buffer 220 are generally of equal size, for example, 64 data bytes per entry. The WCB data
valid bit vectors buffer 220 can be connected to a Multiplexer 230, which can be configured to receive a data
valid bit vector from the WCB data valid bit vector buffer 220 and configured to receive load operation
address bits from the incoming load operation 202. For example, the Multiplexer 230 can be a 4:1
multiplexer, which can be configured to select a group of 16 store byte valid bits in the data valid bit vector
from an 8-bit boundary specified by the load operation address bits. The Multiplexer 230 can be coupled to
a comparison circuit 240, which can be configured to receive the group of 16 store byte valid bits on line 232
and an incoming load operation byte mask on line 235 and generate a "store forward OK" signal if the store
instruction data completely covers the WCB entry referenced by the incoming load operation. The heavy
lines in the comparison circuit 240 indicate a 16-bit data path. The comparison circuit 240 can include a
16-bit inverter logic 242, which can be coupled to a first 16-bit AND-OR logic 250. The inverter 242 can be

configured to receive the group of 16 store byte valid bits selected by the Multiplexer 230 on line 232 and
output an inverted version of the group of 16 store byte valid bits onto line 244. The 16-bit AND-OR logic
can be implemented with 16 AND gates, which can be configured to receive one bit from either the group of
16 store byte valid bits on line 232 or an inverted bit from the group of 16 store byte valid bits on line 244
and one bit from the incoming load operation byte mask on line 235. The first 16-bit AND-OR logic 250
can be configured to receive the inverted 16 store byte valid bits on line 244 and can be coupled to a NAND
gate 270 and the first 16-bit AND-OR logic 250 can be configured to receive the incoming load operation
byte mask on line 235. A second AND-OR 16-bit logic 260 also can be coupled to the NAND gate 270 and
the second 16-bit AND-OR logic 260 can be configured to receive the group of 16 store byte valid bits
selected by the Multiplexer 230 on line 232 and the incoming load operation byte mask on line 235. The
NAND gate 270 is configured to receive a signal from each of the first and second 16-bit AND-OR logics
250 and 260 on lines 252 and 262, respectively, and generate the "store forward OK" signal if the store
instruction data completely covers the WCB entry referenced by the incoming load operation.
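The net function of the FIG. 2 comparison circuit is that store forwarding is acceptable exactly when every byte requested by the load's byte mask is marked valid in the selected group. A minimal sketch over 16-bit masks follows; the function name and the direct bitwise formulation are illustrative assumptions, whereas the patent realizes the same predicate with the inverter, AND-OR arrays, and NAND gate described above:

```python
def store_forward_ok(valid_bits: int, load_mask: int) -> bool:
    """valid_bits: the 16 store byte valid bits selected by the multiplexer.
    load_mask:  the 16-bit incoming load operation byte mask.

    Forwarding is acceptable only if no byte requested by the load
    (a 1 bit in load_mask) is missing from the valid bits, i.e. the
    combined store data completely covers the bytes the load reads.
    """
    MASK16 = 0xFFFF
    # Invert the valid bits (the role of inverter 242) and AND with the load
    # byte mask: a nonzero result means some requested byte was never stored.
    missing = (~valid_bits & MASK16) & load_mask
    return load_mask != 0 and missing == 0
```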
The present invention satisfies the second requirement by taking advantage of the fact that each
WCB is sized to match the system coherence granularity (that is, the cacheline size). All individual store
operations that combine into a singular WCB entry become globally observed at the same time.
Embodiments of the present invention make use of this fact to enable multiple store buffer forwarding in a
system with a restrictive memory model, thus, improving the performance of the system.
Embodiments of the present invention can significantly improve the performance of code that has a
high occurrence of instruction sequences in which a load operation is dependent on two or more temporally
close store operations. Existing IA-32 processors, as well as other processors with restrictive memory
models, traditionally stall the execution of the load operation until the offending store operations leave the
store buffer and become globally observed. An embodiment of the present invention improves on this
method by stalling the load operation only until the store operations are accepted into the WCB, which
removes the restriction that the store operations must become globally observed (a process that can take
many hundreds of cycles) before the load operation can be satisfied.
This is not just a theoretical performance improvement since many compilers produce situations in
which the present invention is likely to be beneficial. For example, compilers can usually provide an
optimization to remove as many instances of stalling the load operations as possible. Unfortunately, these
optimizations usually result in a larger memory footprint for the program, as well as slower execution times,
which result in the compiler optimizations not always being used. Therefore, using embodiments of the
present invention can result in faster, more efficient, and more compact programs.
One apparently common situation in which the compiler cannot optimize the problem away is in
multimedia code which regularly generates code sequences having two 32-bit store operations that are
immediately followed by one completely overlapping 64-bit load operation. This sequence is used to move
data from the integer register file to the MMX (floating point) register file. This invention has the potential
to greatly improve the user's "Internet experience" and to enhance other multimedia applications as well.
FIG. 3 is a block diagram of a computer system 100 that is suitable for implementing the present
invention. In FIG. 3, the computer system 100 includes one or more processors 310(1)-310(n) coupled to a
processor bus 320, which can be coupled to a system logic 330. Each of the one or more processors 310(1)-
310(n) is an N-bit processor and can include one or more N-bit registers (not shown). The system logic 330
can be coupled to a system memory 340 through bus 350 and coupled to a non-volatile memory 370 and one
or more peripheral devices 380(1)-380(m) through a peripheral bus 360. The peripheral bus 360 represents,
for example, one or more Peripheral Component Interconnect (PCI) buses, PCI Special Interest Group (SIG)
PCI Local Bus Specification, Revision 2.2, published December 18, 1998; industry standard architecture
(ISA) buses; Extended ISA (EISA) buses, BCPR Services Inc. EISA Specification, Version 3.12, published 1992;
published 1992; universal serial bus (USB), USB Specification, Version 1.1, published September 23, 1998;
and comparable peripheral buses. Non-volatile memory 370 may be a static memory device such as a read
only memory (ROM) or a flash memory. Peripheral devices 380(1)-380(m) include, for example, a
keyboard; a mouse or other pointing devices; mass storage devices such as hard disk drives, compact disc
(CD) drives, optical disks, and digital video disc (DVD) drives; displays and the like.
In accordance with an embodiment of the present invention, a method for multiple store buffer
forwarding in a system with a restrictive memory model includes executing multiple store instructions,
executing a load instruction, determining that a memory region addressed by the load instruction matches a
cacheline address in a memory, determining that data stored by the multiple store instructions completely
covers the memory region addressed by the load instruction, and transmitting a store forward is OK signal.

In accordance with an embodiment of the present invention, a machine-readable medium having
stored thereon multiple executable instructions for multiple store buffer forwarding in a system with a
restrictive memory model, the multiple instructions include instructions to: execute multiple store
instructions, execute a load instruction, determine that a memory region addressed by the load instruction
matches a cacheline address in a memory, determine that data stored by the multiple store instructions
completely covers the memory location in the memory specified by the load instruction, and transmit a store
forward is OK signal.
In accordance with an embodiment of the present invention, a processor system includes a processor, a
system memory coupled to the processor, and a non-volatile memory coupled to the processor in which is
stored an article of manufacture including instructions adapted to be executed by the processor, the
instructions which, when executed, encode instructions in an instruction set to enable multiple store buffer
forwarding in a system with a restrictive memory model. The article of manufacture includes instructions to:
execute multiple store instructions, execute a load instruction, determine that a memory region addressed by
the load instruction matches a cacheline address in a memory, determine that data stored by the multiple
store instructions completely covers the memory location in the memory specified by the load instruction,
and transmit a store forward is OK signal.
It should, of course, be understood that while the present invention has been described mainly in
terms of microprocessor- and multi-processor-based personal computer systems, those skilled in the art will
recognize that the principles of the invention may be used advantageously with alternative embodiments
involving other integrated processor chips and computer systems. Accordingly, all such implementations
which fall within the spirit and scope of the appended claims will be embraced by the principles of the
present invention.

WE CLAIM:
1. A method for multiple store buffer forwarding in a system with a restrictive
memory model, the method comprising:
executing a plurality of store instructions;
executing a load instruction;
determining that a memory region addressed by the load instruction matches a
cacheline address in a memory;
determining that data stored by the plurality of store instructions completely covers
the memory region addressed by the load instruction; and
transmitting a store forward is OK signal.
2. The method as claimed in claim 1, wherein executing the plurality of store
instructions comprises:
performing a plurality of store operations to store a plurality of data values in
contiguous memory locations in the memory, wherein the size of the contiguous memory
locations equals the size of the memory region addressed by the load instruction.
3. The method as claimed in claim 2, wherein executing a load instruction
comprises:
loading the data from the contiguous memory locations in the memory; and
generating the store forward is OK signal.
4. The method as claimed in claim 3, wherein loading the data from the
contiguous memory locations in the memory begins after performing the plurality of store
operations begins, and loading the data from the contiguous memory locations in the
memory completes before the plurality of store operations become globally observed in
the system.
5. The method as claimed in claim 1, wherein executing the load instruction
comprises:
loading the data from a write combining buffer; and

generating the store forward is OK signal.
6. The method as claimed in claim 1, wherein determining that a memory
region addressed by the load instruction matches a cacheline address in a memory
comprises:
comparing an address of the memory region and the cacheline address; and
determining that the address of the memory region is the same address as the
cacheline address.
7. The method as claimed in claim 1, wherein determining that the data stored
by the plurality of store instructions completely covers the memory region addressed by
the load instruction comprises:
determining that a size of the data stored by the plurality of store instructions equals
a size of the memory region addressed by the load instruction.
8. The method as claimed in claim 1, comprising:
terminating, if an address of the memory region and the cacheline address in the
memory are different.
9. The method as claimed in claim 1, comprising:
re-executing the load instruction, if the memory region is incompletely covered by
the data stored by the plurality of store instructions.
10. The method as claimed in claim 1, wherein intermediate results from the
plurality of store instructions are invisible to other concurrent processes.
11. The method as claimed in claim 1, wherein the method operates within the
restrictive memory model.
12. A machine-readable medium having stored thereon a plurality of executable
instructions for multiple store buffer forwarding in a system with a restrictive memory
model, the plurality of instructions adapted to cause a processor to perform instructions to:

execute a plurality of store instructions;
execute a load instruction;
determine that a memory region addressed by the load instruction matches a
cacheline address in a memory;
determine that data stored by the plurality of store instructions completely covers
the memory location in the memory specified by the load instruction; and
transmit a store forward is OK signal.
13. The machine-readable medium as claimed in claim 12, wherein the execute
the plurality of store instructions comprises an instruction to:
perform a plurality of store operations to store a plurality of data values in
contiguous memory locations in the memory, wherein the size of the contiguous memory
locations equals the size of the memory region addressed by the load instruction.
14. The machine-readable medium as claimed in claim 13, wherein the execute
a load instruction comprises instructions to:
load the data from the contiguous memory locations in the memory; and
generate the store forward is OK signal.
15. The machine-readable medium as claimed in claim 14, wherein the load
data from the contiguous memory locations in the memory instruction begins executing
after the perform the plurality of store operations instruction begins executing, and the
load data from the contiguous memory locations in the memory instruction completes
executing before the plurality of store operations become globally observed in the system.
16. The machine-readable medium as claimed in claim 12, wherein the execute
the load instruction comprises instructions to:
load the data from a write combining buffer; and
generate the store forward is OK signal.

17. The machine-readable medium as claimed in claim 12, wherein the
determine that a memory region addressed by the load instruction matches a cacheline
address in a memory instruction comprises instructions to:
compare the address of the memory region and the cacheline address; and
determine that the address of the memory region is the same address as the
cacheline address.
18. The machine-readable medium as claimed in claim 12, wherein the
determine that data stored by the plurality of store instructions completely covers the
memory location in the memory specified by the load instruction comprises an instruction
to:
determine that a size of the data stored by the plurality of store instructions equals
a size of the memory region addressed by the load instruction.
19. The machine-readable medium as claimed in claim 12, comprising an
instruction to:
terminate, if an address of the memory region and the cacheline address in the
memory are different.
20. The machine-readable medium as claimed in claim 12, comprising an
instruction to:
re-execute the load instruction, if the memory region is incompletely covered by the
data stored by the plurality of store instructions.
21. The machine-readable medium as claimed in claim 12, wherein the execute
the plurality of store instructions comprises an instruction to:
execute the plurality of store instructions to produce intermediate results that are
invisible to other concurrent processes.
22. The machine-readable medium as claimed in claim 12, wherein the plurality
of executable instructions operate within the restrictive memory model.

23. A processor system, comprising:
a processor;
a system memory coupled to the processor; and
a non-volatile memory coupled to the processor in which is stored an article of
manufacture comprising instructions adapted to be executed by the processor, the
instructions which, when executed, encode instructions in an instruction set to enable
multiple store buffer forwarding in a system with a restrictive memory model, the article of
manufacture comprising instructions to:
execute a plurality of store instructions;
execute a load instruction;
determine that a memory region addressed by the load instruction matches a
cacheline address in a memory;
determine that data stored by the plurality of store instructions completely
covers the memory location in the memory specified by the load instruction; and
transmit a store forward is OK signal.
24. The processor system as claimed in claim 23, the processor comprising:
a write combining buffer, the write combining buffer comprising:
a comparator, the comparator being configured to receive and compare an
incoming load operation target address with all cacheline addresses of existing write
combining buffer entries;
an address and data buffer coupled to the comparator;
a data valid bits buffer coupled to the address and data buffer;
a multiplexer coupled to the data valid bits buffer; and
a comparison circuit coupled to the multiplexer.
25. The processor system as claimed in claim 24, the multiplexer being
configured to:
receive a byte valid vector from the data valid bits buffer;
receive address bits from the load operation;
select a group of valid bits from the byte valid vector; and
output the group of valid bits.

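The multiplexer stage of claim 25 can be illustrated with a minimal sketch, assuming a 64-bit byte valid vector (one bit per byte of the cacheline) and that the low address bits of the load give its byte offset within the line; the function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Hypothetical multiplexer: use the load's byte offset within the
 * cacheline to select, from the 64-bit byte valid vector, the group
 * of valid bits covering the bytes the load would consume.
 * group_size is the load width in bytes (assumed 1..8). */
uint8_t select_valid_group(uint64_t byte_valid_vector,
                           unsigned load_offset, unsigned group_size)
{
    uint64_t shifted = byte_valid_vector >> load_offset;
    uint8_t  mask    = (group_size >= 8) ? 0xFF
                                         : (uint8_t)((1u << group_size) - 1);
    return (uint8_t)(shifted & mask);
}
```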
26. The processor system as claimed in claim 24, the comparison circuit being
configured to:
receive the group of valid bits;
receive an incoming load operation byte mask;
determine that it is acceptable to forward the data using the group of valid bits
and the incoming load operation byte mask; and
produce a forward OK signal.
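The comparison circuit of claim 26 reduces to a single bitwise test, shown below as a hypothetical sketch: forwarding is acceptable only when no byte requested by the load's byte mask is missing from the group of valid bits selected by the multiplexer.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical comparison circuit: every byte the load's byte mask
 * requests must be marked valid in the selected group of valid bits.
 * Any requested-but-invalid byte vetoes forwarding. */
bool forward_ok(uint8_t group_valid_bits, uint8_t load_byte_mask)
{
    return (load_byte_mask & (uint8_t)~group_valid_bits) == 0;
}
```

Note that the valid bits may cover more bytes than the load requests; only the converse (a requested byte that is not valid) blocks the forward OK signal.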
27. The processor system as claimed in claim 23, wherein the execute
the plurality of store instructions comprises an instruction to:
perform a plurality of store operations to store a plurality of data values in
contiguous memory locations in the memory, wherein the size of the contiguous
memory locations equals the size of the memory region addressed by the load
instruction.
28. The processor system as claimed in claim 27, wherein the
execute a load instruction comprises instructions to:
load the data from the contiguous memory locations in the memory; and
generate the store forward is OK signal.
29. The processor system as claimed in claim 28, wherein the load
data from the contiguous memory locations in the memory instruction begins executing
after the perform the plurality of store operations instruction begins executing, and the
load data from the contiguous memory locations in the memory instruction completes
executing before the plurality of store operations become globally observed in the
system.
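Claims 27 through 29 describe several narrow stores to contiguous locations whose data collectively covers a wider load, which can then forward from the buffer before the stores become globally observed. The sketch below is an illustrative assumption of how per-byte valid bits accumulate as such stores merge into a write combining buffer entry; `merge_store` and its parameters are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical merge step: each store to the cacheline sets the valid
 * bits for the bytes it writes (offset and size in bytes, size 1..8).
 * Once the merged bits cover all bytes a load reads, that load can be
 * forwarded from the buffer. */
uint64_t merge_store(uint64_t byte_valid, unsigned offset, unsigned size)
{
    uint64_t bits = ((1ULL << size) - 1ULL) << offset;
    return byte_valid | bits;
}
```

For example, two 4-byte stores at offsets 0 and 4 set valid bits 0 through 7, so an 8-byte load at offset 0 is completely covered even though neither store alone covers it.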
30. The processor system as claimed in claim 23, wherein said processor is
implemented as multiple processors wherein a separate set of hardware resources is
associated with each of said multiple processors.
31. A multiple store buffer forwarding apparatus, comprising:
a processor having a write combining buffer, and

a non-volatile memory coupled to the processor, said non-volatile memory storing
instructions which when executed by the processor cause the processor to:
execute a plurality of store instructions referencing a first memory region;
execute a load instruction referencing a second memory region;
determine that the second memory region matches a cacheline address;
determine that the first memory region completely covers the second
memory region; and
transmit a store forward is OK signal.
32. The multiple store buffer forwarding apparatus as claimed in claim 31,
wherein the write combining buffer comprises:
a comparator to receive and compare an address of the second memory region
with all existing cacheline addresses in the write combining buffer,
an address and data buffer coupled to the comparator,
a data valid bits buffer coupled to the address and data buffer,
a multiplexer coupled to the data valid bits buffer, and
a comparison circuit coupled to the multiplexer.
33. The multiple store buffer forwarding apparatus as claimed in claim 32,
wherein the multiplexer is to:
receive a byte valid vector from the data valid bits buffer,
receive address bits from the load instruction,
select a group of valid bits from the byte valid vector, and
output the group of valid bits.
34. The multiple store buffer forwarding apparatus as claimed in claim 33,
wherein the comparison circuit is to:
receive the group of valid bits;
receive an incoming load instruction byte mask;
determine that it is acceptable to forward the data using the group of valid bits and
the incoming load instruction byte mask; and
produce a forward OK signal.

35. The multiple store buffer forwarding apparatus as claimed in claim 31,
wherein said processor is implemented as multiple processors wherein a separate set
of hardware resources is associated with each of said multiple processors.
36. A multiple store buffer forwarding apparatus, comprising:
a memory;
a processor coupled to said memory and having a write combining buffer, said
processor to
execute a plurality of store instructions referencing a first memory region of
said memory;
execute a load instruction referencing a second memory region of said
memory;
determine that the second memory region matches a cacheline address;
determine that the first memory region completely covers the second
memory region; and
transmit a signal indicating that store buffer forwarding is authorized.
37. The multiple store buffer forwarding apparatus as claimed in claim 36,
wherein the write combining buffer comprises:
a comparator to receive and compare an address of the second memory region
with all existing cacheline addresses in the write combining buffer,
an address and data buffer coupled to the comparator,
a data valid bits buffer coupled to the address and data buffer,
a multiplexer coupled to the data valid bits buffer, and
a comparison circuit coupled to the multiplexer.
38. The multiple store buffer forwarding apparatus as claimed in claim 37,
wherein the multiplexer is to:
receive a byte valid vector from the data valid bits buffer,
receive address bits from the load instruction,
select a group of valid bits from the byte valid vector, and
output the group of valid bits.

39. The multiple store buffer forwarding apparatus as claimed in claim 38,
wherein the comparison circuit is to:
receive the group of valid bits;
receive an incoming load instruction byte mask;
determine that it is acceptable to forward the data using the group of valid bits and
the incoming load instruction byte mask; and
produce a signal indicating that it is acceptable to forward the data.
40. The multiple store buffer forwarding apparatus as claimed in claim 36,
wherein said processor is implemented as multiple processors wherein a separate set of
hardware resources is associated with each of said multiple processors.


Documents

Application Documents

# Name Date
1 592-kolnp-2003-specification.pdf 2011-10-06
2 592-kolnp-2003-reply to examination report.pdf 2011-10-06
3 592-kolnp-2003-granted-translated copy of priority document.pdf 2011-10-06
4 592-kolnp-2003-granted-specification.pdf 2011-10-06
5 592-kolnp-2003-granted-reply to examination report.pdf 2011-10-06
6 592-kolnp-2003-granted-gpa.pdf 2011-10-06
7 592-kolnp-2003-granted-form 5.pdf 2011-10-06
8 592-kolnp-2003-granted-form 3.pdf 2011-10-06
9 592-kolnp-2003-granted-form 18.pdf 2011-10-06
10 592-kolnp-2003-granted-form 13.pdf 2011-10-06
11 592-kolnp-2003-granted-form 1.pdf 2011-10-06
12 592-kolnp-2003-granted-examination report.pdf 2011-10-06
13 592-kolnp-2003-granted-drawings.pdf 2011-10-06
14 592-kolnp-2003-granted-description (complete).pdf 2011-10-06
15 592-kolnp-2003-granted-correspondence.pdf 2011-10-06
16 592-kolnp-2003-granted-claims.pdf 2011-10-06
17 592-kolnp-2003-granted-abstract.pdf 2011-10-06
18 592-kolnp-2003-form 5.pdf 2011-10-06
19 592-kolnp-2003-form 3.pdf 2011-10-06
20 592-kolnp-2003-form 18.pdf 2011-10-06
21 592-kolnp-2003-form 13.pdf 2011-10-06
22 592-kolnp-2003-form 1.pdf 2011-10-06
23 592-kolnp-2003-examination report.pdf 2011-10-06
24 592-kolnp-2003-drawings.pdf 2011-10-06
25 592-kolnp-2003-description (complete).pdf 2011-10-06
26 592-kolnp-2003-correspondence.pdf 2011-10-06
27 592-kolnp-2003-claims.pdf 2011-10-06
28 592-kolnp-2003-abstract.pdf 2011-10-06
29 592-KOLNP-2003-(06-07-2012)-FORM-27.pdf 2012-07-06
30 592-KOLNP-2003-FORM-27.pdf 2012-07-19
31 592-KOLNP-2003-(25-03-2013)-FORM-27.pdf 2013-03-25
32 592-KOLNP-2003-(26-03-2013)-FORM-27.pdf 2013-03-26
33 592-KOLNP-2003-(24-03-2015)-CORRESPONDENCE.pdf 2015-03-24
34 235377-FORM 27-210316.pdf 2016-06-22
35 592-KOLNP-2003-RELEVANT DOCUMENTS [30-03-2018(online)].pdf 2018-03-30
36 592-KOLNP-2003-25-01-2023-ALL DOCUMENTS.pdf 2023-01-25
37 592-KOLNP-2003-01-02-2023-RELEVANT DOCUMENTS.pdf 2023-02-01

ERegister / Renewals

3rd: 22 Sep 2009 (From 18/12/2003 To 18/12/2004)
4th: 22 Sep 2009 (From 18/12/2004 To 18/12/2005)
5th: 22 Sep 2009 (From 18/12/2005 To 18/12/2006)
6th: 22 Sep 2009 (From 18/12/2006 To 18/12/2007)
7th: 22 Sep 2009 (From 18/12/2007 To 18/12/2008)
8th: 22 Sep 2009 (From 18/12/2008 To 18/12/2009)
9th: 22 Sep 2009 (From 18/12/2009 To 18/12/2010)
10th: 23 Feb 2010 (From 18/12/2010 To 18/12/2011)
11th: 23 Nov 2011 (From 18/12/2011 To 18/12/2012)
12th: 08 Nov 2012 (From 18/12/2012 To 18/12/2013)
13th: 21 Nov 2013 (From 18/12/2013 To 18/12/2014)
14th: 20 Nov 2014 (From 18/12/2014 To 18/12/2015)
15th: 20 Nov 2015 (From 18/12/2015 To 18/12/2016)
16th: 17 Nov 2016 (From 18/12/2016 To 18/12/2017)