"Self Healing Core"


Abstract: To support self healing of the cores, each core contains various registers. These registers may be self heal registers which are accessible to the OS and the firmware. The self heal registers may contain a control register, a threshold register and two sets of bank registers. The bank registers may be used for logging. These bank registers may also be known as context MCB and health MCB registers. The OS or firmware may program these registers to set up self healing. The self healing core does not require OS intervention for periodic checking and tracking. The core itself monitors the statistics and periodically performs a health check.

Patent Information

Application #

Filing Date

29 December 2006

Publication Number

31/2008

Publication Type

INA

Invention Field

ELECTRICAL

Status

Parent Application

Applicants

INTEL CORPORATION

2200 MISSION COLLEGE BOULEVARD, SANTA CLARA, CA 95052, USA

Inventors

1. SINGARAVELAN NALLASELLAN

FLAT 102, NIGAM HOMES, 21/1, 5TH CROSS, 12TH MAIN HAL 3RD STAGE, KODIHALLI, BANGALORE-560017, INDIA

2. HARINARAYANAN SESHADRI

APT 301, HARITHA APT. 112, 11 CROSS, MALLESWARAM, BANGALORE-560003,INDIA

Specification

SELF HEALING CORE
BACKGROUND INFORMATION
[0001] Each generation of computing is able to do more useful work by supporting more applications. With each generation, there is also a need to improve the availability of computing services. One such service is self healing.
[0002] Self healing systems aim to make significant progress in reducing recovery time and to implement systems that can diagnose, react to, and even predict failures. Currently, no existing processors support self healing cores.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] Various features of the invention will be apparent from the following description of preferred embodiments as illustrated in the accompanying drawings, in which like reference numerals generally refer to the same parts throughout the drawings. The drawings are not necessarily to scale, the emphasis instead being placed upon illustrating the principles of the inventions.
[0004] Figure 1 illustrates a block diagram of a multi-core system.
[0005] Figure 2 illustrates a flow chart for core reset flow in accordance with one embodiment.
[0006] Figures 3a and 3b illustrate a flow chart for a self healing core on corrected error.
[0007] Figure 4 illustrates a flow chart of an OS flow on corrected error handling in accordance with one embodiment.
[0008] Figure 5 illustrates a core state diagram.
[0009] Figure 6 is a block diagram of a computer system in accordance with an embodiment of the invention.
[0010] Figure 7 is a block diagram of a computing system arranged in a point-to-point configuration, according to one embodiment of the invention.
DETAILED DESCRIPTION
[0011] In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular structures, architectures, interfaces, techniques, etc. in order to provide a thorough understanding of the various aspects of the invention. However, it will be apparent to those skilled in the art having the benefit of the present disclosure that the various aspects of the invention may be practiced in other examples that depart from these specific details. In certain instances, descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
[0012] Processors typically raise machine check architecture (MCA) exceptions to notify the OS or firmware about hardware errors. To support this MCA event management architecture, processors provide a set of model specific registers that are accessible to the OS and firmware. The OS uses these registers to set up machine checking and logging of errors.
[0013] When the processor hardware detects an error, it logs the error information to bank registers and signals the exception to the OS by generating an MC exception. The OS handler may then initiate corrective action depending on the severity of the error.
[0014] However, the processor (or core) by itself does not take any self corrective action on errors and is entirely dependent on the OS or firmware to initiate any corrective action. The present disclosure proposes a new architecture wherein the processor may be intelligent enough to trigger self error corrective action itself based on the error conditions.
[0015] Figure 1 illustrates a block diagram of a multicore system 100. The system 100 includes an operating system (OS) 105 which communicates with a BIOS 110. The BIOS 110 includes a specification that may be specific to ACPI 115. The system 100 also includes various processors 120. Each processor may contain one or more cores C0 to Cn. Each processor 120 communicates with both the BIOS 110 and the OS 105.
[0016] To support self healing of the cores, the cores also contain various registers 125. These registers may be self heal registers which are accessible to the OS 105 and the firmware. The self heal registers 125 may contain a control register, a threshold register and two sets of bank registers. The bank registers may be used for logging. These bank registers may also be known as context MCB and health MCB registers. The OS 105 or firmware may program these registers 125 to set up self healing. It should be noted that the number of registers and the type of registers may vary based on the implementation of the system.
[0017] Each core may have a set of registers that are required to control self healing of the core. Each core may have one control register, one threshold register for each type of event tracking required by the core, context MCB registers and health MCB registers.
[0018] The control register enables self healing. This register may provide one bit for each threshold. A separate bit for each threshold gives the core the flexibility to track only particular types of errors. By default, the core powers up with the self healing mode disabled.

[0019] The bits of the self healing control register are assigned as follows: Bit 0 is the self healing enable; Bit 1 is the corrected error threshold enable; and Bit 2 is the correctable error threshold enable.
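The bit assignments of the control register can be modeled in software as a small flag set. The following is a minimal Python sketch for illustration only, assuming the three bits named in paragraph [0019]; the names `SelfHealControl`, `tracks` and `POWER_ON_DEFAULT` are hypothetical and do not appear in the specification, and an actual core would implement this in hardware or microcode.

```python
from enum import IntFlag


class SelfHealControl(IntFlag):
    """Hypothetical model of the self healing control register bits."""
    SELF_HEAL_ENABLE = 1 << 0          # Bit 0: self healing enabled
    CORRECTED_THRESHOLD_EN = 1 << 1    # Bit 1: corrected error threshold enable
    CORRECTABLE_THRESHOLD_EN = 1 << 2  # Bit 2: correctable error threshold enable


# Per paragraph [0018], the core powers up with self healing disabled.
POWER_ON_DEFAULT = SelfHealControl(0)


def tracks(control: int, threshold_bit: SelfHealControl) -> bool:
    """Return True if self healing is enabled and the given threshold bit is set."""
    ctrl = SelfHealControl(control)
    return bool(ctrl & SelfHealControl.SELF_HEAL_ENABLE) and bool(ctrl & threshold_bit)
```

In this sketch, a separate bit per threshold lets a caller ask about one error type at a time, which mirrors the flexibility described for the control register.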
[0020] The self healing corrected error threshold register and the self healing correctable error threshold register reflect the value at which the self healing action should be triggered or activated. The OS or firmware may program this register with a positive corrected error threshold value.
[0021] The context MCB registers may contain the reason code and the core state that are required for the OS 105 to reliably transfer the process running on one core to any other core in the system 100. The contents of the registers are required to be preserved across cold reset.
[0022] The health MCB registers may contain the health check result code and the health of the core. The health check code may provide the result of the self healing action, such as recovered, failed, etc. The registers may also provide the health of the core itself.
[0023] Figure 2 is a flowchart of a method 200 for core reset flow in accordance with one embodiment. When a system is powered up, the core has to perform certain functions before it goes into self healing mode. The method 200 of Fig. 2 sets up the registers and checks their status.
[0024] Initially, when the core powers up it runs a core built-in self test (BIST) 210. When the BIST is run, the OS obtains the status of the health data of the core. Upon running the test, the core reports its health through the health MCB and architecture register 220. Next, the system determines the health of the core based on the data 230.
[0025] If the health of the core is good, then the self healing enable bit is cleared, the threshold registers are cleared and the threshold match counters are cleared 240. The core does not clear the bank registers, so that previous error logs in the bank registers are saved for the OS to track the core errors. The health MCB registers are populated 250 and the core comes online to function normally. The core then waits for SIPI 260.
[0026] If the health of the core is not good, then the health MCB is populated with the error 270 and the OS is notified.
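The reset flow of Fig. 2 can be sketched as a single function. This is a hedged Python model under stated assumptions, not the patented implementation: the `Core` dataclass, its field names, the string state labels and the `bist_passed` argument are all hypothetical, and the BIST itself is reduced to a boolean result.

```python
from dataclasses import dataclass, field

SELF_HEAL_ENABLE = 0x1  # Bit 0 of the control register


@dataclass
class Core:
    """Hypothetical software model of one core's self heal registers."""
    control: int = 0
    thresholds: dict = field(default_factory=dict)
    match_counters: dict = field(default_factory=dict)
    bank_log: list = field(default_factory=list)   # earlier error logs
    health_mcb: dict = field(default_factory=dict)
    state: str = "reset"


def core_reset(core: Core, bist_passed: bool) -> str:
    """Apply the Fig. 2 decision: good health waits for SIPI, bad health notifies the OS."""
    if bist_passed:
        core.control &= ~SELF_HEAL_ENABLE   # step 240: clear the enable bit
        core.thresholds.clear()             # step 240: clear threshold registers
        core.match_counters.clear()         # step 240: clear match counters
        # bank_log is deliberately left untouched so the OS can track errors
        core.health_mcb = {"valid": True, "health": "good"}   # step 250
        core.state = "wait_for_sipi"        # step 260
    else:
        core.health_mcb = {"valid": True, "health": "bad"}    # step 270
        core.state = "os_notified"
    return core.state
```

Note that the bank log survives the reset in both branches, which is the property the specification calls out.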
[0027] Figures 3a and 3b illustrate a flow chart of a method 300 for a self healing core on corrected error. When the self healing feature is enabled, the core may track the threshold for each corrected error. Upon reaching a threshold (preset by the user), the core may unload its context into machine check bank registers, such as the context MCB registers, and generate an MCERR# signal. This flow occurs within each core itself.
[0028] Initially, the core tracks the corrected error handling entry register 305. The core may have an internal counter to increment the corrected error counter 310. The core then populates the MCB 315. By populating the machine check bank, the OS may later use the MCB to know that the core has gone into self healing mode.
[0029] Upon populating the MCB, the core checks to see if the self healing bit is enabled 320. If the self healing bit is not enabled, then the core may generate a corrected error interrupt 325 and continue execution 330.
[0030] If the self healing bit is enabled, then the core determines if it has reached its threshold 335. If it has not reached its threshold, the core generates a corrected error interrupt 325 and continues execution 330.
[0031] If the core determines it has reached its threshold, then the core will place its entire context into the context MCB 340 and generate an MCERR# signal 345. Then the core will run the self healing procedure 350 as shown in Fig. 3b.
[0032] In Fig. 3b, the system determines the health of the core based on data in the registers 355.
[0033] If the health of the core is good, then the self healing enable bit is cleared, the threshold registers are cleared and the threshold match counters are cleared 360. The core does not clear the bank registers, so that previous error logs in the bank registers are saved for the OS to track the core errors. The health MCB registers are populated and an MCERR# signal is generated 365 and the core comes online to function normally. The core then waits for SIPI 370.
[0034] If the health of the core is not good, then the health MCB is populated with the error 375 and the OS is notified.
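The per-core decision sequence of Fig. 3a (steps 310 to 350) can be sketched as follows. This Python model is illustrative only: the `Core` dataclass, its field names, the signal strings and the return values are hypothetical, and it assumes that "reaching" the threshold means the corrected error counter is greater than or equal to the programmed threshold value, which the specification does not state explicitly.

```python
from dataclasses import dataclass, field

SELF_HEAL_ENABLE = 0x1  # Bit 0 of the control register


@dataclass
class Core:
    """Hypothetical model of the state used by the corrected error flow."""
    control: int = 0
    threshold: int = 0
    corrected_count: int = 0
    mcb: list = field(default_factory=list)
    context_mcb: dict = field(default_factory=dict)
    signals: list = field(default_factory=list)


def on_corrected_error(core: Core, error_info: str, context: dict) -> str:
    """Handle one corrected error and report what the core does next."""
    core.corrected_count += 1                 # step 310: increment counter
    core.mcb.append(error_info)               # step 315: populate the MCB
    healing_on = bool(core.control & SELF_HEAL_ENABLE)            # step 320
    if not healing_on or core.corrected_count < core.threshold:   # step 335
        core.signals.append("CORRECTED_ERROR_INTERRUPT")          # step 325
        return "continue"                     # step 330
    core.context_mcb = {"valid": True, "context": context}        # step 340
    core.signals.append("MCERR#")             # step 345
    return "self_heal"                        # step 350: run Fig. 3b
```

With a threshold of 2, for example, the first corrected error in this sketch raises only the corrected error interrupt, and the second unloads the context and raises MCERR#.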
[0035] The OS or firmware MCA handler is invoked due to the MCERR# interrupt. MCERR# is a machine check error interrupt. The handler may read the context MCB registers of each core and check whether the MCB contains valid data. When the data is valid, the OS handler may take appropriate action such as switching the resources associated with the core to other cores. The OS may then clear the context MCB registers, set the processor state to self healing mode and invoke ACPI methods to update the processor data in ACPI tables. When the MCA handler finds a valid health MCB, meaning some core has reported its health, it may check the health of the core from the MCB register. If the health is good, the OS sends SIPI to the core and brings it online.
[0036] Figure 4 illustrates a flow chart of a method 400 of an OS flow on corrected error handling in accordance with one embodiment. To keep the processor information in the BIOS tables consistent with the system, the OS needs to invoke ACPI methods to update the processor information tables. The method 400 of Fig. 4 illustrates the self healing OS flow.
[0037] Initially, the OS is running 405 and polls to check for error events. If the corrected error interrupt is disabled, the OS polls the MCB registers for corrected error events 410. If there are no error events, the OS keeps running.
[0038] If, upon polling, there are error events, then the OS performs a machine check exception 415. In addition, a processor corrected event may also occur 413. The processor event also checks the error handler entry 415.
[0039] The system then determines if self healing is enabled 420. If enabled, the OS reads the context MCBs of each processor 425. The OS next determines if the context MCBs of each processor are valid 430.
[0040] If they are valid, the OS reallocates the resources to other processors 435 and clears the context MCB registers 440. The OS also invokes the ACPI interface to update the processor state 445. The processor state is set to self healing mode and the OS continues execution 450.
[0041] If they are not valid, the OS may read the health MCBs of each processor 455 and determine if they are valid 460. If the health MCBs are not valid, the OS will continue error handling execution 465.
[0042] If the health MCBs are valid, the OS determines if the health is good 470. If the health is not good, the OS invokes the ACPI interface to update the processor state 475. The processor state is set to offline and then the OS continues execution 480.
[0043] If the health is good, then the OS sends SIPI and brings the processor online 485. Next, the OS invokes the ACPI interface to update the processor state 490. The processor state is set to online and the OS clears the health MCB 495 to continue execution.
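The OS-side branches of Fig. 4 (steps 420 to 495) can be summarized in a short handler. The Python sketch below is an assumption-laden illustration: the dictionary layout of each core, the action strings and the function name `os_handle_error` are hypothetical, ACPI updates are reduced to setting a `state` field, and the per-core loop is one possible reading of "each processor". The specification does not describe the path taken when self healing is disabled, so the sketch simply returns no actions in that case.

```python
def os_handle_error(cores: dict, self_healing_enabled: bool = True) -> list:
    """Walk every core's MCBs and return the (core_id, action) decisions taken."""
    actions = []
    if not self_healing_enabled:                  # step 420 (assumed: no action)
        return actions
    for core_id, core in cores.items():
        ctx = core["context_mcb"]                 # step 425
        health = core["health_mcb"]               # step 455
        if ctx.get("valid"):                      # step 430
            actions.append((core_id, "reallocate_resources"))  # step 435
            ctx.clear()                           # step 440
            core["state"] = "self_healing"        # step 445 (ACPI update)
        elif health.get("valid"):                 # step 460
            if health.get("health") == "good":    # step 470
                actions.append((core_id, "send_sipi"))          # step 485
                core["state"] = "online"          # step 490 (ACPI update)
                health.clear()                    # step 495
            else:
                actions.append((core_id, "take_offline"))
                core["state"] = "offline"         # step 475 (ACPI update)
        # Neither MCB valid: ordinary error handling continues (step 465).
    return actions
```

A core whose context MCB is valid is handled before its health MCB is consulted, matching the order of the checks in paragraphs [0039] to [0041].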
[0044] Figure 5 illustrates a core state diagram. Initially the core is online 505, and gets reset 510 to check if its health is good during self heal 515. If the health is good 520, the core waits for SIPI 525. The core then goes online.
[0045] If the health of the core is not good 530, the core goes offline 535. The core may reset 540 to check the status of its health during self heal 515.
[0046] When the core is online, it may also check if its threshold has been reached 545 during self heal 515.
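The state diagram of Fig. 5 can be written down as a transition table. The following Python sketch is one reading of paragraphs [0044] to [0046]; the state and event names are hypothetical labels chosen for illustration, and any transition not described in the text is treated as invalid.

```python
# (current state, event) -> next state, following Fig. 5
TRANSITIONS = {
    ("online", "reset"): "self_heal",               # 505 -> 510 -> 515
    ("online", "threshold_reached"): "self_heal",   # 545 -> 515
    ("self_heal", "health_good"): "wait_for_sipi",  # 520 -> 525
    ("self_heal", "health_bad"): "offline",         # 530 -> 535
    ("wait_for_sipi", "sipi"): "online",            # core goes online
    ("offline", "reset"): "self_heal",              # 540 -> 515
}


def next_state(state: str, event: str) -> str:
    """Return the next core state, or raise ValueError for an undescribed transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

For instance, a core that hits its threshold, passes the health check and then receives SIPI moves from online to self_heal to wait_for_sipi and back to online.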
[0047] In the emerging world, each system will be able to do more useful work by supporting more application services with more memory, computer power and I/O. One way to accomplish this is to disable a particular processor core. This provides a more effective use of computing resources and dollars and reduces downtime for end users.
[0048] Systems with high availability may be possible only when the processor can periodically monitor its health and correct its health issues proactively. With the number of cores increasing, the OS and BIOS utilization to handle the periodic checking also increases. The self healing core does not require OS intervention for periodic checking and tracking. The core itself monitors the statistics and periodically performs a health check.
[0049] Advantageously, the system described in reference to Figs. 1-7 provides an important core RAS feature, availability. This feature is extensible, backward compatible and easily implemented. The self healing core does not require BIOS intervention to offline or online cores. The BIOS is invoked to update its ACPI table with the current state. Thus the self healing core reduces the time to check the core's health.
[0050] Figure 6 illustrates a block diagram of a computing system 600 in accordance with an embodiment of the invention. The computing system 600 may include one or more central processing unit(s) (CPUs) 31 or processors that communicate via an interconnection network (or bus) 49. The processors 31 may be any type of processor, such as a general purpose processor, a network processor (that processes data communicated over a computer network 48), or other types of processor (including a reduced instruction set computer (RISC) processor or a complex instruction set computer (CISC) processor). Moreover, the processors 31 may have a single or multiple core design. The processors 31 with a multiple core design may integrate different types of processor cores on the same integrated circuit (IC) die. Also, the processors 31 may utilize the embodiments discussed with reference to Figs. 1 and 2. For example, one or more of the processors 31 may include one or more processor cores 32. Also, the operations discussed with reference to Figs. 1 and 2 may be performed by one or more components of the system 600.
[0051] A chipset 33 may also communicate with the interconnection network 49. The chipset 33 may include a memory control hub (MCH) 34. The MCH 34 may include a memory controller 36 that communicates with a memory 41. The memory 41 may store data and sequences of instructions that are executed by the CPU 31, or any other device included in the computing system 600. In one embodiment of the invention, the memory 41 may include one or more volatile storage (or memory) devices such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or the like. Nonvolatile memory may also be utilized, such as a hard disk. Additional devices may communicate via the interconnection network 49, such as multiple CPUs and/or multiple system memories.
[0052] The MCH 34 may also include a graphics interface 37 that communicates with a graphics accelerator 42. In one embodiment of the invention, the graphics interface 37 may communicate with the graphics accelerator 42 via an accelerated graphics port (AGP). In an embodiment of the invention, a display (such as a flat panel display) may communicate with the graphics interface 37 through, for example, a signal converter that translates a digital representation of an image stored in a storage device such as a video memory or system memory into display signals that are interpreted and displayed by the display. The display signals produced by the display device may pass through various control devices before being interpreted by and subsequently displayed on the display.
[0053] A hub interface 51 may allow the MCH 34 to communicate with an input/output control hub (ICH) 38. The ICH 38 may provide an interface to I/O devices that communicate with components of the computing system 600. The ICH 38 may communicate with a bus 47 through a peripheral bridge (or controller) 39, such as a peripheral component interconnect (PCI) bridge, a universal serial bus (USB) controller, or the like. The bridge 39 may provide a data path between the CPU 31 and peripheral devices. Other types of topologies may be utilized. Also, multiple buses may communicate with the ICH 38, e.g., through multiple bridges or controllers. Moreover, other peripherals in communication with the ICH 38 may include, in various embodiments of the invention, integrated drive electronics (IDE) or small computer system interface (SCSI) hard drive(s), USB ports, a keyboard, a mouse, parallel port(s), serial port(s), floppy disk drive(s), digital output support (e.g., digital video interface (DVI)), or the like.
[0054] The bus 47 may communicate with an audio device 43, one or more disk drive(s) 44, and a network interface device 46 (which communicates with the computer network 48). Other devices may be in communication with the bus 47. Also, various components (such as the network interface device 46) may be in communication with the MCH 34 in some embodiments of the invention. In addition, the processor 31 and the MCH 34 may be combined to form a single chip. Furthermore, the graphics accelerator 42 may be included within the MCH 34 in other embodiments of the invention.
[0055] Furthermore, the computing system 600 may include volatile and/or nonvolatile memory (or storage). For example, nonvolatile memory may include one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), a disk drive (e.g., 44), a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, or other types of nonvolatile machine-readable media capable of storing electronic instructions and/or data.
[0056] Figure 7 illustrates a computing system 700 that is arranged in a point-to-point (PtP) configuration, according to an embodiment of the invention. In particular, Fig. 7 shows a system where processors, memory, and input/output devices are interconnected by a number of point to point interfaces. The operations discussed with reference to Figs. 1-6 may be performed by one or more components of the system 700.
[0057] As illustrated in Fig. 7, the system 700 may include several processors, of which only two, processors 5, 10, are shown for clarity. The processors 5, 10 may each include a local memory controller hub (MCH) 15, 20 to allow communication with memories 15, 20. The memories 15 and/or 20 may store various data such as those discussed with reference to the memory 512.
[0058] The processors 5, 10 may be any type of processor, such as those discussed with reference to the processors 31 of Fig. 6. The processors 5, 10 may exchange data via a point-to-point interface 93 using PtP interface circuits 40 and 45, respectively. The processors 5, 10 may each exchange data with a chipset 50 via individual PtP interfaces 55, 60 using point-to-point interface circuits 65, 70, 75, 80. The chipset 50 may also exchange data with a high-performance graphics circuit 37 via a high-performance graphics interface 97, using a PtP interface circuit 90.
[0059] At least one embodiment of the invention may be provided within the processors 5, 10. For example, one or more of the processor core(s) 32 may be located within the processors 5, 10. Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system 600 of Fig. 6. Furthermore, other embodiments of the invention may be distributed throughout several circuits, logic units, or devices illustrated in Fig. 7.
[0060] The chipset 50 may communicate with a bus 16 using a PtP interface circuit 95. The bus 16 may have one or more devices that communicate with it, such as a bus bridge 18 and I/O devices 14. Via a bus 20, the bus bridge 18 may be in communication with other devices such as a keyboard/mouse 22, communication devices 26 (such as modems, network interface devices, etc. that may be in communication with the computer network 48), audio I/O devices, and/or a data storage device 28. The data storage device 28 may store code 30 that may be executed by the processors 5 and/or 10.
[0061] In various embodiments of the invention, the operations discussed herein, e.g., with reference to Figs. 1-7, may be implemented by hardware (e.g., circuitry), software, firmware, microcode, or combinations thereof, which may be provided as a computer program product, e.g., including a machine readable or computer readable medium having stored thereon instructions (or software procedures) used to program a computer to perform a process discussed herein. Also, the term "logic" may include, by way of example, software, hardware, or combinations of software and hardware. The machine readable medium may include a storage device such as those discussed with respect to Figs. 1-7. Additionally, such computer readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection). Accordingly, herein, a carrier wave shall be regarded as comprising a machine readable medium.
[0062] Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation. The appearances of the phrase "in one embodiment" in various places in the specification may or may not be all referring to the same embodiment.
[0063] Also, in the description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. In some embodiments of the invention, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.
[0064] Thus, although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

WHAT IS CLAIMED IS:
1. A system comprising:
an operating system;
an I/O logic coupled to the operating system; and
one or more processors in communication with the operating system and the I/O logic, wherein the one or more processors comprise:
one or more cores, wherein the one or more cores perform error correction to the one or more cores based on error conditions.
2. The system of claim 1 wherein the one or more cores comprise of various registers to control error correction of the one or more cores.
3. The system of claim 2 wherein the registers are accessible to the operating system and the I/O logic.
4. The system of claim 2 wherein the registers are a control register, a threshold register and first and second set of bank registers.
5. The system of claim 4 wherein the control register to enable error correction to the one or more cores.
6. The system of claim 4 wherein the threshold register to determine value at which the error correction is activated.
7. The system of claim 4 wherein the two set of bank registers is a context machine check bank register and a health machine check bank register.
8. The system of claim 7 wherein the context machine bank register to contain code.
9. The system of claim 7 wherein the health machine check bank register to provide health of the core.
10. A processor comprising:
one or more cores, wherein the one or more cores perform error correction to the one or more cores based on various conditions and wherein the one or more cores further comprises:
one or more registers to control error correction to the cores.
11. The processor of claim 10 wherein the one or more registers are a control register, a threshold register and first and second set of bank registers.
12. The processor of claim 11 wherein the control register to enable error correction to the one or more cores.
13. The processor of claim 11 wherein the threshold register to determine value at which the error correction is activated.
14. The processor of claim 11 wherein the two set of bank registers is a context machine check bank register and a health machine check bank register.
15. The processor of claim 14 wherein the context machine bank register to contain code.
16. The processor of claim 14 wherein the health machine check bank register to provide health of the core.
17. A method comprising:
performing an error correction test on a core;
determining health of the core after performing the test; and
clearing registers in the core if the health of the core is good.
18. The method of claim 17 further comprising keeping core online if health is good.
19. The method of claim 17 further comprising determining if the core has reached its threshold.
20. The method of claim 17 further comprising taking core offline if health is not good.

Documents

Application Documents

#	Name	Date
1	2828-DEL-2006-Form-18-(22-11-2010).pdf	2010-11-22
2	2828-DEL-2006-Correspondence-Others-(22-11-2010).pdf	2010-11-22
3	2828-DEL-2006-GPA-(29-06-2011).pdf	2011-06-29
4	2828-DEL-2006-Correspondence Others-(29-06-2011).pdf	2011-06-29
5	2828-DEL-2006-Petition-137-(09-08-2011).pdf	2011-08-09
6	2828-DEL-2006-Form-1-(09-08-2011).pdf	2011-08-09
7	2828-DEL-2006-Correspondence Others-(09-08-2011).pdf	2011-08-09
8	2828-del-2006-petition-138.pdf	2011-08-21
9	2828-del-2006-form-5.pdf	2011-08-21
10	2828-del-2006-form-3.pdf	2011-08-21
11	2828-del-2006-form-2.pdf	2011-08-21
12	2828-del-2006-form-1.pdf	2011-08-21
13	2828-del-2006-drawings.pdf	2011-08-21
14	2828-del-2006-description (complete).pdf	2011-08-21
15	2828-DEL-2006-Correspondence-Others.pdf	2011-08-21
16	2828-del-2006-claims.pdf	2011-08-21
17	2828-del-2006-abstract.pdf	2011-08-21
18	2828-DEL-2006-FER.pdf	2017-04-28
19	Form 3 [23-06-2017(online)].pdf	2017-06-23
20	2828-DEL-2006-FORM 4(ii) [28-10-2017(online)].pdf	2017-10-28
21	2828-DEL-2006-AbandonedLetter.pdf	2018-02-15

Search Strategy

1	search_2828del2006_20-03-2017.pdf