
System And Method Of Rebuilding Read Cache For A Rebooted Node Of A Multiple Node Storage Cluster

Abstract: The disclosure is directed to a system and method for managing cache memory of at least one node of a multiple-node storage cluster. According to various embodiments, a first cache data and a first cache metadata are stored for data transfers between a respective node and regions of a storage cluster receiving at least a first selected number of data transfer requests. When the node is rebooted, a second (new) cache data is stored to replace the first (old) cache data. The second cache data is compiled utilizing the first cache metadata to identify previously cached regions of the storage cluster receiving at least a second selected number of data transfer requests after the node is rebooted. The second selected number of data transfer requests is less than the first selected number of data transfer requests to enable a rapid build of the second cache data.


Patent Information

Application #
Filing Date
24 May 2013
Publication Number
48/2014
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

LSI CORPORATION
1320 RIDDER PARK DRIVE, SAN JOSE, CALIFORNIA 95131 UNITED STATES OF AMERICA

Inventors

1. SAMANTA SUMANESH
GLOBAL TECHNOLOGY PARK, BLOCK-C, MARATHAHALLI OUTER RING RD, BANGALORE INDIA
2. BISWAS SUJAN
GLOBAL TECHNOLOGY PARK, BLOCK-C, MARATHAHALLI OUTER RING RD, BANGALORE INDIA
3. SIMIONESCU HORIA CHRISTIAN
1320 RIDDER PARK DRIVE, SAN JOSE, CA UNITED STATES OF AMERICA
4. BERT LUCA
4165 SHACKLEFORD RD, ANORC, NORCROSS, GA UNITED STATES OF AMERICA
5. ISH MARK
1320 RIDDER PARK DRIVE, SAN JOSE, CA UNITED STATES OF AMERICA

Specification

SYSTEM AND METHOD OF REBUILDING READ CACHE FOR A
REBOOTED NODE OF A MULTIPLE-NODE STORAGE CLUSTER
FIELD OF INVENTION
[0001] The disclosure relates to the field of cache management in multiple-
node direct attached data storage systems.
BACKGROUND
[0002] Data storage systems, such as redundant array of independent disks (RAID) systems, typically provide protection against disk failures. However, direct attached storage (DAS) RAID controllers have little to no defense against server failure because they are typically embedded within a server. Two or more nodes (i.e. servers) are often used in high-availability storage clusters to mitigate the consequences of a failure.
[0003] In multiple-node storage clusters, cache is frequently maintained on
a local server. This local cache, often gigabytes to terabytes in size, enables low-latency, high-performance completion of data transfers from regions of the storage cluster experiencing high activity
or "hot" input/output (IO) data transfer requests. The local READ cache of
a temporarily disabled node can become stale or invalid because other
nodes continue to actively transfer data to cached and non-cached regions
of the storage cluster. Thus, when the node is rebooted, the old cache data is typically purged and a new local cache is built for the rebooted node, which can be very time-consuming and can degrade node performance.

SUMMARY
[0004] Various embodiments of the disclosure include a system and
method for managing local cache memory of at least one node of a
multiple-node storage cluster to improve ramp up of cache data in the local
cache memory after a respective node is rebooted. According to various
embodiments, the storage cluster includes a plurality of storage devices,
such as one or more JBOD complexes, accessible by a plurality of nodes
in communication with the storage cluster. At least one storage device is
configured to store local cache memory for at least one node of the
plurality of nodes. The cache memory includes a first cache data and a
first cache metadata associated with data transfers between the respective
node and regions of the storage cluster receiving at least a first selected
number of data transfer requests (e.g. at least 3 IO hits). The selected
number of data transfer requests can be arbitrarily set to any value
suitable for caching data and metadata corresponding to "hot" IO or high
activity regions of the storage cluster.
[0005] When the node is temporarily disabled (e.g. failed, shutdown,
suspended, or restarted), the first cache data may become stale or invalid.
A cache manager in communication with the cache memory is configured
to store a second (new) cache data in the cache memory when the node is
rebooted to replace the first (old) cache data. The cache manager is
configured to compile the second cache data by caching at least some
regions of the storage cluster that were previously cached by the first
cache data upon receiving at least a second selected number of data
transfer requests (e.g. 1 IO hit) at the previously cached regions after the
node is rebooted. The previously cached regions are determined or
identified utilizing the first cache metadata, thereby allowing the first cache
data to be purged (i.e. deleted or disregarded) when the node is disabled
or rebooted. The second selected number of data transfer requests is less

than the first selected number of data transfer requests to enable a rapid
build or ramp up of the second cache data based upon the first cache
metadata. Node performance and, therefore, the overall system
performance are improved by decreasing the time required to rebuild at
least a portion of the local cache after the respective node is rebooted.
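By way of illustration only (this sketch is not part of the filed specification), the two-threshold admission rule summarized above can be written as a simple predicate. The Python form and the names N_FIRST, M_SECOND, and qualifies_for_cache are assumptions made for clarity, not terms from the disclosure:

    # Illustrative sketch of the two selected thresholds and the admission rule.
    N_FIRST = 3   # first selected number of data transfer requests (e.g. 3 IO hits)
    M_SECOND = 1  # second selected number, chosen less than N (e.g. 1 IO hit)

    def qualifies_for_cache(hits: int, previously_cached: bool) -> bool:
        """Return True when a region has received enough requests to be cached."""
        return hits >= (M_SECOND if previously_cached else N_FIRST)

    # A region cached before the reboot qualifies again on its first hit,
    # while a brand-new region still needs the full N hits.
    assert qualifies_for_cache(1, previously_cached=True)
    assert not qualifies_for_cache(1, previously_cached=False)
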
[0006] It is to be understood that both the foregoing general description
and the following detailed description are not necessarily restrictive of the
disclosure. The accompanying drawings, which are incorporated in and
constitute a part of the specification, illustrate embodiments of the
disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The embodiments of the disclosure may be better understood by
those skilled in the art by reference to the accompanying figures in which:
FIG. 1 is a block diagram illustrating a multiple-node storage system, in
accordance with an embodiment of the disclosure;
FIG. 2 is a flow diagram illustrating a method of managing cache for a
rebooted node of a multiple-node storage system, in accordance
with an embodiment of the disclosure; and
FIG. 3 is a flow diagram illustrating a method of managing cache for a
rebooted node of a multiple-node storage system, in accordance
with an embodiment of the disclosure.
DETAILED DESCRIPTION
[0008] Reference will now be made in detail to the embodiments disclosed,
which are illustrated in the accompanying drawings.
[0009] FIG. 1 illustrates an embodiment of a multiple-node storage system
100. The system 100 includes at least one storage cluster 102, such as a
high availability cluster, accessible by a plurality of server nodes 106.

Each node 106 includes at least one controller 108 such as, but not limited
to, a RAID controller, RAID on Chip (ROC) controller, or at least one
single-core or multiple-core processor. The respective controller 108 of
each node 106 is configured to transfer data to or from logical block
address regions or "windows" defined across a plurality of storage devices
104, such as hard disk drives (HDDs) or solid-state disk (SSD) drives,
making up the storage cluster 102.
[0010] In some embodiments, two or more groupings of the storage
devices 104 are contained in two or more enclosures. In some
embodiments, the enclosures are separately powered to allow a first
server to take over a grouping of shared storage devices 104 and continue
to process data transfer requests without disruption when a second server
is permanently or temporarily disabled. In some embodiments, host nodes
106 include at least one processor running a computer program, such as WINDOWS SERVER or VMWARE CLUSTER SERVER, configured to provide planned or unplanned failover service to applications or guest operating systems.
[0011] According to various embodiments, each node 106 includes or is
communicatively coupled to at least one respective storage device 110
configured to store local cache memory. In some embodiments, the local
storage device 110 includes an SSD drive. The cache memory 110 is
configured to aid data transfers between the respective node 106 and
cached regions of the storage cluster 102 for low latency data transfers
and increased IO operations per second (IOPS). In some embodiments,
the local storage device 110 is onboard the controller 108 or coupled
directly to the respective node 106, thus sharing the same power domain.
[0012] A local cache manager 112 in communication with the cache
memory 110 is configured to manage cache data and cache metadata
stored in the cache memory 110. In some embodiments, the cache
manager 112 includes at least one dedicated processor or controller
configured to manage the cache memory 110 for at least one respective

node 106 according to program instructions executed by the processor
from at least one carrier medium. In some embodiments, the cache
manager 112 includes a module running on the controller 108 or a
processor of the respective node 106.
[0013] In some embodiments, the system 100 includes a High Availability
DAS cluster with very large (e.g. one or more terabytes) cache memory
110 for each node 106. Large cache memory 110 can provide a
significant boost in application performance; however, a considerable
amount of time is needed to ramp up or compile cache data. In some
instances, a particular node 106 is temporarily disabled (perhaps only for a few minutes) due to failure, shutdown, suspension, restart, routine maintenance, power cycle, software installation, or any other user-initiated or system-initiated event resulting in a period of nodal inactivity. Several
means exist in the art for retaining READ cache data and cache metadata
of a respective node 106 when the node 106 is either intentionally or
unintentionally disabled.
[0014] In some embodiments, for example, the cache manager 112 is
configured to store all cache data and cache metadata in the cache
memory 110 prior to intentionally disabling (e.g. shutting down, restarting,
or suspending) the respective node 106 by flushing random access
memory (RAM) to the cache memory 110. In some embodiments, when
the node 106 is unintentionally disabled (e.g. power failure), a
supercapacitor or battery backup enables a copy of the RAM image to be
stored in the cache memory 110 when the node 106 is disabled. When
the node 106 is rebooted, the cache metadata is read back from the cache
memory 110 to the RAM. In other embodiments, the cache manager 112
is configured for periodic saving of consistent READ Cache Metadata or
any alternative READ cache retention schemes known to the art.
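As a rough illustration of this retention step (again, not part of the filed specification), the in-RAM metadata table could be flushed to the local cache device at shutdown and read back at boot. The file path and JSON encoding below are placeholders for whatever on-device format an actual controller would use:

    import json
    from pathlib import Path

    # Hypothetical location on the local cache SSD; a real implementation would
    # use its own on-device metadata layout rather than a JSON file.
    METADATA_PATH = Path("/cache_ssd/read_cache_metadata.json")

    def flush_metadata(metadata: dict) -> None:
        """Persist the in-RAM READ cache metadata before the node goes down."""
        METADATA_PATH.write_text(json.dumps(metadata))

    def restore_metadata() -> dict:
        """Read the retained cache metadata back into RAM after a reboot."""
        if METADATA_PATH.exists():
            return json.loads(METADATA_PATH.read_text())
        return {}
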
[0015] Simple retention of READ cache data does not cure the
consequences arising from a temporarily disabled node 106 for the

multiple-node storage system 100. The READ cache data cannot be
trusted after a node 106 is rebooted (i.e. restarted or returned to an active
state) because the other nodes 106 continue to transfer data to or from the
storage cluster 102 while the node 106 is disabled or rebooting.
Accordingly, at least a portion of the READ cache data in the local cache
memory 110 may be stale or invalid when the node 106 resumes activity.
In some embodiments, at least a portion of the READ cache data for a first
node 106 becomes invalid when one or more of the other nodes 106
transfer (WRITE) data to regions of the storage cluster 102 cached for the
first node 106. Accordingly, the READ cache data is typically purged from
the cache memory 110 when a respective node 106 is rebooted.
Rebuilding a fresh cache each time a node 106 is rebooted can be very
time consuming, thus resulting in significantly reduced performance each
time a node is rebooted.
[0016] FIGS. 2 and 3 illustrate a method 200 of building a fresh cache
based upon retained cache metadata after a respective node 106 is
rebooted, according to various embodiments of this disclosure. In some
embodiments, method 200 is manifested by system 100. However, one or
more steps may be effected by additional or alternative configurations
beyond those described with regard to embodiments of system 100.
Method 200 is intended to encompass any system or device configured to
execute the following steps.
[0017] At step 202, high activity or "hot" logical block address (LBA)
regions of the storage cluster 102 are cached for each node 106. The hot
regions are those regions receiving at least a selected number "N" of data transfer requests. In some embodiments, for example, a region may be
considered hot after receiving N=3 IO hits. At step 204, cache data and
cache metadata are stored in the cache memory 110 of each node 106 for
the hot regions of the storage cluster 102. In some embodiments, the
cache manager 112 is configured to store first cache data associated with
data transfers between a respective node 106 (hereinafter "first node") and

the hot regions of the storage cluster 102. The cache manager 112 is
further configured to store first cache metadata associated with the cached
regions of the storage cluster 102. According to step 206, the hot regions
of the storage cluster 102 continue to be cached (steps 202 and 204) by
the first cache data until the first node 106 is temporarily disabled or
rebooted.
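A minimal sketch of steps 202 and 204 follows. The names CacheEntry and HotRegionCache, the per-region hit counter, and the simplified payload handling are assumptions for illustration; a real controller tracks LBA windows and cache lines quite differently:

    from dataclasses import dataclass

    N_HITS = 3  # first selected number "N" of data transfer requests (e.g. N=3)

    @dataclass
    class CacheEntry:
        """Illustrative record pairing cached data with its cache metadata."""
        region_id: int           # LBA window identifier
        hits: int = 0            # data transfer requests observed for the region
        data: bytes = b""        # first cache data, filled once the region is hot
        stale: bool = False      # set at reboot (step 208), cleared on re-admission
        stamped_at: float = 0.0  # time stamp recorded at reboot (step 302)

    class HotRegionCache:
        """Sketch of a per-node READ cache keyed by LBA region."""

        def __init__(self) -> None:
            self.entries: dict[int, CacheEntry] = {}

        def on_io(self, region_id: int, payload: bytes) -> None:
            # Step 202: count hits per region. Step 204: cache the data and
            # metadata once the region has received at least N hits.
            entry = self.entries.setdefault(region_id, CacheEntry(region_id))
            entry.hits += 1
            if entry.hits >= N_HITS and not entry.data:
                entry.data = payload
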
[0018] At step 208, the first cache metadata is retained in the cache
memory 110 when the first node 106 is rebooted or brought back into an
active state. In some embodiments, the first cache metadata is retained in the same hash as the cache data but is flagged or identified as "stale" so that invalid cache data is not used to serve IO. Accordingly, the cache manager 112 is enabled to identify previously cached regions of the storage cluster 102 utilizing the first cache metadata. The first cache data cannot be trusted and may be purged from the cache memory 110 to make space for new or fresh cache data when the first node is rebooted.
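Continuing the illustrative sketch above, step 208 amounts to keeping the metadata entries in the hash while flagging them stale and dropping the untrusted data; mark_entries_stale is a hypothetical helper operating on the CacheEntry records from the previous sketch:

    def mark_entries_stale(entries: dict) -> None:
        """Step 208 (illustrative): retain metadata, flag it stale, purge data."""
        for entry in entries.values():
            entry.stale = True   # metadata kept only as a hint; never serves IO
            entry.data = b""     # the first (old) cache data is purged
            entry.hits = 0       # hit counting restarts after the reboot
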
[0019] At step 210, a second (new) cache is built utilizing the first cache
metadata to identify previously cached regions of the storage cluster 102
which are still likely to be hot regions. In some embodiments, every IO
data transfer goes through the hash as usual and regions associated with
the first cache metadata (i.e. 'stale' entries) are cached as hot regions as
soon as they are hit or after a selected number "M" of hits, where M is less
than N. Accordingly, the second cache data is built faster because the
cache manager 112 is configured to store the second cache data for
regions receiving fewer than N hits, as long as they were previously cached (i.e. hot) regions before the first node 106 became disabled or was rebooted. Thus, the cache for a rebooted node 106 is ramped up faster for previously cached regions of the storage cluster 102 by utilizing the first (old) cache metadata for guidance.
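Step 210 can then be pictured as a small change to the admission check in the earlier on_io sketch: stale entries, which identify regions that were hot before the reboot, qualify after a lower threshold M, while new regions still need N hits. The rewritten method below assumes the CacheEntry and HotRegionCache names from the previous sketches and an illustrative M=1:

    M_HITS = 1  # second selected number "M" of data transfer requests, M < N

    def on_io(self, region_id: int, payload: bytes) -> None:
        # Every IO still goes through the hash; stale entries mark regions that
        # were previously cached and are re-admitted after only M hits.
        entry = self.entries.setdefault(region_id, CacheEntry(region_id))
        entry.hits += 1
        threshold = M_HITS if entry.stale else N_HITS
        if entry.hits >= threshold and not entry.data:
            entry.data = payload
            entry.stale = False  # the second (new) cache data is now valid
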
[0020] Since stale metadata entries are preserved in the hash table, they
will utilize some of the cache memory 110. Furthermore, a stale region may be hit a long time (e.g. 48 hours) after the first node 106 reboots.
In such a case, the first cache metadata entry cannot always be trusted.
For example, the IO pattern might have completely changed while the first
node was disabled. In some embodiments, illustrated in FIG. 3, the
method 200 further includes step 302 of recording a time stamp for
previously cached regions of the storage cluster 102 when the first node
reboots. Step 208 may further include sub-step 304 of purging or ignoring
first cache metadata entries associated with time stamps outside of a
selected time interval. Accordingly, the method 200 of accelerating cache
build for previously cached regions will only be effected for regions
associated with first cache metadata that receive data transfer requests
within the selected time interval. Ignoring or purging regions with expired time stamps ("stale regions") avoids caching regions of the storage cluster 102 that are no longer hot after the first node reboots and avoids wasting space in the cache memory 110 on low-activity "cold" cache data or metadata.
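Steps 302 and 304 could be realized (again purely as an illustration, with an assumed value for the selected time interval) by stamping each stale entry when the node reboots and ignoring or purging entries whose stamp has aged out before granting the reduced threshold:

    import time

    STALE_WINDOW_SECONDS = 4 * 60 * 60  # assumed "selected time interval"

    def stamp_stale_entries(entries: dict) -> None:
        """Step 302 (illustrative): record a time stamp when the node reboots."""
        now = time.time()
        for entry in entries.values():
            if entry.stale:
                entry.stamped_at = now

    def is_still_trusted(entry, now=None) -> bool:
        """Sub-step 304 (illustrative): expired stale entries are ignored."""
        now = time.time() if now is None else now
        return entry.stale and (now - entry.stamped_at) <= STALE_WINDOW_SECONDS

In the earlier on_io sketch, the entry.stale test would then become is_still_trusted(entry), so that only recently stamped regions receive the reduced M-hit threshold.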
[0021] It should be recognized that the various functions or steps
described throughout the present disclosure may be carried out by any
combination of hardware, software, or firmware. In some embodiments,
various steps or functions are carried out by one or more of the following:
electronic circuits, logic gates, field programmable gate arrays,
multiplexers, or computing systems. A computing system may include, but
is not limited to, a personal computing system, mainframe computing
system, workstation, image computer, parallel processor, or any other
device known in the art. In general, the term "computing system" is
broadly defined to encompass any device having one or more processors,
which execute instructions from a memory medium.
[0022] Program instructions implementing methods, such as those
manifested by embodiments described herein, may be transmitted over or
stored on a carrier medium. The carrier medium may be a transmission
medium, such as, but not limited to, a wire, cable, or wireless transmission

link. The carrier medium may also include a storage medium such as, but
not limited to, a read-only memory, a random access memory, a magnetic
or optical disk, or a magnetic tape.
[0023] It is further contemplated that any embodiment of the disclosure
manifested above as a system or method may include at least a portion of
any other embodiment described herein. Those having skill in the art will
appreciate that there are various embodiments by which systems and
methods described herein can be effected, and that the implementation
will vary with the context in which an embodiment of the disclosure
is deployed.
[0024] Furthermore, it is to be understood that the invention is defined by
the appended claims. Although embodiments of this invention have been
illustrated, it is apparent that various modifications may be made by those
skilled in the art without departing from the scope and spirit of the
disclosure.

CLAIMS
What is claimed is:
1. A system for managing cache memory, comprising:
a cache memory for at least one node of a plurality of nodes, the
cache memory including a first cache data associated with data
transferred between the at least one node and regions of a storage cluster
receiving at least a first selected number of data transfer requests, and
further including a first cache metadata associated with the regions of the
storage cluster receiving at least the first selected number of data transfer
requests; and
a cache manager in communication with the cache memory, the
cache manager configured to store a second cache data in the cache
memory when the at least one node is rebooted, the second cache data
being associated with data transferred between the at least one node and
one or more of the regions of the storage cluster receiving at least a
second selected number of data transfer requests after the at least one
node is rebooted, wherein the second selected number of data transfer
requests is less than the first selected number of data transfer requests.
2. The system of claim 1, wherein the second cache data is based at
least partially upon the first cache metadata.
3. The system of claim 2, wherein the cache manager is further
configured to identify regions of the storage cluster according to the first
cache metadata, wherein the second cache data is associated with one or
more of the identified regions of the storage cluster receiving at least the
second selected number of data transfer requests after the at least one
node is rebooted.

4. The system of claim 1, wherein the cache manager is further
configured to purge the first cache data from the cache memory when the
at least one node is disabled or when the at least one node is rebooted.
5. The system of claim 1, wherein the first cache metadata includes
time stamps corresponding to the regions of the storage cluster having
received at least the first selected number of data transfer requests.
6. The system of claim 5, wherein the cache manager is further
configured to purge at least a portion of the first cache metadata including
at least one time stamp outside of a selected time interval.
7. The system of claim 6, wherein the cache manager is further
configured to identify regions of the storage cluster according to at least a
portion of the first cache metadata including at least one time stamp within
the selected time interval, wherein the second cache data is associated
with one or more of the identified regions of the storage cluster receiving at
least the second selected number of data transfer requests after the at
least one node is rebooted.

8. A storage system, comprising:
a storage cluster including a plurality of storage devices;
a plurality of nodes in communication with the storage cluster;
a cache memory for at least one node of the plurality of nodes, the
cache memory including a first cache data associated with data
transferred between the at least one node and regions of the storage
cluster receiving at least a first selected number of data transfer requests,
and further including a first cache metadata associated with the regions of
the storage cluster receiving at least the first selected number of data
transfer requests; and
a cache manager in communication with the cache memory, the
cache manager configured to store a second cache data in the cache
memory when the at least one node is rebooted, the second cache data
being associated with data transferred between the at least one node and
one or more of the regions of the storage cluster receiving at least a
second selected number of data transfer requests after the at least one
node is rebooted, wherein the second selected number of data transfer
requests is less than the first selected number of data transfer requests.
9. The system of claim 8, wherein the second cache data is based at
least partially upon the first cache metadata.
10. The system of claim 9, wherein the cache manager is further
configured to identify regions of the storage cluster according to the first
cache metadata, wherein the second cache data is associated with one or
more of the identified regions of the storage cluster receiving at least the
second selected number of data transfer requests after the at least one
node is rebooted.
11. The system of claim 8, wherein the cache manager is further
configured to purge the first cache data from the cache memory when the
at least one node is disabled or when the at least one node is rebooted.

12. The system of claim 8, wherein the first cache metadata includes
time stamps corresponding to the regions of the storage cluster having
received at least the first selected number of data transfer requests.
13. The system of claim 12, wherein the cache manager is further
configured to purge at least a portion of the first cache metadata including
at least one time stamp outside of a selected time interval.
14. The system of claim 13, wherein the cache manager is further
configured to identify regions of the storage cluster according to at least a
portion of the first cache metadata including at least one time stamp within
the selected time interval, wherein the second cache data is associated
with one or more of the identified regions of the storage cluster receiving at
least the second selected number of data transfer requests after the at
least one node is rebooted.

15. A method of managing cache memory, comprising:
storing a first cache data associated with data transferred between
at least one node of a plurality of nodes and regions of a storage cluster
receiving at least a first selected number of data transfer requests;
storing a first cache metadata associated with the regions of the
storage cluster receiving at least the first selected number of data transfer
requests;
storing a second cache data when the at least one node is
rebooted, the second cache data being associated with data transferred
between the at least one node and one or more of the regions of the
storage cluster receiving at least a second selected number of data
transfer requests after the at least one node is rebooted, wherein the
second selected number of data transfer requests is less than the first
selected number of data transfer requests.
16. The method of claim 15, further comprising:
identifying regions of the storage cluster according to the first cache
metadata, wherein the second cache data is associated with one or more
of the identified regions of the storage cluster receiving at least the second
selected number of data transfer requests after the at least one node is
rebooted.
17. The method of claim 15, further comprising:
purging the first cache data when the at least one node is disabled
or when the at least one node is rebooted.
18. The method of claim 15, further comprising:
recording time stamps in the first cache metadata corresponding to
the regions of the storage cluster having received at least the first selected
number of data transfer requests.

19. The method of claim 18, further comprising:
purging at least a portion of the first cache metadata including at
least one time stamp outside of a selected time interval.
20. The method of claim 19, further comprising:
identifying regions of the storage cluster according to at least a
portion of the first cache metadata including at least one time stamp within
the selected time interval, wherein the second cache data is associated
with one or more of the identified regions of the storage cluster receiving at
least the second selected number of data transfer requests after the at
least one node is rebooted.

ABSTRACT

The disclosure is directed to a system and method for managing
cache memory of at least one node of a multiple-node storage cluster.
According to various embodiments, a first cache data and a first cache
metadata are stored for data transfers between a respective node and
regions of a storage cluster receiving at least a first selected number of
data transfer requests. When the node is rebooted, a second (new) cache
data is stored to replace the first (old) cache data. The second cache data
is compiled utilizing the first cache metadata to identify previously cached
regions of the storage cluster receiving at least a second selected number
of data transfer requests after the node is rebooted. The second selected
number of data transfer requests is less than the first selected number of
data transfer requests to enable a rapid build of the second cache data.

Documents

Application Documents

# Name Date
1 599-KOL-2013-(24-05-2013)-SPECIFICATION.pdf 2013-05-24
2 599-KOL-2013-(24-05-2013)-DESCRIPTION (COMPLETE).pdf 2013-05-24
3 599-KOL-2013-(24-05-2013)-CLAIMS.pdf 2013-05-24
4 599-KOL-2013-(24-05-2013)-ABSTRACT.pdf 2013-05-24
5 599-KOL-2013-(24-05-2013)-DRAWINGS.pdf 2013-05-24
6 599-KOL-2013-(24-05-2013)-FORM-1.pdf 2013-05-24
7 599-KOL-2013-(24-05-2013)-FORM-2.pdf 2013-05-24
8 599-KOL-2013-(24-05-2013)-FORM-3.pdf 2013-05-24
9 599-KOL-2013-(24-05-2013)-FORM-5.pdf 2013-05-24
10 599-KOL-2013-(24-05-2013)-CORRESPONDENCE.pdf 2013-05-24
11 599-KOL-2013-(20-11-2013)-ANNEXURE TO FORM 3.pdf 2013-11-20
12 599-KOL-2013-(20-11-2013)-ASSIGNMENT.pdf 2013-11-20
13 599-KOL-2013-(20-11-2013)-GPA.pdf 2013-11-20
14 599-KOL-2013-(20-11-2013)-CORRESPONDENCE.pdf 2013-11-20