Storage Efficiency Within Clustered Systems

Abstract: Disclosed herein is a system and a method for deduplication of a plurality of data blocks by storing unique single instance of the data block across a clustered system. A global fingerprint database and a plurality of slave node(s) are further disclosed. Global fingerprint database stores fingerprint of all the data blocks. For any incoming data block having a non matching fingerprint of the incoming data block, the global fingerprint database is updated with a fingerprint of the non matching data block. Whereas for any incoming data block having a matching fingerprint, redundant data block is removed, thereby creating unique single instance of the data block. Searching of the matching or non matching fingerprint of the incoming data block by respective slave node(s) is simultaneously undertaken by other available slave node(s) in the clustered system. This leads to efficient utilization of CPU of the available slave node(s).

Patent Information

Application #

Filing Date

27 February 2013

Publication Number

50/2014

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

ip@legasis.in

Parent Application

Patent Number

Legal Status

Grant Date

2021-04-29

Renewal Date

Applicants

Tata Consultancy Services Limited

Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

Inventors

1. SARAN, Ankur

Tata Consultancy Services Ltd, TCS Awadh Park, Vibhuti Khand, Gomti Nagar, Lucknow 226 010, Uttar Pradesh

Specification

CLIAMS:1. A method for providing a single instance data storage across a plurality of slave node(s) (214) in a clustered system (102) characterized in leveraging one or more available slave node(s) (214) for enabling the single instance data storage, the method comprising:
dividing an incoming data stream into a plurality of data blocks;
distributing the plurality of data blocks on a basis of minimum occupancy of the one or more slave node(s) (214);
creating a fingerprint for each data block of the plurality of data blocks;
storing the fingerprint into a global fingerprint database (106);
performing a search in real time in the global fingerprint database (106) to identify duplicate fingerprints associated with the one or more data blocks; and
removing data blocks associated with duplicate fingerprints when duplicate fingerprints are identified in the global fingerprint database (106).

2. The method of claim 1, wherein the global fingerprint database (106) is updated with a fingerprint of the non matching data block when a non matching fingerprint of the data block is detected.

3. The method of claim 1, wherein the minimum occupancy is computed for each other available slave node(s) (214) within a clustered system (102), so that the data blocks gets distributed across the available slave node(s) computed for minimum occupancy.

4. The method of claim 1, wherein any slave node(s) (214) amongst a plurality of available slave node(s) (214) is enabled to function as a master node (212).

5. The method of claim 1, further comprising computing of a checksum for each of the divided data block.
6. The method of claim 1, further comprising creating of a temporary fingerprint database comprising a data block address, the checksum, fingerprint, and other block related information.
7. The method of claim 6, wherein search of new fingerprints is done on fingerprints in the global fingerprint database (106) for items in the temporary fingerprint database.

8. The method of claim 7, comprising merging the fingerprint database with the global fingerprint database.

9. The method of claim 1, wherein the searching of duplicate entries of fingerprint in the global fingerprint database (106) comprises searching duplicate entries of fingerprint in the global fingerprint database (106) using a MapReduce function.

10. The method of claim 9, wherein the searching of duplicate entries of fingerprint in the global fingerprint database (106) further comprising:
mapping the searching of duplicate entries of the fingerprint to a subset of the plurality of slave node(s) (214) comprises distributing the fingerprint, node identifier, and block identifier across the subset of the plurality of slave node(s) (214); and
reducing result by collecting only count of repetition of the fingerprint found in the searching process, the node identifier and block identifier.

11. The method of claim 10, wherein the subset of plurality of the slave node(s) (214) is selected by the MapReduce function randomly.

12. The method of claim 10, wherein subset of the plurality of the slave node(s) (214) is selected by the MapReduce function on basis of CPU utilization by computing occupancy index at the master node (212) for each other available slave node(s) (214) within the clustered system, so that searching of duplicate data block of fingerprint in the global fingerprint database (106)gets distributed across the available slave node(s) (214).

13. The method of claim 10, wherein if the mapping of the searching of duplicate entries of the fingerprint to one or more of the subset of the plurality of slave node(s) (214) fails, distribution of the fingerprint, node identifier, and block identifier to the one or more of the subset of the plurality of slave node(s) (214) is retriggered.

14. The method of claim 1, wherein each of the slave node(s) (214) comprises a storing unit (218) and a processing unit (216).

15. The method of claim 1, wherein the data block is of fixed size.

16. A system for providing a single instance data storage in a clustered system characterized in leveraging one or more available slave node(s) therefrom for enabling said single instance storage, the system comprising:
a processor (202);
a memory (206) coupled to the processor, wherein the processor is capable of executing program instructions stored in the memory (206) to perform:
dividing an incoming data stream into a plurality of data blocks;
distributing the plurality of data blocks on a basis of minimum occupancy of the one or more slave node(s) (214);
creating a fingerprint for each data block of the plurality of data blocks;
storing the fingerprint into a global fingerprint database (106);
performing a search in real time in the global fingerprint database (106) to identify duplicate fingerprints associated with the one or more data blocks; and
removing data blocks associated with duplicate fingerprints when duplicate fingerprints are identified in the global fingerprint database (106).

17. The method of claim 16, wherein the global fingerprint database (106) is updated with a fingerprint of the non matching data block when a non matching fingerprint of the incoming data block is detected.

18. The system of claim 16, wherein the minimum occupancy is computed for each other available slave node(s) (214) within the clustered system (102), so that the data blocks gets distributed across the available slave node(s) (214) computed for minimum occupancy.

19. The system of claim 16, wherein any slave node(s) (214) amongst a plurality of available slave node(s) (214) is enabled to function as a master node (212).

20. The system of claim 16, wherein the searching of duplicate entries of fingerprint in the global fingerprint database (106) comprises searching duplicate entries of fingerprint in the global fingerprint database (106) using a MapReduce function.

21. The system of claim 20, wherein the searching of duplicate entries of fingerprint in the global fingerprint database (106) further comprising:
mapping the searching of duplicate entries of the fingerprint to a subset of the plurality of slave node(s) (214) comprises distributing the fingerprint, node identifier, and block identifier across the subset of the plurality of slave node(s) (214); and
reducing result by collecting only count of repetition of the fingerprint found in the searching process, the node identifier and block identifier.
,TagSPECI:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
STORAGE EFFICIENCY WITHIN CLUSTERED SYSTEMS

Applicant
TATA Consultancy Services Limited
A company Incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed

TECHNICAL FIELD
[001] The invention relates generally to a data processing system, and more particularly to a system and a method of storing unique single instance of data across a clustered system resulting in improved storage efficiency.

BACKGROUND
[002] Data storage is a vital part for most of data processing systems. It assumes greater significance in distributive and clustered data processing systems where more storage is required. A clustered system is formed by a plurality of nodes connected to each other, wherein nodes relate to data processing systems and can store data as well. Data processing systems process a huge volume of data that has to be stored somewhere. Generally, such data is stored on a media, both virtual and physical. With rate of data productions increasing each day, conventional methods of storing data are proving to be inadequate. There exists a need for a new technology that can handle storage requirements of ever increasing data.
[003] There can be various ways to expand storage, some of which are as follows:
a) Expansion of the storage capacity of storage media: This method of expanding the storage media is not practically feasible in view of ever increasing storage need. Addition of extra storage space by increasing storage media or increasing media size may be one time solution but not reliable in long run as there would always be requirement of more storage space and expanding the size capacity to accommodate more new media is limited. Also, affording a large storage system that can accommodate large no. of media may not be economically viable option for storage users.
b) Other way to meet increasing requirement of storage is by scaling the storage system horizontally. This allows flexibility in the initial setup of storage, infrastructure setup, scalability and manageability.
[004] Typically, methods disclosed in prior art for data deduplication of the redundant data operates at a particular node irrespective of the availability of the other free nodes which can otherwise participate in the process of deduplication within the distributed computing environment. This eventually leads to inefficient utilization of CPU thereby impacting system performance efficiency and resource utilization ability.
[005] Further, standard single node storage systems regard deduplication as redundant feature when compared to main storage task i.e. handling read/write requests and hence storage efficiency features runs as low priority task that can run in background at scheduled time mostly nightly, weekly or fortnightly and that too at back units not at front end storage box.
[006] Therefore, in light of the above-mentioned problems, it would be desirable to have:
1. Enhanced storage efficiency at the cluster level rather than at node level by performing deduplication of data at the cluster level.
2. Efficient utilization of free nodes in the clustered system and:
3. Enhanced inline storage efficiency (deduplication), wherein process of data deduplication is simultaneously undertaken by available free nodes, as yet another high priority task.
SUMMARY
[007] This summary is provided to introduce aspects related to systems and methods for storing unique single instance of data across a clustered system resulting in improved storage efficiency and the aspects are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[008] In an implementation, a method for providing a single instance data storage across a plurality of slave node(s) 214 in a clustered system 102, characterized in leveraging one or more available slave node(s) 214 for enabling said single instance storage is shown. 214. The slave node(s) 214 receiving the data block creates a fingerprint of the data block and stores the fingerprint to a global fingerprint database 106. In another embodiment, creation of a fingerprint of the data block and storing the fingerprint to a global fingerprint database 106 may be done by a master node 212. Then at the global fingerprint database 106, a search for duplicate fingerprints for each incoming data block is performed in real time. If the fingerprint of the incoming data block is non matching, then the global fingerprint database 106 is updated with the fingerprint of the non matching data block. If the fingerprint of the incoming data block is matching with a fingerprint already stored in the global fingerprint database 106, the data block with the matching fingerprint is removed, thereby creating the single instance of the data block.
[009] In another implementation, a system for providing a single instance data storage in the clustered system 102, characterized in leveraging one or more available slave node(s) 214 therefrom for enabling said single instance storage is disclosed. The system comprises a processor 202 and a memory unit 206 coupled to the processor 202. The processor 202 is capable of executing program instructions stored in the memory 206 to perform a plurality of functions. 214. The slave node(s) 214 receiving the data block creates a fingerprint of the data block and stores the fingerprint to a fingerprint database. In another embodiment, creation of a fingerprint of the data block and storing the fingerprint to a global fingerprint database 106 may be done by a master node 212. Then at the global fingerprint database 106, a search for duplicate fingerprints for each incoming data block is performed in real time. If the fingerprint of the incoming data block is non matching, then the global fingerprint database 106 is updated with the fingerprint of the non-matching data block from the temporary fingerprint database. If the fingerprint of the incoming data block is matching with a fingerprint already stored in the global fingerprint database, the redundant data block with the matching fingerprint is removed, thereby creating the single instance of the data block.

BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.
[0011] FIG. 1 illustrates a network implementation of a clustered system for deduplication of data blocks by storing unique single instance of the data block across a clustered system, in accordance with an embodiment of the present subject matter.
[0012] FIG. 2 illustrates the clustered system, in accordance with an embodiment of the present subject matter.
[0013] FIG. 3 discloses an operational environment in which the invention will be performed, in accordance with an embodiment of the present subject matter.
[0014] FIG. 4 illustrates structure of global fingerprint database in accordance with an embodiment of the present subject matter.
[0015] FIG. 5 illustrates a global fingerprint database where unique single instance of data block is stored by updating new entries to the global fingerprint database, in accordance with an embodiment of the present subject matter.
[0016] FIG. 6 illustrates the global fingerprint database where there is introduction of a new data block which is duplicate to already existing block with fingerprint, in accordance with an embodiment of the present subject matter.
[0017] Fig. 7 is a flowchart illustrating method of storing unique single instance of data block across the clustered system, in accordance with an embodiment of the present subject matter.

DETAILED DESCRIPTION
[0018] Definitions:
[0019] Data Block: All data on the media (virtual or physical) is stored using a structure where atomic level of data stored in the system is called data block. This data block can be of fixed or variable size, depending on requirement of the system used. These data blocks are obtained by dividing a data stream.
[0020] Fingerprint: A unique identifier of the data block is called as fingerprint.
[0021] The present subject matter discloses a system and a method for deduplication of the data blocks by storing a unique single instance of the data block across a clustered system. It results in improved system efficiency. The present subject matter discloses a master node and a plurality of slave node(s). The master node and the plurality of slave node(s) are connected to each other to form the clustered system.
[0022] The master node comprises a global fingerprint database that stores fingerprint of all the data blocks stored across the clustered system. The master node always keeps itself updated about status of utilization of the CPU of the slave node(s). For any incoming data block having a non matching fingerprint of the incoming data block, the global database is updated with a fingerprint of the non matching data block. Whereas for any incoming data block having a matching fingerprint, redundant data block is removed, thereby creating unique single instance of the data block. Searching of the matching or non matching fingerprint of the incoming data block by respective slave node(s) is simultaneously undertaken by relatively free slave node(s) in the clustered system. This leads to efficient utilization of CPU of the available slave node(s) and deduplication of data being performed at the cluster level. In one embodiment of the present subject matter, dedupe engine of the master node also searches for the fingerprints in the global fingerprint database 106.
[0023] While aspects of described system and method for deduplication of the data blocks by storing a unique single instance of the data block across the clustered system is disclosed. The system and method may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system.
[0024] Referring now to FIG. 1, a network implementation 100 for deduplication of the data blocks by storing a single unique unique single instance of the data block is illustrated, in accordance with an embodiment of the present subject matter.
[0025] In another embodiment, a system and a method for deduplication of the data blocks by storing a unique single instance of the data block across the clustered system 102 is illustrated. It results in improved system efficiency. The invention discloses the clustered system 102 and plurality of user devices collectively referred to as 104. The clustered system 102 comprises a global fingerprint database 106 (shown in FIG.2) that stores fingerprint of all the data blocks. The clustered system 102 always keeps itself updated about status of utilization of the CPU of the user devices 104. For any incoming data block having a non matching fingerprint of the incoming data block, the global fingerprint database 106 is updated with a fingerprint of the non matching data block. Whereas for any incoming data block having a matching fingerprint, redundant data block is removed, thereby creating unique single instance of the data block. Searching of the matching or non matching fingerprint of the incoming data block by respective slave node(s) 214 is simultaneously undertaken by relatively free slave node(s) 214 in the clustered system 102. This leads to efficient utilization of CPU of the available slave node(s) 214 and deduplication of data being performed at the cluster level.
[0026] Although the present subject matter is explained considering that the clustered system 102 is implemented as having a processor (not shown) and the global fingerprint database 106 (shown in Figure 2), it may be understood that the clustered system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the clustered system 102 may be accessed by multiple users through one or more user devices 104-1, 104-2…104-N, collectively referred to as user 104 hereinafter, or applications residing on the slave node(s) 104. Examples of the slave node(s) 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, and a workstation. The user devices 104 are communicatively coupled to the clustered system 102 through a network 108.
[0027] In one implementation, the network 108 may be a wireless network, a wired network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0028] Referring now to FIG. 2, the clustered system 102 is illustrated in accordance with an embodiment of the present subject matter. In one embodiment, the clustered system 102 may include at least one processor 202, an input/output (I/O) interface 204, and a memory 206. Then at least one processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 206.
[0029] The I/O interface 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 204 may allow the clustered system 102 to interact with a user directly or through the client devices 104. Further, the I/O interface 204 may enable the clustered 102 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 204 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or more ports for connecting a number of devices to one another or to another server.
[0030] The memory 206 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 may include a master node 212 and a slave node(s) 214, and data 208.
[0031] The data 210, amongst other things, serves as a repository for storing data processed, received, and generated by the master node 212 and the slave node(s) 214. The data 210 may also include the global fingerprint database 106, and other data 130.
[0032] In one implementation, at first, a user may use the client device 104 to access the clustered system 102 via the I/O interface 204. The user may register themselves using the I/O interface 204 in order to use the clustered system 102. The working of the clustered system 102 may be explained in detail in Figures 3-5 explained below. The clustered system 102 may be used for deduplication of the data blocks by storing a unique single instance of the data block.
[0033] FIG. 3 discloses the clustered system 102 in which the invention will be performed according to an embodiment of the invention. Shown in the FIG.3 is the master node 212 and a plurality of slave node(s) 214. The master node 212 comprises the global fingerprint database 106. The master node 212 can be selected by quorum or it can be a dedicated master depending on requirement. Any slave node(s) 214 among a plurality of available slave node(s) 214 is capable of functioning as the master node 212. Each of the plurality of slave node(s) 214 has its own processing unit 216 and a storage unit 218. The global fingerprint database 106 is a universal structure stored on the master node 212 with multiple (one or more than 1) replica stored on other slave node(s) 214 that might act as failover slave node(s) 214 to the master node 212.
[0034] An incoming data stream is received at the master node 212 wherein the data stream is divided into data blocks A, B, C, D, E, F. The data blocks may be of fixed or variable size. The master node 212 computes a checksum for each of the data blocks to maintain the integrity of the data blocks throughout the clustered system 102. The checksum is stored in the global fingerprint database 106. In an embodiment of the invention, fingerprint is also computed by the master node 212 for each of the incoming data blocks. Then the data blocks A, B, C, D, E, and F are distributed among the slave node(s) 214. As shown in the FIG. 1, blocks A, B are stored on slave 1, blocks C,D stored on slave node(s) 2, and blocks E, F stored on slave n.
[0035] Distribution of the data blocks among the slave node(s) 214 is based on many criterion. In an embodiment of the invention, the master node 10 computes occupancy index for each of the available slave node(s) 214 within a clustered system 102, so that the data blocks gets distributed across the available slave node(s) 214 computed for minimum occupancy. For example, if the utilization of a particular CPU is less than 50 %, data block would be stored on the respective slave node(s) 214. In another embodiment of the invention, data blocks would be distributed randomly among the slave node(s) 214. At the slave node(s) 214, fingerprint of each of the data blocks is calculated. In an embodiment of the present subject matter, fingerprint is calculated by the master node 212.
[0036] At the respective slave node(s) 214, a local temporary fingerprint database is created comprising of block id of new blocks, respective checksum, and fingerprint of the data block. According to an exemplary embodiment of the invention, calculation of fingerprint may be performed, but not limited to, by combining checksum, first, middle, and last block, and XOR of preselected bits from 8, 16, 32, 64 to last 4 bits of the data block. The search of new fingerprints in the temporary fingerprint database is performed. If there is a matching fingerprint, required removal of duplicate entry is done in temporary fingerprint database and global fingerprint is updated about it by appending the new block id and node id to the already existing fingerprint on global fingerprint database 106. Once this process is complete, the temporary fingerprint database is merged to global fingerprint database 106.
[0037] FIG. 4 illustrates structure of global fingerprint data base 106 in accordance with an embodiment of the present subject matter. Structure of global fingerprint data base 106 is combination of node based fingerprint database that are named after the slave node(s)(s) 214 the individual file represents. All the slave node(s) 214 have access to read the fingerprint database of each slave node(s) 214 in the global fingerprint database 106 but master node 212 and owner of the temporary fingerprint database can only write to the temporary fingerprint database in global fingerprint database 106. Global fingerprint data base 106 is updated by temporary files (temporary is node specific) formed by information about all the new entries in slave node(s) 214; in the Files Named after node in global fingerprint database 106.
[0038] Merging of fingerprints to the global fingerprint database 106 is CPU intensive and complex. Therefore, it is performed in a plurality of steps:
1) For every incoming data block, search for duplicate entries of fingerprint is performed by MapReduce function in following steps:
a) Mapping the task of search of duplicate entries to other nodes. In one embodiment, the slave node(s) 214 are selected randomly. In another embodiment, nodes are selected on the basis of occupancy index computed for each of the available slave node(s) 214. Slave node(s) 214 with minimum occupancy are selected.
Map(fingerprint, database index)
Search in the global fingerprint database 106 for fingerprint of the data block is distributed among other slave node(s) 214 of the clustered system 102 by distributing the fingerprint and its global fingerprint database index that comprises references to the data block address. If the mapping task fails, the process is retriggered.
b) Once the mapping is finished, reducer comes into picture. Reducer function is shown as follows:
Reduce (count, Location)
It reduces the result by collecting the count of required fingerprint repetition found in the search process and its location in the clustered system 102, i.e. node id and data block id.
2) Any new data block having fingerprint matching in the existing global fingerprint database 106, implies existence of duplicate data block in the clustered system 102 and hence new block need to be removed and the pointer of existing block is used in place of new data block. In case where no matching fingerprint is found for the new fingerprint, then the global fingerprint database 106 is updated with the new fingerprint.
we can use map/reduce to search the global fingerprint database 106 as it will utilize relatively free nodes of CPU for searching the global fingerprint database 106 making process a parallel process This improves the CPU utilization of system, improves processing time and as well as improves the storage efficiency feature making it a inline process
[0039] FIG. 5 illustrates a global fingerprint database 106 where first data sequence is stored by updating new entries to global fingerprint database 106. Shown in FIG.5 are fingerprint which is data block identifier, reference count which indicates how many times the data block is stored in the global fingerprint database 106. Further, shown is block identifier which acts as reference to actual data block stored in respective slave node(s) 214.

[0040] FIG. 6 illustrates a global fingerprint database 106 where there is introduction of a new data block which is duplicate to already existing block with fingerprint as ‘A’
As shown, fingerprint A is shared by two block references, i.e., Slave 1, block 1; slave 2 block3. The global fingerprint database 106 is updated for duplicate entry in the clustered system 102.
Once the reference is done the duplicate block is removed and we have only one instance of block with fingerprint ‘A’.
1. To avoid collision and locks in global fingerprint database 106, All slave node(s) 214 have exclusive rights to its part of global fingerprint database 106 where as other nodes just have read-only permissions and owner and master node only have write permission with highest priority of right to master of cluster system (102).
2. All the nodes also share the global fingerprint database 106 by multiple means including NFS/CIFS share other possible way of communication hence all the slave node(s) 214 have the latest update for the changes occurring in the clustered system 102.

[0041] An advantage of the present subject matter is to provide a method and a system for deduplication of data at cluster level with unique instance of data being stored across pluratity of nodes of the clustered system.
[0042] Another advantage of the present subject matter is to provide for in line storage efficiency, wherein process of deduplication of data is executed in parallel on relatively free nodes without effecting capacity of the system.
[0043] Yet another advantage of the present subject matter is to provide storage efficiency across the clustered system rather than at nodes only.
[0044] Further advantage of the present subject matter is to provide for better utilization of CPU using relatively free nodes.
[0045] Still further advantage of the present subject matter is to obviate the need for an extra hardware for providing deduplication within the clustered system.
[0046] Further object of the invention is to provide for a global storage database that stores unique identifiers of data present within each node of the clustered system.
[0047] Referring to FIG. 7, a flowchart illustrating method 300 of storing unique single instance of data block across the clustered system 102 is shown in accordance with an embodiment of the present subject matter. An incoming data stream is received at the master node 212 wherein the data stream is divided into data blocks A, B, C, D, E, F (Block 302). In one embodiment, dividing may also be performed by the slave node(s) 214. The data blocks may be of fixed or variable size.
[0048] The master node 212 computes a checksum for each of the data blocks to maintain the integrity of the data blocks throughout the clustered system 102 (Block 304). The checksum is stored in the global fingerprint database 106.
[0049] In another block 306, the data blocks A, B, C, D, E, and F are distributed among the slave node(s) 214. In an embodiment of the invention, the master node 212 computes occupancy index for each of the available slave node(s) 214 within the clustered system 102, so that the data blocks gets distributed across the available slave node(s) 214 computed for minimum occupancy (block 308). In another embodiment of the invention, the data blocks are distributed randomly.
[0050] In yet another block 310, at the slave node(s) 214, the fingerprint of each of the data blocks is calculated. In an embodiment of the invention, fingerprint is created by the master node 212. In yet another block 312, a local temporary fingerprint database is created comprising of block id of new blocks, respective checksum, and fingerprint of the data block, and other data block related information at the respective slave node(s) 214.
[0051] In yet another block 314, the temporary fingerprint database is merged with the global fingerprint database 106, thereby storing the fingerprint to the global fingerprint database 106.
[0052] The merger of fingerprint to the global fingerprint database 106 is CPU intensive as it can contain redundant blocks information; therefore, it is done by using MapReduce function. The MapReduce function utilizes relatively free node of CPU for searching the global fingerprint database for any existing fingerprint in the clustered system 102.This improves the CPU utilization as well as improves the storage efficiency making it a inline process.
[0053] In another block 316, searching of duplicate entries of fingerprints in the global fingerprint database 14 is mapped to relatively free slave node(s) 214 by using the MapReduce function. if the mapping of the searching of duplicate entries of the fingerprint to one or more of the subset of the plurality of slave node(s) 214 fails, distribution of the fingerprint, node identifier, and block identifier to the one or more of the subset of the plurality of slave node(s) is retriggered.
[0054] Once the mapping is done, Reducer function comes into picture. The Reducer function collects count of fingerprint repetition found in the search process and its location in the clustered system 102.
[0055] In yet another block 318, a determination is made. If a matching fingerprint is found in the global fingerprint database 106, then the global fingerprint database 106 is updated and the duplicate block is deleted and a pointer is placed in the global fingerprint database as well in temporary fingerprint database of the node where data block is deleted 106 to actual data block in the clustered system 102 (block 320). If no match is found, then fingerprint is added to the global fingerprint database 14 (block 322).
[0056] Although implementations for methods and systems for deduplication of data at cluster level with unique instance of data being stored across multiple nodes of the clustered system have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations for deduplication of data at cluster level with unique instance of data being stored across multiple nodes of the clustered system 102.

Documents

Application Documents

#	Name	Date
1	Form2_P327_Final.pdf	2018-08-11
2	ABSTRACT1.jpg	2018-08-11
3	580-MUM-2013-FORM 26(4-4-2013).pdf	2018-08-11
4	580-MUM-2013-FORM 18(5-3-2013).pdf	2018-08-11
5	580-MUM-2013-FORM 1(19-4-2013).pdf	2018-08-11
6	580-MUM-2013-CORRESPONDENCE(5-3-2013).pdf	2018-08-11
7	580-MUM-2013-CORRESPONDENCE(4-4-2013).pdf	2018-08-11
8	580-MUM-2013-CORRESPONDENCE(19-4-2013).pdf	2018-08-11
9	580-MUM-2013-FER.pdf	2019-01-11
10	580-MUM-2013-OTHERS [11-07-2019(online)].pdf	2019-07-11
11	580-MUM-2013-FER_SER_REPLY [11-07-2019(online)].pdf	2019-07-11
12	580-MUM-2013-DRAWING [11-07-2019(online)].pdf	2019-07-11
13	580-MUM-2013-COMPLETE SPECIFICATION [11-07-2019(online)].pdf	2019-07-11
14	580-MUM-2013-CLAIMS [11-07-2019(online)].pdf	2019-07-11
15	580-MUM-2013-FORM-26 [23-02-2021(online)].pdf	2021-02-23
16	580-MUM-2013-Correspondence to notify the Controller [23-02-2021(online)].pdf	2021-02-23
17	580-MUM-2013-FORM-26 [24-02-2021(online)].pdf	2021-02-24
18	580-MUM-2013-Written submissions and relevant documents [11-03-2021(online)].pdf	2021-03-11
19	580-MUM-2013-PatentCertificate29-04-2021.pdf	2021-04-29
20	580-MUM-2013-IntimationOfGrant29-04-2021.pdf	2021-04-29
21	580-MUM-2013-US(14)-HearingNotice-(HearingDate-25-02-2021).pdf	2021-10-03
22	580-MUM-2013-RELEVANT DOCUMENTS [30-09-2023(online)].pdf	2023-09-30

Search Strategy

1	search_28-11-2018.pdf