Active Archiving Of Data On A Distributed File System

Abstract: Systems and methods for active archiving of data onto a distributed file system are described. A data archiving system may implement an archiving method that includes receiving a request from an authenticated user to archive data onto the distributed file system, wherein the data to be archived is located at a data source, and transferring the data to be archived onto the distributed file system based on a distributed file transfer mechanism, wherein the data is one of structured data and unstructured data. The method further includes loading the data to be archived onto the distributed file system based on an HBase bulk load mechanism. The method also includes indexing the data loaded onto the distributed file system to generate at least one index corresponding to the data, wherein the indexing comprises segmenting the data into a plurality of index segments.

Patent Information

Application #

Filing Date

14 March 2014

Publication Number

39/2015

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

iprdel@lakshmisri.com

Parent Application

Patent Number

Legal Status

Grant Date

2022-06-30

Renewal Date

Applicants

TATA CONSULTANCY SERVICES LIMITED

Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021,

Inventors

1. KUTTAN, Binesh

7th Floor, TCS Office, Tejomaya, infopark, Kochi,

2. JACOB, Vivek

7th Floor, TCS Office, Tejomaya, infopark, Kochi,

3. VARGHESE, Abraham

7th Floor, TCS Office, Tejomaya, infopark, Kochi,

4. JOHN, Thomas Jeby

7th Floor, TCS Office, Tejomaya, infopark, Kochi,

Specification

CLAIMS: I/We claim:
1. A method for active archiving of data onto a distributed file system, the method comprising:
receiving a request from an authenticated user to archive the data on to the distributed file system, wherein the data to be archived is located at a data source, and wherein the data is one of a structured data and an unstructured data;
transferring the data to be archived onto the distributed file system based on a distributed file transfer mechanism;
loading the data to be archived onto the distributed file system based on Hbase bulk load mechanism;
indexing the data loaded onto the distributed file system to generate at least one indices corresponding to the data, wherein the indexing comprises segmenting of data into plurality of index segments, and wherein the at least one indices correspond to the plurality of index segments; and
transferring the at least one indices to a search engine from the distributed file system.
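
The indexing step of claim 1 (segmenting the data into index segments and generating an index for each segment) can be illustrated with a minimal sketch. This is not the claimed implementation and does not use the Lucene API mentioned in claim 6; it assumes records are plain text strings, and the function names `split_into_index_segments`, `build_segment_index`, and `index_records` are hypothetical.

```python
from collections import defaultdict


def split_into_index_segments(records, segment_size):
    """Split records into consecutive index segments of at most segment_size records."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    return [records[i:i + segment_size] for i in range(0, len(records), segment_size)]


def build_segment_index(segment, first_record_id):
    """Build an inverted index (term -> sorted record ids) for one segment.

    Assumption: whitespace tokenisation stands in for a real analyzer.
    """
    index = defaultdict(set)
    for offset, text in enumerate(segment):
        for term in text.lower().split():
            index[term].add(first_record_id + offset)
    return {term: sorted(ids) for term, ids in index.items()}


def index_records(records, segment_size):
    """Return one index per index segment, in segment order."""
    indices = []
    first_id = 0
    for segment in split_into_index_segments(records, segment_size):
        indices.append(build_segment_index(segment, first_id))
        first_id += len(segment)
    return indices
```

Each per-segment index is independent of the others, which is what would allow the indices to be built separately and then handed to a search engine one at a time.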

2. The method as claimed in claim 1, wherein the transferring the data based on the distributed file transfer mechanism comprises:
segmenting the data to be stored onto a plurality of data nodes of the distributed file system into a plurality of virtual segments, wherein at least one virtual segment from amongst the plurality of virtual segments is stored onto a data node from amongst the plurality of data nodes;
sending a plurality of secure connection requests to the plurality of data nodes for transferring the plurality of virtual segments to the plurality of data nodes;
obtaining at least one file transfer protocol connection request corresponding to each of the plurality of secure connection requests from the plurality of data nodes; and
transferring the plurality of virtual segments to the plurality of data nodes of the distributed file system through the plurality of secure connections, in parallel, wherein a virtual segment from amongst the plurality of virtual segments is transferred to a data node from amongst the plurality of data nodes.
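
The segmentation and parallel transfer described in claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the claimed mechanism: the secure connection handshake is not modelled, each data node receives exactly one virtual segment, and the `send` callable is a hypothetical stand-in for whatever transport actually moves a segment to a node.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_virtual_segments(data, node_count):
    """Split data into node_count contiguous segments of near-equal size."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    base, extra = divmod(len(data), node_count)
    segments, start = [], 0
    for i in range(node_count):
        # The first `extra` segments carry one additional byte each.
        end = start + base + (1 if i < extra else 0)
        segments.append(data[start:end])
        start = end
    return segments


def transfer_in_parallel(data, nodes, send):
    """Send one virtual segment to each node concurrently.

    `send(node, segment)` is a hypothetical transport callback; a real
    system would open a secure connection to the node here.
    """
    segments = split_into_virtual_segments(data, len(nodes))
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(send, node, seg) for node, seg in zip(nodes, segments)]
        for future in futures:
            future.result()  # re-raise any transfer error in the caller
    return segments
```

For example, passing a dictionary's `__setitem__` method as `send` stores each segment under its node name, which is enough to check that the segments reassemble into the original data.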

3. The method as claimed in claim 1, wherein the Hbase bulk load mechanism comprises:
segmenting each virtual segment from amongst the plurality of virtual segments into a plurality of subsets of data, wherein each of the plurality of subsets of data includes one or more records;
generating a set of intermediate key-value pairs for each of the one or more records in each of the plurality of subsets of data, wherein a key-value pair includes a key and a value corresponding to the key, and wherein the key is a unique identifier of the value;
sorting the set of intermediate key-value pairs to generate a plurality of output files, wherein each of the plurality of output files includes at least one key-value pair; and
storing the plurality of output files in the distributed file system, wherein the plurality of output files is representative of the data.
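
The bulk load steps of claim 3 (split into subsets, emit intermediate key-value pairs, sort, write output files) resemble a map-and-sort pipeline. The sketch below shows that shape in plain Python; it does not call any HBase API, the output "files" are in-memory lists, and the names `prepare_bulk_load`, `key_of`, and `pairs_per_file` are hypothetical.

```python
def split_into_subsets(records, subset_size):
    """Split the records of one virtual segment into subsets of at most subset_size."""
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    return [records[i:i + subset_size] for i in range(0, len(records), subset_size)]


def map_records(subset, key_of):
    """Emit one intermediate (key, value) pair per record in the subset."""
    return [(key_of(record), record) for record in subset]


def prepare_bulk_load(records, subset_size, key_of, pairs_per_file):
    """Return sorted key-value pairs grouped into output files.

    Assumption: `key_of` yields a unique, sortable key for each record.
    """
    if pairs_per_file < 1:
        raise ValueError("pairs_per_file must be at least 1")
    pairs = []
    for subset in split_into_subsets(records, subset_size):
        pairs.extend(map_records(subset, key_of))
    pairs.sort(key=lambda kv: kv[0])  # global sort by key
    return [pairs[i:i + pairs_per_file] for i in range(0, len(pairs), pairs_per_file)]
```

Sorting before writing means every output file covers a contiguous key range, which is the property that lets sorted files be loaded into a key-ordered store without rewriting them.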

4. The method as claimed in claim 3, wherein each of the plurality of output files is further stored onto the distributed file system as at least one HFile, wherein the at least one HFile includes one or more key-value pairs, and wherein keys corresponding to the one or more key-value pairs are same.

5. The method as claimed in claim 1, wherein the distributed file system is a Hadoop Distributed File System (HDFS).

6. The method as claimed in claim 1, wherein the at least one indices is generated based on a Lucene Application Programming Interface (API).

7. The method as claimed in claim 1, wherein the method further comprises receiving another request from the authenticated user to perform at least one of retrieval of archived data and purging of the archived data.

8. The method as claimed in claim 1, wherein the method further comprises scheduling the request based on a pre-defined scheduling mechanism.

9. The method as claimed in claim 1, wherein the method further comprises:
identifying failure of the request to archive data onto the distributed file system;
notifying the authenticated user of the failure; and
receiving another request from the authenticated user to reinitiate the data archiving onto the distributed file system.

10. A data archiving system (102) for active archiving of data onto a distributed file system (108), the data archiving system (102) comprising:
a processor (110);
a user management module (120), coupled to the processor (110), to receive a request from an authenticated user to archive the data on to the distributed file system (108), wherein the data to be archived is located at a data source;
an analysis and processing module (118), coupled to the processor (110), to:
transfer the data to be archived onto the distributed file system (108) based on a distributed file transfer mechanism, wherein the data is one of a structured data and an unstructured data; and
load the data to be archived onto the distributed file system (108) based on Hbase bulk load mechanism;
an indexing module (124), coupled to the processor (110), to index the data loaded onto the distributed file system (108) to generate at least one indices corresponding to the data, wherein the indexing comprises segmenting of data into plurality of index segments, and wherein the at least one indices correspond to the plurality of index segments; and
a communication module (122), coupled to the processor (110), to transfer the at least one indices to a search engine from the distributed file system (108).

11. The data archiving system (102) as claimed in claim 10, wherein the analysis and processing module (118) based on the distributed file transfer mechanism segments the data to be stored onto a plurality of data nodes of the distributed file system (108), into a plurality of virtual segments, wherein at least one virtual segment from amongst the plurality of virtual segments is stored onto a data node from amongst the plurality of data nodes.

12. The data archiving system (102) as claimed in claim 10, wherein the communication module (122) based on the distributed file transfer mechanism:
sends a plurality of secure connection requests to the plurality of data nodes for transferring the plurality of virtual segments to the plurality of data nodes;
obtains at least one file transfer protocol connection request corresponding to each of the plurality of secure connection requests from the plurality of data nodes; and
transfers the plurality of virtual segments to the plurality of data nodes of the distributed file system (108) through the plurality of secure connections, in parallel, wherein the virtual segment from amongst the plurality of virtual segments is transferred to the data node from amongst the plurality of data nodes.

13. The data archiving system (102) as claimed in claim 10, wherein the analysis and processing module (118) based on the Hbase bulk load mechanism:
segments each virtual segment from amongst the plurality of virtual segments into a plurality of subsets of data, wherein each of the plurality of subsets includes one or more records;
generates a set of intermediate key-value pairs for each of the one or more records in each of the plurality of subsets of data, wherein a key-value pair includes a key and a value corresponding to the key, and wherein the key is a unique identifier of the value;
sorts the set of intermediate key-value pairs to generate a plurality of output files, wherein each of the plurality of output files includes at least one key-value pair; and
stores the plurality of output files in the distributed file system (108), wherein the plurality of output files is representative of the data.

14. The data archiving system (102) as claimed in claim 10, wherein the user management module (120) schedules a plurality of requests from authenticated users, wherein the plurality of requests include at least one of archiving requests, retrieval requests, and purging requests.

15. The data archiving system (102) as claimed in claim 10, wherein the data archiving system (102) further comprises a protection module (126), coupled to the processor (110), to encrypt and decrypt the data.

16. A non-transitory computer-readable medium having embodied thereon a computer program for executing a method comprising:
receiving a request from an authenticated user to archive a data on to a distributed file system, wherein the data to be archived is located at a data source;
transferring the data to be archived onto the distributed file system based on a distributed file transfer mechanism, wherein the data is one of a structured data and an unstructured data;
loading the data to be archived onto the distributed file system based on Hbase bulk load mechanism;
indexing the data loaded onto the distributed file system to generate at least one indices corresponding to the data, wherein the indexing comprises segmenting of data into plurality of index segments, and wherein the at least one indices correspond to the plurality of index segments; and
transferring the at least one indices to a search engine from the distributed file system.

Documents

Application Documents

#	Name	Date
1	SPEC IN.pdf	2018-08-11
2	FORM 5.pdf	2018-08-11
3	FORM 3.pdf	2018-08-11
4	FIGURES IN.pdf	2018-08-11
5	ABSTRACT1.jpg	2018-08-11
6	872-MUM-2014-Power of Attorney-200115.pdf	2018-08-11
7	872-MUM-2014-FORM 18.pdf	2018-08-11
8	872-MUM-2014-FORM 1(15-9-2014).pdf	2018-08-11
9	872-MUM-2014-Correspondence-200115.pdf	2018-08-11
10	872-MUM-2014-CORRESPONDENCE(15-9-2014).pdf	2018-08-11
11	872-MUM-2014-FER.pdf	2019-10-31
12	872-MUM-2014-OTHERS [27-04-2020(online)].pdf	2020-04-27
13	872-MUM-2014-FER_SER_REPLY [27-04-2020(online)].pdf	2020-04-27
14	872-MUM-2014-CLAIMS [27-04-2020(online)].pdf	2020-04-27
15	872-MUM-2014-US(14)-HearingNotice-(HearingDate-19-05-2022).pdf	2022-04-21
16	872-MUM-2014-Correspondence to notify the Controller [22-04-2022(online)].pdf	2022-04-22
17	872-MUM-2014-FORM-26 [10-05-2022(online)].pdf	2022-05-10
18	872-MUM-2014-FORM-26 [25-05-2022(online)].pdf	2022-05-25
19	872-MUM-2014-Written submissions and relevant documents [02-06-2022(online)].pdf	2022-06-02
20	872-MUM-2014-PatentCertificate30-06-2022.pdf	2022-06-30
21	872-MUM-2014-IntimationOfGrant30-06-2022.pdf	2022-06-30

Search Strategy

1	searchstrategy_10-10-2019.pdf