Method And System For Document Indexing And Retrieval

Abstract: Existing systems for document processing are based on a supervised approach using annotated tags. These systems identify section-based data from unstructured documents without considering the statistical variations in content, which results in highly inaccurate content extraction. The disclosure herein generally relates to document processing and, more particularly, to a method and system for document indexing and retrieval. The system provides a mechanism to correlate unique words in a document with different topics identified in the document, based on a word pattern identified from the document. The correlations are captured in a knowledge graph and can be further used in applications such as, but not limited to, document retrieval.

Patent Information

Application #:
Filing Date: 18 March 2021
Publication Number: 38/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-05-08
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. THAKARE, Shreya Sanjay
Tata Consultancy Services Limited, ODC II Intersil, MIDC Road, SEEPZ, Andheri East, Mumbai - 400096, Maharashtra, India
2. TRIPATHY, Saswati Soumya
Tata Consultancy Services Limited, Gopalan Enterprises Pvt Ltd (Global Axis) SEZ "H" Block,No. 152 (Sy No. 147,157 & 158), Hoody Village, EPIP Zone, II Stage, Whitefield, K. R. Puram, Bangalore - 560066, Karnataka, India
3. SHAH, Pranav Champaklal
Tata Consultancy Services Limited, ODC II Intersil, MIDC Road, SEEPZ, Andheri East, Mumbai - 400096, Maharashtra, India
4. POOJARY, Sudhakara Deva
Tata Consultancy Services Limited, ODC II Intersil, MIDC Road, SEEPZ, Andheri East, Mumbai - 400096, Maharashtra, India
5. RANA, Rahul
Tata Consultancy Services Limited, ODC II Intersil, MIDC Road, SEEPZ, Andheri East, Mumbai - 400096, Maharashtra, India
6. PATEL, Hemil
Tata Consultancy Services Limited, Plot No. 2, 3, Rajiv Gandhi Infotech Park, Phase III, Hinjawadi-Maan, Pune - 411057, Maharashtra, India
7. ANSARI, Saad
Tata Consultancy Services Limited, ODC II Intersil, MIDC Road, SEEPZ, Andheri East, Mumbai - 400096, Maharashtra, India

Specification

Claims:

1. A processor implemented method (200) of document processing, comprising: collecting (202) a document as input, via one or more hardware processors; pre-processing (204) the document, via the one or more hardware processors, to generate a pre-processed document; identifying (206) one or more topics in the pre-processed document; identifying (208) a plurality of unique words in the pre-processed document; identifying (210) a plurality of phrases and word patterns in the pre-processed document; correlating (212) each of the plurality of the unique words to corresponding at least one topic, based on the identified word patterns; and building (214) a knowledge graph using the correlation of the plurality of the unique words with the corresponding at least one topic.

2. The processor implemented method as claimed in claim 1, wherein pre-processing the document comprises: determining (302) range of characters in the document; dividing (304) text in the document at a granular level, based on the determined range of characters; and converting (306) the text in the document to one of a structured format and a hierarchical format.

3. The processor implemented method as claimed in claim 1, wherein document extraction, performed using the knowledge graph, comprising: receiving (402) a user query for at least one document, wherein the user query comprises at least one keyword; comparing (404) the at least one keyword with the knowledge graph to identify at least one match; and extracting (406) at least one document based on the at least one match.

4. A system for document processing, comprising: a memory (102) storing instructions; one or more communication interfaces (106); and one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to: collect a document as input; pre-process the document to generate a pre-processed document; identify one or more topics in the pre-processed document; identify a plurality of unique words in the pre-processed document; identify a plurality of phrases and word patterns in the pre-processed document; correlate each of the plurality of the unique words to corresponding at least one topic, based on the identified word patterns; and build a knowledge graph using the correlation of the plurality of the unique words with the corresponding at least one topic.

5. The system as claimed in claim 4, wherein the system pre-processes the document by: determining range of characters in the document; dividing text in the document at a granular level, based on the determined range of characters; and converting the text in the document to one of a structured format and a hierarchical format.

6. The system as claimed in claim 4, wherein the system performs a document extraction using the knowledge graph, by: receiving a user query for at least one document, wherein the user query comprises at least one keyword; comparing the at least one keyword with the knowledge graph to identify at least one match; and extracting at least one document based on the at least one match.
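As an illustration only, the flow recited in claims 1 and 3 (correlate unique words with topics, store the correlations in a knowledge graph, then match query keywords against the graph) might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: topics are taken as already identified, a word is treated as correlated with a topic simply because it occurs in that topic's text, and the names `STOP_WORDS`, `unique_words`, `build_knowledge_graph`, and `retrieve` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "by"}


def unique_words(text):
    """Return the set of lower-cased words in text, minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def build_knowledge_graph(doc_id, topic_sections, graph=None):
    """Link each unique word to the topics (and documents) it occurs under.

    topic_sections maps a topic label to the text associated with that topic.
    Assumption: occurrence of a word in a topic's text stands in for the
    word-pattern-based correlation described in the source.
    """
    if graph is None:
        graph = defaultdict(lambda: defaultdict(set))
    for topic, text in topic_sections.items():
        for word in unique_words(text):
            graph[word][topic].add(doc_id)
    return graph


def retrieve(graph, query):
    """Return ids of documents matched by at least one keyword in the query."""
    matches = set()
    for keyword in unique_words(query):
        # .get avoids creating empty entries in the defaultdict for unknown keywords.
        for doc_ids in graph.get(keyword, {}).values():
            matches |= doc_ids
    return matches


graph = build_knowledge_graph(
    "doc-1",
    {"refunds": "Refund requests are processed in five days",
     "shipping": "Orders ship by courier"},
)
build_knowledge_graph("doc-2", {"shipping": "International shipping takes longer"}, graph)
print(sorted(retrieve(graph, "shipping courier")))
```

Storing the graph as word, then topic, then document ids keeps the keyword comparison of step 404 to a dictionary lookup; a graph database or a weighted edge scheme would be equally plausible choices.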
Description:

FORM 2
THE PATENTS ACT, 1970 (39 of 1970) & THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)

Title of invention: METHOD AND SYSTEM FOR DOCUMENT INDEXING AND RETRIEVAL

Applicant: Tata Consultancy Services Limited, a company incorporated in India under the Companies Act, 1956, having address: Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD

The disclosure herein generally relates to document processing and, more particularly, to a method and system for document indexing and retrieval.

BACKGROUND

Document indexing and retrieval is a major requirement in any industry or domain in which large volumes of data need to be handled. For example, organizations in different business domains that provide customer support to their users are required to handle customer data as well as organizational data. Employees at call centers, research centers, and product companies have to perform the tedious task of scanning huge amounts of data to answer customer queries. This is true for different industries such as, but not limited to, E-commerce, Education, Pharma, Tourism, and IT.

Existing systems for document processing are based on a supervised approach using annotated tags, which comes with conditions such as, but not limited to, uniform and predefined text parameters like font size and font style. Such systems identify section-based data from unstructured documents without considering the statistical variations in content, which results in highly inaccurate content extraction.

SUMMARY

Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
For example, in one embodiment, a processor implemented method of document processing is provided. In this method, a document is initially collected as input, via one or more hardware processors. The document is then pre-processed, via the one or more hardware processors, to generate a pre-processed document. Further, one or more topics in the pre-processed document are identified. Further, a plurality of unique words in the pre-processed document are identified. Further, a plurality of phrases and word patterns in the pre-processed document are identified. Each of the plurality of unique words is then correlated to at least one corresponding topic, based on the determined word patterns. Finally, a knowledge graph is built using the correlation of the plurality of unique words with the corresponding at least one topic.

In another aspect, a system for document processing is provided. The system includes a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces. The one or more hardware processors are configured by the instructions to initially collect a document as input. The system then pre-processes the document to generate a pre-processed document. Further, the system identifies one or more topics in the pre-processed document. Further, the system identifies a plurality of unique words in the pre-processed document. Further, the system identifies a plurality of phrases and word patterns in the pre-processed document. Each of the plurality of unique words is then correlated to at least one corresponding topic, based on the determined word patterns. Finally, a knowledge graph is built using the correlation of the plurality of unique words with the corresponding at least one topic.

In yet another aspect, a non-transitory computer readable medium for document processing is provided.
The non-transitory computer readable medium contains a plurality of instructions which, when executed, cause the document processing via the following steps. In this process, a document is initially collected as input, via one or more hardware processors. The document is then pre-processed, via the one or more hardware processors, to generate a pre-processed document. Further, one or more topics in the pre-processed document are identified. Further, a plurality of unique words in the pre-processed document are identified. Further, a plurality of phrases and word patterns in the pre-processed document are identified. Each of the plurality of unique words is then correlated to at least one corresponding topic, based on the determined word patterns. Finally, a knowledge graph is built using the correlation of the plurality of unique words with the corresponding at least one topic.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:

FIG. 1 illustrates an exemplary system for document processing, according to some embodiments of the present disclosure.
FIG. 2 is a flow diagram depicting steps in the method of document processing, by the system of FIG. 1, according to some embodiments of the present disclosure.
FIG. 3 is a flow diagram depicting steps in the method of pre-processing the document, by the system of FIG. 1, according to some embodiments of the present disclosure.
FIG. 4 is a flow diagram depicting steps in the method of document retrieval, by the system of FIG. 1, according to some embodiments of the present disclosure.
FIG. 5 is an example implementation of the system of FIG. 1, according to some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.

Referring now to the drawings, and more particularly to FIG. 1 through FIG. 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.

FIG. 1 illustrates an exemplary system for document processing, according to some embodiments of the present disclosure. The system 100 includes one or more hardware processors 102, communication interface(s) or input/output (I/O) interface(s) 103, and one or more data storage devices or memory 101 operatively coupled to the one or more hardware processors 102. The one or more hardware processors 102 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory.
In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.

The communication interface(s) 103 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the communication interface(s) 103 can include one or more ports for connecting a number of devices to one another or to another server.

The memory 101 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, one or more components (not shown) of the system 100 can be stored in the memory 101. The memory 101 is configured to store a plurality of operational instructions (or ‘instructions’) which when executed cause one or more of the hardware processor(s) 102 to perform various actions associated with the document processing being performed by the system 100. Various steps involved in the process of document processing being performed by the system 100 of FIG. 1 are depicted in FIG. 2 through FIG. 5, and are explained with reference to the hardware components depicted in FIG. 1.

FIG. 2 is a flow diagram depicting steps in the method of document processing, by the system of FIG. 1, according to some embodiments of the present disclosure.
In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the additional steps of the flow diagrams depicted in FIG. 3 and FIG. 4. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.

At step 202 of the method 200, the system 100 collects a document as input. The document may be in any format, for example, pdf, pptx, docx, txt, and so on. In various embodiments, the document may be fed to the system 100 using a suitable interface provided, or the system 100 may be configured to automatically fetch the document from a source that is connected to the system 100 via a suitable interface.

At step 204, the system 100 pre-processes the document to generate a pre-processed document. By pre-processing the document, the system 100 converts the document to a format that can be further processed for indexing. Various steps involved in the process of pre-processing the document are depicted in method 300 in FIG. 3. At step 302, the system 100 determines a range of characters in the document.
At this step, the system 100 crawls over the document using suitable crawling technique(s) and extracts values of various parameters such as, but not limited to, size, capitalized words, title words, and style (for example, bold or normal). Based on the extracted values of the different parameters, the system 100 plots a distribution graph, and from the distribution graph, a range of character distribution is determined. The system 100 then, at step 304, divides the text in the document at granular levels, based on the determined range of characters. Further, at step 306, the system 100 converts the text in the document to one of a structured format and a hierarchical format, using an appropriate data processing mechanism.

In addition, the pre-processing of the document may also involve a) identifying relevant content from the document by scanning the document, b) creating a normal distribution over the determined range of characters, and c) eliminating irrelevant sections in the document.

Identifying the relevant sections in the document involves the following steps. The system 100 normalizes a mean distribution of the document and takes a mean value as reference for calculating an overall threshold. The overall threshold represents the minimum count of any of the parameters (such as, but not limited to, capitalized words, title words, and style such as bold or normal) that is required in a section of the document for the section to be considered relevant by the system 100. If the count of the parameters being considered exceeds the overall threshold for a section, the system 100 considers that section relevant; otherwise, the section is considered irrelevant. By comparing the overall threshold with the parameter counts in this way, the system 100 classifies the different sections or portions of the document as relevant or irrelevant.
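The threshold test described above can be sketched as follows. This is only a rough illustration under stated assumptions: the input is plain text, so style attributes such as bold are not available and only capitalized and title-cased words are counted; the mean count across sections is used directly as the overall threshold, whereas the source says only that the mean serves as the reference for calculating it. The helper names `emphasis_count` and `split_relevant` are hypothetical.

```python
from statistics import mean


def emphasis_count(section):
    """Count words in a section that are fully capitalized or title-cased."""
    return sum(1 for w in section.split() if w.isupper() or w.istitle())


def split_relevant(sections):
    """Partition sections into (relevant, irrelevant) around a mean-based threshold."""
    if not sections:
        return [], [], 0
    counts = [emphasis_count(s) for s in sections]
    # Assumption: the mean count itself is taken as the overall threshold.
    threshold = mean(counts)
    relevant = [s for s, c in zip(sections, counts) if c > threshold]
    irrelevant = [s for s, c in zip(sections, counts) if c <= threshold]
    return relevant, irrelevant, threshold


sections = [
    "page 3 of 12",
    "Refund Policy: Customers May Return Items Within Thirty Days",
    "see also section two",
]
relevant, irrelevant, threshold = split_relevant(sections)
print(relevant)
```

With real documents the counts would come from the crawled layout attributes rather than from string methods, and the threshold would likely be tuned against the plotted distribution.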
For example, the document may contain a header, footer, index page and so on, which do not contain any parameter of the mentioned types, and hence the parameter count could be less than the overall threshold. The system 100 may therefore determine the header, footer, index page and so on to be irrelevant sections and eliminate them. For paragraphs in the document, however, the parameter count may exceed the overall threshold, and hence the system 100 determines the paragraphs to be relevant sections.

A statistical approach that may be used by the system 100 for identifying the relevant contents, and in turn the relevant sections, is explained below.

The document d is divided into T blocks/sections. Let the number of title words in the i-th block be nc_i, where i ranges from 1 to T. The values of the various parameters/characteristics are extracted at this stage, and these values are used to plot a distribution graph, which is further used to determine the range of character distribution. The ratio to maximum size (RMED) is a value indicative of the category of each section in the document, i.e. whether the section is a heading, sub-heading, paragraph, header, footer, etc., and is defined as:

$\mathrm{RMED} = \dfrac{fs_i}{\max(fs_1, fs_2, \ldots, fs_i, \ldots, fs_T)}$

where fs_i refers to the font size of the i-th block and the denominator is the maximum font size in the document.

The system 100 further checks whether RMED and a percentage capital count (pcc) are >= a threshold value, where the threshold value is automatically calculated based on the highest character size in the document. For each section in the document, the pcc value represents the percentage of the capitalized count relative to the total number of words in the section. The capitalized count is measured in terms of the number of title words, block words, and capitalized words in the section. If the aforementioned condition is true, the i-th block qualifies as a heading; otherwise, the i-th block is determined to be a paragraph.
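A minimal sketch of this heading test is given below, under several assumptions that are not in the source: each block carries its text and a single font size, pcc is expressed as a fraction between 0 and 1, and the thresholds are fixed illustrative parameters (`rmed_threshold`, `pcc_threshold`) rather than being derived automatically from the highest character size. The names `Block`, `capital_fraction`, and `classify_blocks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_size: float


def capital_fraction(text):
    """Fraction of words that are fully capitalized or title-cased (pcc as 0..1)."""
    words = text.split()
    if not words:
        return 0.0
    capitalized = sum(1 for w in words if w.isupper() or w.istitle())
    return capitalized / len(words)


def classify_blocks(blocks, rmed_threshold=0.8, pcc_threshold=0.5):
    """Label each block 'heading' or 'paragraph' using RMED and pcc.

    Both thresholds are illustrative assumptions, not values from the source.
    """
    if not blocks:
        return []
    max_size = max(b.font_size for b in blocks)
    labels = []
    for b in blocks:
        rmed = b.font_size / max_size  # ratio of block font size to maximum font size
        pcc = capital_fraction(b.text)
        is_heading = rmed >= rmed_threshold and pcc >= pcc_threshold
        labels.append("heading" if is_heading else "paragraph")
    return labels


blocks = [
    Block("Refund Policy", 18),
    Block("Customers may return items within thirty days.", 11),
]
print(classify_blocks(blocks))
```

Requiring both conditions keeps large-font body text and small all-caps notices out of the heading class; other combinations of the two signals are of course possible.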
If the i-th block qualifies as a heading, a further check is made: if fs_i is approximately equal to the maximum font size, the i-th block is a heading; otherwise, it is a subheading.

The system 100 then performs pattern recognition to eliminate the index/table of contents. In various embodiments, the system 100 may perform the pattern recognition by considering all pages in the document at once, or based on contents from a certain number (n) of pages, wherein the value of n may be pre-configured with the system 100. In the pages being considered, the system 100 identifies the frequency of numeric data and non-numeric data and determines their pattern of occurrences. Based on the pattern of occurrences, the relevant contents are identified. The system 100 may then eliminate the irrelevant sections in the document, such that only the relevant sections are included in the pre-processed document that is to be processed in subsequent steps.

Further, at step 206, the system 100 identifies one or more topics in the pre-processed document that contains the relevant sections. The system 100 may use a stochastic process to identify the topics. The system 100 calculates the number of topics (T) as: T={¦(vN,ifvN

Documents

Application Documents

# Name Date
1 202121011653-STATEMENT OF UNDERTAKING (FORM 3) [18-03-2021(online)].pdf 2021-03-18
2 202121011653-REQUEST FOR EXAMINATION (FORM-18) [18-03-2021(online)].pdf 2021-03-18
3 202121011653-FORM 18 [18-03-2021(online)].pdf 2021-03-18
4 202121011653-FORM 1 [18-03-2021(online)].pdf 2021-03-18
5 202121011653-FIGURE OF ABSTRACT [18-03-2021(online)].jpg 2021-03-18
6 202121011653-DRAWINGS [18-03-2021(online)].pdf 2021-03-18
7 202121011653-DECLARATION OF INVENTORSHIP (FORM 5) [18-03-2021(online)].pdf 2021-03-18
8 202121011653-COMPLETE SPECIFICATION [18-03-2021(online)].pdf 2021-03-18
9 202121011653-Proof of Right [24-06-2021(online)].pdf 2021-06-24
10 202121011653-FORM-26 [14-10-2021(online)].pdf 2021-10-14
11 Abstract1.jpg 2022-02-22
12 202121011653-Request Letter-Correspondence [07-04-2022(online)].pdf 2022-04-07
13 202121011653-Power of Attorney [07-04-2022(online)].pdf 2022-04-07
14 202121011653-Form 1 (Submitted on date of filing) [07-04-2022(online)].pdf 2022-04-07
15 202121011653-Covering Letter [07-04-2022(online)].pdf 2022-04-07
16 202121011653-CERTIFIED COPIES TRANSMISSION TO IB [07-04-2022(online)].pdf 2022-04-07
17 202121011653 CORRESPONDANCE (IPO) WIPO DAS 12-04-2022.pdf 2022-04-12
18 202121011653-FORM 3 [12-07-2022(online)].pdf 2022-07-12
19 202121011653-FER.pdf 2022-10-18
20 202121011653-OTHERS [04-01-2023(online)].pdf 2023-01-04
21 202121011653-FER_SER_REPLY [04-01-2023(online)].pdf 2023-01-04
22 202121011653-COMPLETE SPECIFICATION [04-01-2023(online)].pdf 2023-01-04
23 202121011653-CLAIMS [04-01-2023(online)].pdf 2023-01-04
24 202121011653-US(14)-HearingNotice-(HearingDate-14-03-2024).pdf 2024-02-23
25 202121011653-FORM-26 [07-03-2024(online)].pdf 2024-03-07
26 202121011653-FORM-26 [07-03-2024(online)]-1.pdf 2024-03-07
27 202121011653-Correspondence to notify the Controller [07-03-2024(online)].pdf 2024-03-07
28 202121011653-Written submissions and relevant documents [28-03-2024(online)].pdf 2024-03-28
29 202121011653-PatentCertificate08-05-2024.pdf 2024-05-08
30 202121011653-IntimationOfGrant08-05-2024.pdf 2024-05-08

Search Strategy

1 dellarocca2017E_18-10-2022.pdf

ERegister / Renewals

3rd: 08 Jul 2024

From 18/03/2023 - To 18/03/2024

4th: 08 Jul 2024

From 18/03/2024 - To 18/03/2025

5th: 05 Mar 2025

From 18/03/2025 - To 18/03/2026