Abstract: This disclosure relates to a method, device, and system for clustering document objects based on information content. The method may include identifying a plurality of object chunks from at least one document based on semantic context of each of the plurality of object chunks, determining at least one document portion from the at least one document as a base document based on a plurality of parameters applied to the plurality of object chunks, determining a plurality of hierarchies within the base document, and categorizing the plurality of object chunks based on the plurality of hierarchies and information in each of the plurality of object chunks. It should be noted that each of the plurality of object chunks may include at least one object selected from the at least one document. Figure 2
Claims:
WE CLAIM:
1. A method of clustering document objects based on information content, the method comprising:
identifying, by a document clustering device, a plurality of object chunks from at least one document based on semantic context of each of the plurality of object chunks, wherein each of the plurality of object chunks comprises at least one object selected from the at least one document;
determining, by the document clustering device, at least one document portion from the at least one document as a base document, based on a plurality of parameters applied to the plurality of object chunks;
determining, by the document clustering device, a plurality of hierarchies within the base document; and
categorizing, by the document clustering device, the plurality of object chunks based on the plurality of hierarchies and information in each of the plurality of object chunks.
2. The method of claim 1, wherein each of the at least one object comprises at least one of text, an image, a figure, a table, or a graph.
3. The method of claim 1, wherein identifying an object chunk from the plurality of object chunks comprises:
summarizing a paragraph within a document from the at least one document;
iteratively adding at least one sentence to the paragraph;
iteratively computing a summary quotient based on length of sentences within the paragraph and length of the at least one sentence added in a current iteration; and
iteratively comparing the summary quotient with a predefined threshold.
4. The method of claim 3, further comprising demarcating the object chunk in a current iteration, when the summary quotient in the current iteration exceeds the predefined threshold, wherein the demarcated object chunk excludes the at least one sentence added in the current iteration.
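By way of non-limiting illustration, the iterative demarcation recited in claims 3 and 4 may be sketched as follows. The claims do not fix a formula for the summary quotient or name a summarizer, so both are assumptions here: `toy_summary` is a hypothetical stand-in that keeps the distinct words longer than three letters, and `summary_quotient` is a hypothetical ratio of summary growth to the length of the added sentence. The threshold value of 0.5 is likewise chosen only for the example.

```python
import re


def toy_summary(sentences):
    """Hypothetical stand-in summarizer: distinct words longer than three letters."""
    words = re.findall(r"[a-z]+", " ".join(sentences).lower())
    return {w for w in words if len(w) > 3}


def summary_quotient(paragraph, added):
    """Assumed quotient: growth of the summary per word of the added sentence."""
    growth = len(toy_summary(paragraph + [added])) - len(toy_summary(paragraph))
    added_length = max(1, len(added.split()))
    return growth / added_length


def demarcate_chunks(sentences, threshold=0.5):
    """Grow a paragraph sentence by sentence and close the chunk when the
    quotient exceeds the threshold; the sentence that triggered the boundary
    is excluded from the closed chunk and starts the next one."""
    chunks, current = [], []
    for sentence in sentences:
        if current and summary_quotient(current, sentence) > threshold:
            chunks.append(current)
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "The pump moves coolant through the engine.",
    "The pump also moves coolant through the radiator.",
    "Invoices require manager approval before payment.",
]
chunks = demarcate_chunks(sentences)
```

In this example the second sentence adds little new vocabulary and stays in the first chunk, while the third sentence introduces an unrelated topic and begins a new chunk.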
5. The method of claim 1, wherein determining the at least one document portion as the base document comprises:
determining the plurality of parameters for each document portion in a plurality of document portions within the at least one document, wherein the plurality of document portions comprise the at least one document portion;
computing, for each document portion, a weighted sum of the plurality of parameters in response to determining the plurality of parameters for each document portion; and
selecting the at least one document portion as the base document in response to computing the weighted sum for each document portion, wherein the at least one document portion has the highest weighted sum.
6. The method of claim 1, wherein the plurality of parameters comprises at least one of: number of object chunks in each document portion, number of object chunks in each document portion that are common with remaining document portions in the plurality of document portions, number of object chunks in each document portion that overlap with one or more of the remaining document portions, or number of documents from the at least one document that each document portion overlaps.
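By way of non-limiting illustration, the base document selection of claims 5 and 6 may be sketched as a weighted sum over the four recited parameters. The weights, the `PortionParams` record, and the sample counts are hypothetical; the claims do not specify particular weight values.

```python
from dataclasses import dataclass


@dataclass
class PortionParams:
    """Per-portion parameters named in claim 6 (field names are illustrative)."""
    name: str
    chunk_count: int            # object chunks in this portion
    common_chunks: int          # chunks common with the remaining portions
    overlapping_chunks: int     # chunks overlapping other portions
    documents_overlapped: int   # documents this portion overlaps


# Hypothetical weights, one per parameter, in the field order above.
WEIGHTS = (0.4, 0.3, 0.2, 0.1)


def weighted_sum(portion, weights=WEIGHTS):
    values = (
        portion.chunk_count,
        portion.common_chunks,
        portion.overlapping_chunks,
        portion.documents_overlapped,
    )
    return sum(w * v for w, v in zip(weights, values))


def select_base(portions, weights=WEIGHTS):
    """Return the portion with the highest weighted sum."""
    return max(portions, key=lambda p: weighted_sum(p, weights))


portions = [
    PortionParams("A", chunk_count=10, common_chunks=2,
                  overlapping_chunks=1, documents_overlapped=1),
    PortionParams("B", chunk_count=8, common_chunks=6,
                  overlapping_chunks=4, documents_overlapped=3),
]
base = select_base(portions)
```

With these sample values, portion B is selected even though portion A holds more chunks, because B shares more content with the other portions.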
7. The method of claim 1, wherein categorizing an object chunk from the plurality of object chunks comprises:
creating an index for the object chunk based on iterative summarization of the object chunk; and
extracting information context from the object chunk based on frequency of occurrence of each term in the object chunk and total number of terms in the object chunk.
8. The method of claim 7, wherein iterative summarization is performed to reduce a summary of the object chunk to a predefined number of words.
9. The method of claim 7, wherein the object chunk is categorized in a hierarchy from the plurality of hierarchies based on similarity of the index and the information context with the hierarchy.
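By way of non-limiting illustration, the categorization of claims 7 to 9 may be sketched as follows. The information context is computed as the frequency of each term divided by the total number of terms, as recited in claim 7. The iterative summarization is replaced by a hypothetical toy procedure that keeps the more frequent half of the remaining terms on each pass until a predefined number of words remains, and the similarity score, stop-word list, and hierarchy labels are assumptions made only for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list used only by this sketch.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def term_frequencies(text):
    """Information context: occurrences of each term over the total number of terms."""
    terms = tokenize(text)
    total = len(terms)
    return {t: c / total for t, c in Counter(terms).items()} if total else {}


def build_index(text, max_words=3):
    """Toy iterative summarization: each pass keeps the more frequent half
    of the remaining terms until at most max_words remain."""
    freqs = term_frequencies(text)
    summary = sorted(freqs, key=lambda t: (-freqs[t], t))
    while len(summary) > max_words:
        summary = summary[:max(max_words, len(summary) // 2)]
    return summary


def categorize(text, hierarchies, max_words=3):
    """Assign the chunk to the hierarchy label most similar to its index and context."""
    index = set(build_index(text, max_words))
    context = term_frequencies(text)

    def score(label):
        label_terms = set(tokenize(label))
        return len(index & label_terms) + sum(context.get(t, 0.0) for t in label_terms)

    return max(hierarchies, key=score)


chunk = ("Coolant pump maintenance: inspect the pump seal "
         "and replace the pump coolant filter.")
hierarchies = ["pump maintenance", "billing invoices"]
index = build_index(chunk)
category = categorize(chunk, hierarchies)
```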
10. The method of claim 1 further comprising receiving a user query, wherein the user query comprises at least one of a textual query and a vocal query.
11. The method of claim 10 further comprising:
extracting keywords from the user query to determine a context of the user query;
comparing the extracted keywords with each hierarchy in the plurality of hierarchies to identify a hierarchy matching the extracted keywords;
retrieving at least one object chunk from a set of chunks categorized within the matching hierarchy; and
presenting the at least one object chunk to a user generating the user query.
12. The method of claim 11, wherein the at least one object chunk is retrieved based on history associated with the user.
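By way of non-limiting illustration, the query handling of claims 10 to 12 may be sketched as follows. The sketch assumes that a vocal query has already been transcribed to text, that hierarchies are identified by short text labels, and that user history is a list of previously viewed chunk identifiers; the keyword-overlap matching, the stop-word list, and all names are hypothetical.

```python
import re

# Hypothetical stop-word list used only by this sketch.
STOPWORDS = {"a", "an", "and", "do", "how", "i", "in", "of", "the", "to"}


def extract_keywords(query):
    return {w for w in re.findall(r"[a-z]+", query.lower()) if w not in STOPWORDS}


def match_hierarchy(keywords, hierarchies):
    """Return the hierarchy label sharing the most keywords, or None if none match."""
    if not hierarchies:
        return None

    def overlap(label):
        return len(keywords & set(re.findall(r"[a-z]+", label.lower())))

    best = max(hierarchies, key=overlap)
    return best if overlap(best) > 0 else None


def retrieve(query, categorized, history=(), limit=1):
    """categorized maps a hierarchy label to the chunk identifiers filed under it.
    Chunks found in the user's history are ranked first."""
    label = match_hierarchy(extract_keywords(query), list(categorized))
    if label is None:
        return []
    seen = set(history)
    ranked = sorted(categorized[label], key=lambda chunk_id: chunk_id not in seen)
    return ranked[:limit]


categorized = {
    "pump maintenance": ["chunk-1", "chunk-2"],
    "billing invoices": ["chunk-3"],
}
result = retrieve("How do I service the pump?", categorized, history=["chunk-2"])
```

Because the sort is stable, chunks outside the user's history keep their original order after the previously viewed ones.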
13. A system for clustering document objects based on information content, the system comprising:
a document clustering device comprising at least one processor and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
identifying a plurality of object chunks from at least one document based on semantic context of each of the plurality of object chunks, wherein each of the plurality of object chunks comprises at least one object selected from the at least one document;
determining at least one document portion from the at least one document as a base document, based on a plurality of parameters applied to the plurality of object chunks;
determining a plurality of hierarchies within the base document; and
categorizing the plurality of object chunks based on the plurality of hierarchies and information in each of the plurality of object chunks.
14. The system of claim 13, wherein identifying an object chunk from the plurality of object chunks comprises:
summarizing a paragraph within a document from the at least one document;
iteratively adding at least one sentence to the paragraph;
iteratively computing a summary quotient based on length of sentences within the paragraph and length of the at least one sentence added in a current iteration; and
iteratively comparing the summary quotient with a predefined threshold.
15. The system of claim 14, wherein the operations further comprise demarcating the object chunk in a current iteration, when the summary quotient in the current iteration exceeds the predefined threshold, wherein the demarcated object chunk excludes the at least one sentence added in the current iteration.
16. The system of claim 13, wherein determining the at least one document portion as the base document comprises:
determining the plurality of parameters for each document portion in a plurality of document portions within the at least one document, wherein the plurality of document portions comprise the at least one document portion;
computing, for each document portion, a weighted sum of the plurality of parameters in response to determining the plurality of parameters for each document portion; and
selecting the at least one document portion as the base document in response to computing the weighted sum for each document portion, wherein the at least one document portion has the highest weighted sum.
17. The system of claim 13, wherein categorizing an object chunk from the plurality of object chunks comprises:
creating an index for the object chunk based on iterative summarization of the object chunk; and
extracting information context from the object chunk based on frequency of occurrence of each term in the object chunk and total number of terms in the object chunk.
18. The system of claim 17, wherein iterative summarization is performed to reduce a summary of the object chunk to a predefined number of words, and wherein the object chunk is categorized in a hierarchy from the plurality of hierarchies based on similarity of the index and the information context with the hierarchy.
19. The system of claim 13, wherein the operations further comprise:
receiving a user query;
extracting keywords from the user query to determine a context of the user query;
comparing the extracted keywords with each hierarchy in the plurality of hierarchies to identify a hierarchy matching the extracted keywords;
retrieving at least one object chunk from a set of chunks categorized within the matching hierarchy, wherein the at least one object chunk is retrieved based on history associated with the user; and
presenting the at least one object chunk to a user generating the user query.
| # | Name | Date |
|---|---|---|
| 1 | 201841045339-STATEMENT OF UNDERTAKING (FORM 3) [30-11-2018(online)].pdf | 2018-11-30 |
| 2 | 201841045339-REQUEST FOR EXAMINATION (FORM-18) [30-11-2018(online)].pdf | 2018-11-30 |
| 3 | 201841045339-POWER OF AUTHORITY [30-11-2018(online)].pdf | 2018-11-30 |
| 4 | 201841045339-FORM 18 [30-11-2018(online)].pdf | 2018-11-30 |
| 5 | 201841045339-FORM 1 [30-11-2018(online)].pdf | 2018-11-30 |
| 6 | 201841045339-FIGURE OF ABSTRACT [30-11-2018].jpg | 2018-11-30 |
| 7 | 201841045339-DRAWINGS [30-11-2018(online)].pdf | 2018-11-30 |
| 8 | 201841045339-DECLARATION OF INVENTORSHIP (FORM 5) [30-11-2018(online)].pdf | 2018-11-30 |
| 9 | 201841045339-COMPLETE SPECIFICATION [30-11-2018(online)].pdf | 2018-11-30 |
| 10 | 201841045339-Request Letter-Correspondence [11-12-2018(online)].pdf | 2018-12-11 |
| 11 | 201841045339-Power of Attorney [11-12-2018(online)].pdf | 2018-12-11 |
| 12 | 201841045339-Form 1 (Submitted on date of filing) [11-12-2018(online)].pdf | 2018-12-11 |
| 13 | 201841045339-Proof of Right (MANDATORY) [09-05-2019(online)].pdf | 2019-05-09 |
| 14 | Correspondence by Agent_Proof of Right_15-05-2019.pdf | 2019-05-15 |
| 15 | 201841045339-PETITION UNDER RULE 137 [01-10-2021(online)].pdf | 2021-10-01 |
| 16 | 201841045339-OTHERS [01-10-2021(online)].pdf | 2021-10-01 |
| 17 | 201841045339-Information under section 8(2) [01-10-2021(online)].pdf | 2021-10-01 |
| 18 | 201841045339-FORM-26 [01-10-2021(online)].pdf | 2021-10-01 |
| 19 | 201841045339-FORM 3 [01-10-2021(online)].pdf | 2021-10-01 |
| 20 | 201841045339-FER_SER_REPLY [01-10-2021(online)].pdf | 2021-10-01 |
| 21 | 201841045339-CORRESPONDENCE [01-10-2021(online)].pdf | 2021-10-01 |
| 22 | 201841045339-CLAIMS [01-10-2021(online)].pdf | 2021-10-01 |
| 23 | 201841045339-FER.pdf | 2021-10-17 |
| 24 | 201841045339-PatentCertificate17-10-2023.pdf | 2023-10-17 |
| 25 | 201841045339-IntimationOfGrant17-10-2023.pdf | 2023-10-17 |
| 26 | 201841045339-PROOF OF ALTERATION [11-01-2024(online)].pdf | 2024-01-11 |
| 1 | SearchStrategy45339E_23-02-2021.pdf | |