Abstract: A hybrid system (100) and method (400) for finding document similarity are disclosed. The system (100) includes a controller (112) configured to receive an original document (102) and a test document (104) from a user; extract text features and image features from the original document (102) and the test document (104); match the similarity between the text features and image features of the original document (102) and the test document (104); score the text feature similarity of the original document (102) with the test document (104); score the image feature dissimilarity of the original document (102) with the test document (104); and validate the test document (104) as fake or original based on the scoring of the text features and the image features of the original document (102) with the test document (104). FIG. 1
Description: HYBRID SYSTEM AND METHOD FOR FINDING DOCUMENT SIMILARITY
BACKGROUND
Technical Field
[0001] The embodiment herein generally relates to document processing systems and more particularly, to a hybrid system and method for finding document similarity.
Description of the Related Art
[0002] Generally, determining how similar two documents are goes beyond simple text matching, as the two documents can be identical or near-identical copies. Further, one document may be a modified version of the other document, with changes in formatting, order of information, or minor textual alterations. The two documents may share significant overlap in content; for example, two documents may discuss similar topics or ideas even if the exact wording is different.
[0003] Furthermore, the two documents may show a semantic gap. For example, matching two documents based solely on exact word matches can miss subtle similarities in meaning. It is also important to consider structural differences; for example, documents can have different layouts, formatting, and organisational structures, making direct comparison difficult.
[0004] Further, existing systems do not handle noise efficiently. For example, such systems struggle with typos, OCR errors, and other imperfections in the text. Also, the efficiency of such systems reduces when comparing large volumes of documents.
[0005] Another technical challenge of existing systems is that identifying fraudulent documents, such as forged documents, manipulated documents, and plagiarised content, is difficult. Furthermore, fraudsters are constantly developing new techniques to create convincing forgeries. Identifying subtle changes in text or visual elements that indicate tampering is also difficult.
[0006] Accordingly, there remains a need for rapid and accurate fraud detection in real-world scenarios using a hybrid system and method for finding document similarity.
SUMMARY
[0007] In view of the foregoing, embodiments herein provide a hybrid system for finding document similarity. A controller connected to a memory and at least one processor is configured to receive an original document and a test document from a user. The controller is further configured to extract text features and image features from the original document and the test document. The controller is further configured to match the similarity between the text features and image features of both the original document, and the test document.
[0008] The controller is further configured to score the text features similarity of the original document with the test document. The controller is further configured to score the image features dissimilarity of the original document with the test document. The controller is further configured to validate the test document as fake or original based on the scoring of the text features and the image features of the original document with the test document.
[0009] In some embodiments, the text features are extracted using a Tesseract OCR.
[00010] In some embodiments, the image features are extracted using a Convolutional Neural Network (CNN).
[00011] In some embodiments, the system further comprises a machine learning module for training the system to learn patterns, identify anomalies, and improve accuracy.
[00012] In some embodiments, the system further comprises Natural Language Processing (NLP) for analyzing text for semantic meaning and relationships.
[00013] In some embodiments, the text similarity scores and image feature dissimilarity scores are combined to create an overall similarity score “S” to assess a degree of similarity between documents.
[00014] In another aspect, the embodiments herein provide a hybrid method for finding document similarity. The method includes receiving, by a controller, an original document and a test document from a user. The method further includes extracting, by the controller, text features and image features from the original document and the test document. The method further includes matching, by the controller, the similarity between the text features and image features of both the original document and the test document.
[00015] The method further includes scoring, by the controller, the text feature similarity of the original document with the test document. The method further includes scoring, by the controller, the image feature dissimilarity of the original document with the test document. The method further includes validating, by the controller, the test document as fake or original based on the scoring of the text features and the image features of the original document with the test document.
[00016] In some embodiments, the method further includes training, by a machine learning module, the system to learn patterns, identify anomalies, and improve accuracy.
[00017] In some embodiments, the method further includes analysing, by a Natural Language Processing (NLP), text for semantic meaning and relationships.
[00018] In some embodiments, the text similarity scores and image feature dissimilarity scores are combined to create an overall similarity score “S” to assess a degree of similarity between documents.
[00019] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
[00020] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
[00021] FIG. 1 illustrates a hybrid system for finding document similarity, according to some embodiments herein;
[00022] FIG. 2 illustrates a plurality of modules stored in a memory, according to some embodiments herein;
[00023] FIG. 3 illustrates an exemplary method for finding document similarity, according to some embodiments herein; and
[00024] FIG. 4 illustrates a flow chart showing a hybrid method for finding document similarity, according to some embodiments herein.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[00025] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[00026] As mentioned, there remains a need for a hybrid system and method for finding document similarity. Referring now to the drawings, and more particularly to FIGs. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
[00027] FIG. 1 illustrates a hybrid system 100 for finding document similarity, according to some embodiments herein. The system 100 includes a memory 108, a processor 110, a controller 112, and a communicator 114.
[00028] The controller 112 connected to the memory 108 and the processor 110 is configured to receive an original document 102 and a test document 104 from a user. The controller 112 is further configured to extract text features and image features from the original document 102 and the test document 104. The controller 112 is further configured to match the similarity between the text features and image features of both the original document 102 and the test document 104.
[00029] The controller 112 is further configured to score the text features similarity of the original document 102 with the test document 104. The controller 112 is further configured to score the image features dissimilarity of the original document 102 with the test document 104. The controller 112 is further configured to validate the test document 104 as fake or original based on the scoring of the text features and the image features of the original document 102 with the test document. In a non-limiting example, the original document 102 and the test document 104 may include documents gathered from various sources, such as legal documents, online articles, and internal company records.
[00030] In some embodiments, the text features are extracted using a Tesseract OCR. The image features are extracted using a Convolutional Neural Network (CNN).
[00031] In some embodiments, the system 100 further includes a machine learning module for training the system to learn patterns, identify anomalies, and improve accuracy. The system 100 further includes Natural Language Processing (NLP) for analyzing text for semantic meaning and relationships. The text similarity scores and image feature dissimilarity scores are combined to create an overall similarity score “S” to assess a degree of similarity between documents. Further, the text similarity score is computed from the text extracted using Tesseract OCR, while the image dissimilarity score is computed from the features extracted using the CNN.
[00032] In some embodiments, the CNN is used to extract features from document images. Starting from fundamental preprocessing steps such as resizing and normalization, the CNN architecture (convolutional, pooling, and fully connected layers) detects and separates key edges, textures, and patterns. These are then converted to normalized vectors to yield a complete set of the visual and textual features for each document. The CNN thereby records the fundamental visual and structural elements present in the documents.
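In a non-limiting illustration, the convolution and pooling stages described above may be sketched as follows. This is a minimal sketch only, not the disclosed network: the fixed Sobel kernel stands in for filters that a trained CNN would learn, the fully connected layers are omitted, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

# Sobel kernel that responds to vertical edges. A trained CNN would learn
# its filters; this fixed kernel is an assumption made for illustration.
VERTICAL_EDGE_KERNEL: Matrix = [
    [-1.0, 0.0, 1.0],
    [-2.0, 0.0, 2.0],
    [-1.0, 0.0, 1.0],
]


def normalize_pixels(image: Matrix) -> Matrix:
    """Scale 8-bit grey levels (0..255) into the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image without padding (cross-correlation)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Keep positive responses and zero out negative ones."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def extract_features(image: Matrix) -> List[float]:
    """Preprocess, convolve, activate, pool, then flatten into a vector."""
    pooled = max_pool(relu(convolve2d(normalize_pixels(image), VERTICAL_EDGE_KERNEL)))
    return [value for row in pooled for value in row]
```

For a 6 x 6 page image whose left half is black and whose right half is white, `extract_features` returns four equal positive responses, whereas a uniformly white page returns only zeros, which reflects the edge-detecting role of the convolutional layer.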
[00033] After feature extraction, the extracted features are normalized so that they can be compared accurately. The normalized features are then used to quantify the difference between the documents and to find anomalies that could represent fraud or irregularities.
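By way of a hedged example, the normalization and difference quantification may be sketched as below. The description does not specify which normalization is used; L2 (unit-length) normalization and the scaling of the distance into the range 0 to 1 are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Rescale a feature vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return [0.0] * len(vector)
    return [x / norm for x in vector]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def image_dissimilarity(original: Sequence[float], test: Sequence[float]) -> float:
    """Distance between unit vectors lies in [0, 2]; halve it to get [0, 1]."""
    return euclidean_distance(l2_normalize(original), l2_normalize(test)) / 2.0
```

Because both vectors are rescaled to unit length first, two feature vectors that differ only in overall magnitude (for example, the same page scanned at different brightness) yield a dissimilarity close to zero.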
[00034] Finally, the synthesized outputs of the text similarity analysis and the image feature analysis are displayed using a MATLAB GUI for the overall analysis of the documents. With this combined tool, the system 100 provides a robust and comprehensive evaluation of document authenticity and identity.
[00035] The system 100 provides enhanced fraud detection capabilities through anomaly detection. The anomaly detection implements algorithms to detect unusual patterns or anomalies in documents, such as sudden changes in writing style, inconsistencies in data, or unexpected transactions. This can help identify potentially fraudulent activities more proactively.
[00036] The system 100 provides real-time monitoring when integrated into real-time systems to monitor document streams and flag suspicious documents as they are created or received. This is crucial for applications like financial transactions, insurance claims, and online fraud prevention.
[00037] The system 100 performs deepfake detection to detect deepfake documents, where images or text are manipulated using AI techniques. This would require incorporating advanced image and text analysis techniques to identify subtle signs of manipulation.
[00038] The system 100 automatically routes documents to the appropriate departments or individuals based on their content and type. This can streamline workflows and improve efficiency in organizations. The system 100 includes a search engine that allows users to retrieve documents based on their content, rather than just keywords. This can improve the accuracy and efficiency of information retrieval.
[00039] The system 100 analyzes the sentiment expressed in documents, such as customer feedback, news articles, or social media posts, to gain insights into public opinion and market trends.
[00040] The text extraction includes extracting key information from documents, such as dates, names, addresses, and financial figures, to populate databases and automate data entry processes. The system 100 builds knowledge graphs from a collection of documents to represent relationships between entities and concepts. This can be used for tasks such as knowledge discovery, question answering, and decision support.
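As a non-limiting illustration of key information extraction, dates and financial figures may be pulled from OCR text with regular expressions, as sketched below. The date format (ISO `YYYY-MM-DD`), the currency codes, and the function name are assumptions chosen for the example; names and addresses generally require named-entity recognition and are not covered by this sketch.

```python
import re
from typing import Dict, List

# Assumed formats: ISO dates and amounts prefixed by a currency code.
# The pattern checks the shape of a date only, not calendar validity.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_PATTERN = re.compile(r"\b(?:USD|EUR|INR)\s?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_key_fields(text: str) -> Dict[str, List[str]]:
    """Return the dates and monetary amounts found in the given text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "amounts": AMOUNT_PATTERN.findall(text),
    }


if __name__ == "__main__":
    sample = "Invoice dated 2025-04-17. Total due: INR 12,500.00, payable by 2025-05-02."
    print(extract_key_fields(sample))
```

The returned dictionary can be mapped directly onto database columns, which is one way the data entry automation described above could be realized.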
[00041] The system 100 uses document analysis to build predictive models, such as predicting customer churn, assessing credit risk, or forecasting market trends. The system 100 integrates the document analysis system with Customer Relationship Management (CRM) systems to improve customer service and sales processes. The system 100 integrates with Enterprise Content Management (ECM) systems to enhance document management, search, and retrieval capabilities.
[00042] The system 100 is integrated with business intelligence platforms to provide insights from document data and support data-driven decision making. The system 100 can easily scale resources up or down based on demand.
[00043] The system 100 can use pay-as-you-go pricing models, which can be more economical for variable workloads. The system 100 can be accessed and managed from anywhere with an internet connection. The system 100 utilizes cloud computing platforms such as AWS, Azure, or Google Cloud for hosting, and leverages cloud-based services for storage, compute power, and machine learning.
[00044] The system 100 utilises technologies such as Apache Spark or Hadoop to distribute data and processing tasks across a cluster of machines. The system 100 uses edge computing for reduced latency and improved privacy.
[00045] The system is deployed on edge devices, such as IoT devices, smartphones, or local servers, to process data locally and send only the necessary information to the cloud. The system 100 includes Blockchain-Based Implementation for Enhanced.
[00046] The system 100 provides access to advanced AI/ML capabilities by leveraging pre-trained models and APIs from platforms such as TensorFlow, PyTorch, and Hugging Face. The system 100 integrates with AI/ML platforms to utilize their advanced capabilities for tasks such as image recognition, natural language processing, and machine learning model training.
[00047] FIG. 2 illustrates a plurality of modules stored in the memory 108, according to some embodiments herein. The plurality of modules includes an analysis module 202, a text similarity scoring module 210, an image unmatched scoring module 212, and a document validation module 214. The analysis module 202 includes a text extraction module 204, an image extraction module 206, and a similarity matching module 208.
[00048] In some embodiments, the text extraction module 204 computes cosine similarity, Jaccard similarity, and Euclidean distance between the original document 102 and the test document 104. The cosine similarity measures the cosine of the angle between two vectors representing the documents. The Jaccard similarity compares the overlap between sets of words in the documents. The Euclidean distance measures the straight-line distance between the two vectors representing the documents.
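In a non-limiting example, the three text measures may be computed on word-count vectors as sketched below. The tokenizer (lower-cased alphanumeric runs) and the function names are assumptions, and the Euclidean distance is computed here in its usual vector sense, as the straight-line distance between the word-count vectors of the two documents.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into lower-cased alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the shared vocabulary divided by the size of the union."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def euclidean_distance(text_a: str, text_b: str) -> float:
    """Straight-line distance between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    vocabulary = counts_a.keys() | counts_b.keys()
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))
```

For the texts "pay the invoice total" and "pay the invoice now", three of the five distinct words are shared, so the Jaccard similarity is 0.6 and the cosine similarity is 0.75, while the Euclidean distance is the square root of 2 because exactly two word counts differ by one.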
[00049] In some embodiments, the image extraction module 206 extracts features such as edges, shapes, textures, and layout information from document images. The text similarity scoring module 210 provides a text similarity score and the image dissimilarity scoring module 212 provides an image feature dissimilarity score. The text similarity score and the image feature dissimilarity score are combined to create an overall similarity score, which can be used to assess the degree of similarity between documents.
[00050] In some embodiments, comparative similarity scores are computed from the textual and visual information of the original document 102 and the test document 104. For text, the Jaccard similarity, which compares the overlap between sets of words in the documents, and the cosine similarity (where a value near 1 indicates very similar text) are assessed. At the same time, visual features captured by the CNN are compared using Euclidean distance and cosine similarity on feature vectors to compute visual similarity. The scores are combined, for example by averaging or weighting, to create a total similarity score between 0 and 1. The aggregate score gives a more nuanced estimate of document similarity, which can be used to authenticate a document, detect plagiarism, or assess content revisions, by capturing the textual and visual integrity of the documents.
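By way of a hedged example, one possible weighted combination is sketched below. The description leaves the combination open (averaging or weighting), so the conversion of image dissimilarity into similarity as one minus the dissimilarity, the default weight of 0.5, and the function name are all assumptions.

```python
def combined_similarity(
    text_similarity: float,
    image_dissimilarity: float,
    text_weight: float = 0.5,
) -> float:
    """Weighted average of text similarity and image similarity, in [0, 1].

    All three inputs are expected in the range [0, 1]. The image term is
    converted from a dissimilarity to a similarity before weighting.
    """
    for name, value in (
        ("text_similarity", text_similarity),
        ("image_dissimilarity", image_dissimilarity),
        ("text_weight", text_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    image_similarity = 1.0 - image_dissimilarity
    return text_weight * text_similarity + (1.0 - text_weight) * image_similarity
```

With the default weight, a text similarity of 0.8 and an image dissimilarity of 0.4 give an overall score of 0.7. Raising `text_weight` makes the overall score track the text comparison more closely, which may suit documents with little visual structure.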
[00051] In some embodiments, for example, Table 1 shows results of the system 100 in detecting similarity and dissimilarity between the original document 102 and the test document 104. In these results, the system 100 provides robust document fraud detection with an accuracy of up to 97%.
| Document type | Total tested | True tested data | Accuracy (%) |
|---|---|---|---|
| Invoices | 200 | 186 | 93 |
| Forms | 150 | 146 | 97 |
| Reports | 100 | 95 | 95 |
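As a worked check of the arithmetic in Table 1, the sketch below recomputes each accuracy figure from the counts. It reads the third column of the table as the number of correctly classified documents, which is an assumption; the pooled figure across all three document types is derived from the same counts and is not stated in the table.

```python
# Rows of Table 1: document type -> (total tested, third-column count).
# The third column is assumed to be the number of correct classifications.
TABLE_1 = {
    "Invoices": (200, 186),
    "Forms": (150, 146),
    "Reports": (100, 95),
}


def accuracy_percent(total: int, correct: int) -> float:
    """Accuracy as a percentage of the documents tested."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


per_type = {name: accuracy_percent(total, correct) for name, (total, correct) in TABLE_1.items()}
overall = accuracy_percent(
    sum(total for total, _ in TABLE_1.values()),
    sum(correct for _, correct in TABLE_1.values()),
)

if __name__ == "__main__":
    for name, value in per_type.items():
        print(f"{name}: {value:.1f}%")
    print(f"Overall: {overall:.1f}%")
```

The per-type values round to the 93%, 97%, and 95% shown in the table (146 of 150 is approximately 97.3%), and pooling the counts gives 427 of 450 documents, or approximately 94.9%.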
[0001] FIG. 3 illustrates an exemplary method 300 for finding document similarity, according to some embodiments herein. At step 302, the method 300 starts. At step 304, the user uploads the original document 102 to the system 100. At step 306, the user uploads the test document 104 to the system 100. At step 308, the original document 102 and the test document 104 are processed by Tesseract OCR. At step 310, the original document 102 and the test document 104 are processed by the CNN to extract features. At step 312, the text similarity is obtained. At step 314, the features are normalized. At step 316, the similarity scores are compared. At step 318, the method 300 checks whether S = 1. At step 320, if S = 1, the test document 104 is a fake document. At step 322, if S is not equal to 1, the test document 104 is a valid document. At step 324, the method 300 ends.
[0002] FIG. 4 illustrates a flow chart showing a hybrid method 400 for finding document similarity, according to some embodiments herein. At step 402, the method 400 includes receiving, by a controller, an original document and a test document from a user. At step 404, the method 400 includes extracting, by the controller, text features and image features from the original document and the test document. At step 406, the method 400 includes matching, by the controller, the similarity and a dissimilarity between the text features and image features of both the original document and the test document.
[0003] At step 408, the method 400 includes scoring, by the controller, the text features similarity of the original document with the test document. At step 410, the method 400 includes scoring, by the controller, the image features dissimilarity of the original document with the test document. At step 412, the method 400 includes validating, by the controller, the test document as fake or original based on the scoring of the text features and the image features of the original document with the test document.
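In a non-limiting illustration, the validating step 412 may be sketched as a pair of threshold checks on the two scores produced at steps 408 and 410. The description does not state the decision rule for the method 400, so the thresholds, their direction (high text similarity and low image dissimilarity indicating an original), and the function name are all assumptions.

```python
def validate_document(
    text_similarity: float,
    image_dissimilarity: float,
    min_text_similarity: float = 0.9,    # hypothetical threshold
    max_image_dissimilarity: float = 0.1,  # hypothetical threshold
) -> str:
    """Label the test document as "original" or "fake" from its two scores.

    The document is labelled "original" only when both checks pass:
    the text is similar enough and the image features differ little.
    """
    text_ok = text_similarity >= min_text_similarity
    image_ok = image_dissimilarity <= max_image_dissimilarity
    return "original" if text_ok and image_ok else "fake"
```

Requiring both checks to pass means that a test document whose text matches the original but whose visual features differ (for example, an altered seal or signature) is still labelled as fake, which is the case a text-only comparison would miss.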
[0004] An advantage of the system 100 is that, by combining textual and visual analysis, the system can more accurately assess the similarity between documents, going beyond simple text matching.
[0005] An advantage of the system 100 is that the system 100 can capture semantic meaning and context, leading to more accurate similarity assessments.
[0006] An advantage of the system 100 is that the system 100 is designed to handle a variety of document types, including invoices, forms, reports, and other structured documents.
[0007] An advantage of the system 100 is that by integrating multiple analysis techniques, the system can more effectively detect fraudulent documents, such as forgeries, manipulations, and plagiarism.
[0008] An advantage of the system 100 is that the system 100 can help identify potential fraud early on, enabling timely intervention and mitigating losses.
[0009] An advantage of the system 100 is that the automated fraud detection performed by the system 100 can significantly reduce the time and resources required for manual review.
[00010] An advantage of the system 100 is that the automated document analysis performed by the system 100 can streamline various business processes, such as invoice processing, data entry, and customer service.
[00011] An advantage of the system 100 is that by providing accurate and timely information, the system can support better decision-making in areas such as risk management, compliance, and business intelligence.
[00012] An advantage of the system 100 is that by detecting and preventing fraud, the system can help protect businesses from financial losses and reputational damage.
[00013] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practised with modification within the scope of the appended claims.
Claims: We claim:
1. A hybrid system for finding document similarity, the system comprising:
a memory (108);
at least one processor (110); and
a controller (112) connected to the memory (108) and the at least one processor (110), the controller (112) being configured to:
receive an original document and a test document from a user;
extract text features and image features from the original document and the test document;
match the similarity and a dissimilarity between the text features and image features of both the original document, and the test document;
score the text features similarity of the original document with the test document;
score the image features dissimilarity of the original document with the test document; and
validate the test document as fake or original based on the scoring of the text features and the image features of the original document with the test document.
2. The system as claimed in claim 1, wherein the text features are extracted using a Tesseract OCR.
3. The system as claimed in claim 1, wherein the image features are extracted using a Convolutional Neural Network (CNN).
4. The system as claimed in claim 1, wherein the system further comprises a machine learning module for training the system to learn patterns, identify anomalies, and improve accuracy.
5. The system as claimed in claim 1, wherein the system further comprises Natural Language Processing (NLP) for analyzing text for semantic meaning and relationships.
6. The system as claimed in claim 1, wherein the text similarity scores and image feature dissimilarity scores are combined to create an overall similarity score “S” to assess a degree of similarity between documents.
7. A hybrid method for finding document similarity, the method comprising:
receiving, by a controller, an original document and a test document from a user;
extracting, by the controller, text features and image features from the original document and the test document;
matching, by the controller, the similarity and a dissimilarity between the text features and image features of both the original document, and the test document;
scoring, by the controller, the text features similarity of the original document with the test document;
scoring, by the controller, the image features dissimilarity of the original document with the test document; and
validating, by the controller, the test document as fake or original based on the scoring of the text features and the image features of the original document with the test document.
8. The method as claimed in claim 7, wherein the method further comprises training, by a machine learning module, the system to learn patterns, identify anomalies, and improve accuracy.
9. The method as claimed in claim 7, wherein the method further comprises analysing, by a Natural Language Processing (NLP), text for semantic meaning and relationships.
10. The method as claimed in claim 7, wherein the text similarity scores and image feature dissimilarity scores are combined to create an overall similarity score “S” to assess a degree of similarity between documents.
| # | Name | Date |
|---|---|---|
| 1 | 202521037294-STATEMENT OF UNDERTAKING (FORM 3) [17-04-2025(online)].pdf | 2025-04-17 |
| 2 | 202521037294-REQUEST FOR EARLY PUBLICATION(FORM-9) [17-04-2025(online)].pdf | 2025-04-17 |
| 3 | 202521037294-POWER OF AUTHORITY [17-04-2025(online)].pdf | 2025-04-17 |
| 4 | 202521037294-MSME CERTIFICATE [17-04-2025(online)].pdf | 2025-04-17 |
| 5 | 202521037294-FORM28 [17-04-2025(online)].pdf | 2025-04-17 |
| 6 | 202521037294-FORM-9 [17-04-2025(online)].pdf | 2025-04-17 |
| 7 | 202521037294-FORM FOR SMALL ENTITY(FORM-28) [17-04-2025(online)].pdf | 2025-04-17 |
| 8 | 202521037294-FORM FOR SMALL ENTITY [17-04-2025(online)].pdf | 2025-04-17 |
| 9 | 202521037294-FORM 18A [17-04-2025(online)].pdf | 2025-04-17 |
| 10 | 202521037294-FORM 1 [17-04-2025(online)].pdf | 2025-04-17 |
| 11 | 202521037294-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [17-04-2025(online)].pdf | 2025-04-17 |
| 12 | 202521037294-EVIDENCE FOR REGISTRATION UNDER SSI [17-04-2025(online)].pdf | 2025-04-17 |
| 13 | 202521037294-DRAWINGS [17-04-2025(online)].pdf | 2025-04-17 |
| 14 | 202521037294-COMPLETE SPECIFICATION [17-04-2025(online)].pdf | 2025-04-17 |
| 15 | Abstract.jpg | 2025-05-02 |
| 16 | 202521037294-FER.pdf | 2025-05-22 |
| 17 | 202521037294-FORM-8 [28-05-2025(online)].pdf | 2025-05-28 |
| 18 | 202521037294-Retyped Pages under Rule 14(1) [07-08-2025(online)].pdf | 2025-08-07 |
| 19 | 202521037294-FER_SER_REPLY [07-08-2025(online)].pdf | 2025-08-07 |
| 20 | 202521037294-CORRESPONDENCE [07-08-2025(online)].pdf | 2025-08-07 |
| 21 | 202521037294-CLAIMS [07-08-2025(online)].pdf | 2025-08-07 |
| 22 | 202521037294-2. Marked Copy under Rule 14(2) [07-08-2025(online)].pdf | 2025-08-07 |
| 1 | 202521037294_SearchStrategyNew_E_Search_294E_20-05-2025.pdf |