Document Vector Store

Abstract: The invention presents a document-search framework that makes retrieving specific information from several thousand documents, especially PDF files, as simple as possible. Document content and metadata are made searchable using a vector-based approach with state-of-the-art vectorization techniques and vector databases such as MilvusDB. Document content is prepared for vectorization using preprocessing methods such as stop word removal, tokenization, and normalization. The vector data are stored in the database and indexed with techniques that use cosine similarity to enable fast and accurate retrieval of relevant results. It is an offline system that runs locally, enhancing privacy and dependability in the absence of a network. The solution is scalable and flexible and supports every document format; thus, it is a one-stop solution for students, research scholars, and working professionals. Spending less time searching through documents increases productivity, and the metadata-enriched results make the system user-friendly.

Patent Information

Application #
Filing Date
02 December 2024
Publication Number
50/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

Maheswaran T
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
Prames M
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
Priyadharshini K
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
Sarika O R
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
Yogeswari S
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA

Inventors

1. Maheswaran T
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
2. Prames M
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
3. Priyadharshini K
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
4. Sarika O R
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA
5. Yogeswari S
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SRI SHAKTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY, L&T BYPASS ROAD, COIMBATORE-641062 TAMILNADU, INDIA

Specification

4. DESCRIPTION
The proposed invention changes how users search for specific information in vast
collections of documents, especially PDFs.
This fast, reliable, resource-efficient, and accurate system uses vector databases
such as MilvusDB, vectorization, and advanced techniques for document
searching.
It is particularly valuable to researchers and practitioners,
especially because it requires no network to retrieve information from different
documents.
Additionally, the system returns metadata-rich results that let the user quickly
locate the document, page, and line of each search result.
Moreover, this flexible design can manage all document types, which makes it useful
in various domains.
BACKGROUND ART
Most modern document searching methodologies adopt brute-force search, which
becomes quite inefficient and resource-hungry when dealing with large data sets or
many documents.
Researchers and postgraduates often spend a lot of time sifting
through a great number of PDF files in search of a single reference.
The existing solutions fail to provide a seamless, fast, and precise means
of searching across document collections and thus call for a new, more
advanced approach based on vector-based methods.
Traditional databases also lack the capacity for efficient storage and
querying of vectorized data, which poses search challenges.
Furthermore, these techniques lack accuracy and return incorrect results, which
in turn raises the time and manual effort required for filtering.
The present invention overcomes these disadvantages by providing a more efficient,
accurate, and scalable solution for document searching.
NOVEL SYSTEM AND METHOD
SOFTWARE IMPLEMENTATION:
Converting PDF to Text:
Using tools such as pdf-to-text, input PDF documents are transformed
from source files into machine-readable text. The conversion adapts to the content
and structure of each document so that the text is ready for further processing.
Text Preprocessing:
Text preprocessing applies tokenization and normalization and removes stop words,
producing standardized, meaningful input for vectorization.
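The preprocessing step can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice of lowercasing as the normalization are assumptions made only for illustration.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption for this sketch);
# a practical system would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, and remove stop words from a piece of text."""
    normalized = text.lower()                        # normalization: case folding
    tokens = re.findall(r"[a-z0-9]+", normalized)    # tokenization: alphanumeric runs
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


print(preprocess("The Storage of Vectors in a Database"))
# prints ['storage', 'vectors', 'database']
```

The same function would be applied to both the document lines and the user query so that the two are compared in the same token space.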
Vectorization:
Preprocessed text is vectorized using n-gram models or pre-trained embeddings, with
the document name, page, and line numbers attached as metadata.
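As a rough illustration of the n-gram option, the sketch below represents one line of text as a sparse count vector of word unigrams and bigrams and attaches the metadata named in the description. The `Record` class, the `ngram_vector` function, and the file name `thesis.pdf` are hypothetical; pre-trained embeddings would produce dense vectors instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    """One vectorized line of text plus the metadata used to locate it."""
    document: str
    page: int
    line: int
    vector: Counter  # sparse n-gram counts (assumed representation)


def ngram_vector(tokens: list[str], n: int = 2) -> Counter:
    """Count word unigrams together with word n-grams of length n."""
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


# Hypothetical example: line 12 on page 3 of a file called thesis.pdf.
record = Record(
    document="thesis.pdf",
    page=3,
    line=12,
    vector=ngram_vector(["vector", "database", "search"]),
)
print(sorted(record.vector))
# prints ['database', 'database search', 'search', 'vector', 'vector database']
```

Keeping the metadata next to the vector is what allows a later search hit to be reported as a document, page, and line rather than as a bare score.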
Storing in the Vector Database:
The vectorized data and metadata are stored using an indexing technique such as
HNSW in a database like MilvusDB for memory-efficient similarity search.
Similarity Search:
The user-provided query is vectorized, and its vector is compared
with the stored vectors using cosine similarity to return the closest relevant
results together with their corresponding metadata.
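The cosine-similarity comparison can be sketched in a few lines of Python. This is an in-memory, exhaustive scan written for clarity; it is not the HNSW-indexed MilvusDB search described above, and the names `cosine`, `search`, and the sample store are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: Counter, store: list[tuple[dict, Counter]], top_k: int = 3):
    """Return the top_k (score, metadata) pairs, best match first."""
    scored = [(cosine(query, vec), meta) for meta, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical stored entries: metadata plus a sparse vector for each line.
store = [
    ({"document": "a.pdf", "page": 1, "line": 4}, Counter(["vector", "database"])),
    ({"document": "b.pdf", "page": 7, "line": 2}, Counter(["cosine", "similarity"])),
]

results = search(Counter(["vector", "search"]), store, top_k=1)
score, meta = results[0]
print(meta["document"], meta["page"], meta["line"], round(score, 3))
```

An approximate index such as HNSW serves the same purpose as this scan but avoids comparing the query against every stored vector, which matters once the collection grows large.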
User Interface and Output:
The results are presented through either a GUI or a CLI, allowing
users to search for specific information and receive metadata-rich output.

5. CLAIMS:
I/We Claim,
i. A document search system using a vector-based store for efficient retrieval
of information from large document collections.
ii. The system employs a vector database, such as MilvusDB, to store
vectorized content of documents with metadata consisting of the document
name, page, and line numbers.
iii. Preprocessing the contents of a PDF document through stopword
elimination, tokenization, and normalization prepares it for vectorization.
iv. The vectorization process uses n-gram models or pre-trained embeddings to
represent text as vectors for similarity searching.
v. The system performs similarity searches using cosine similarity and returns
the most relevant results accurately and efficiently.
vi. Metadata-rich results, offline functionality, and scalability enable
the system to be used across diverse formats and large collections without
performance issues.

Documents

Application Documents

# Name Date
1 202441094635-Form 9-021224.pdf 2024-12-05
2 202441094635-Form 5-021224.pdf 2024-12-05
3 202441094635-Form 3-021224.pdf 2024-12-05
4 202441094635-Form 2(Title Page)-021224.pdf 2024-12-05
5 202441094635-Form 1-021224.pdf 2024-12-05