A Parallelized Influenced Citation and Semantic Similarity Analysis for Medical Documents Using BioBERT or ClinicalBERT with WMS or WMD

Abstract: The invention presents a parallelized citation analysis framework for medical research articles using a modified BioBERT or ClinicalBERT model. Traditional citation analysis methods rely on keyword matching and cosine similarity, which fail to capture the semantic relationships between research papers. The proposed model employs Word Mover’s Distance (WMD) instead of cosine similarity to improve accuracy in determining citation influence. By leveraging high-performance computing (HPC), the system parallelizes embedding generation and similarity computation, significantly reducing processing time. The proposed approach achieves a 4x speedup compared to non-parallelized methods while enhancing the accuracy of semantic citation similarity measurement. This invention is a scalable and efficient solution for citation impact analysis in biomedical research.

Patent Information

Application #

202541024395

Filing Date

19 March 2025

Publication Number

13/2025

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Applicants

Andhra University

Andhra University, Visakhapatnam-530003, Andhra Pradesh, India.

Inventors

1. Majji Venkata Kishore

Research Scholar, Department of CS&SE, Andhra University College of Engineering, Andhra University, Visakhapatnam-530003, Andhra Pradesh, India.

2. Dr. Prajna Bodapati

Professor, Department of CS&SE, Andhra University College of Engineering, Andhra University, Visakhapatnam-530003, Andhra Pradesh, India.

Specification

Description: The present invention introduces a parallelized citation analysis framework that uses a modified BioBERT or ClinicalBERT model to determine the semantic similarity between medical research papers and their references. Traditional citation analysis methods rely on keyword-based techniques that fail to capture the semantic relationships between documents. This invention addresses these limitations by using deep-learning-based embeddings and, as the similarity measure, Word Mover’s Distance (WMD) in place of cosine similarity. The system is designed to run on high-performance computing (HPC) resources, enabling faster and more accurate citation analysis for large-scale biomedical research datasets.
The process begins with text extraction and preprocessing, wherein research papers and their cited references are collected in PDF format. The system employs pdfplumber to extract the textual content while filtering out elements such as figures, tables, and metadata. The extracted text is then tokenized into individual words using NLTK, and stop words are removed so that the semantic analysis focuses on content-bearing terms. The preprocessed text is converted into a structured format, ready for embedding generation and similarity computation.
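The extraction and preprocessing steps above can be sketched as follows. This is a minimal illustration, not the filed implementation: the function names, the tiny stop-word set, and the regular-expression tokenizer are assumptions standing in for the NLTK tokenizer and stop-word list named in the specification, and the sketch does not filter figures or tables. pdfplumber is imported lazily so that the tokenization part runs without it.

```python
import re

# Tiny illustrative stop-word set (assumption); the specification uses NLTK's list.
STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "and", "in", "to", "is", "for", "with"}
)

def extract_text(pdf_path):
    """Concatenate the text of every page of a PDF using pdfplumber."""
    import pdfplumber  # lazy import: only needed when a real PDF is read

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, tokenize, and drop stop words.

    The regex keeps hyphenated terms such as 'covid-19' as one token.
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

if __name__ == "__main__":
    print(preprocess("The role of BRCA1 in DNA repair"))
```

In a full pipeline, `preprocess(extract_text(path))` would be applied to the base paper and to each cited reference before embedding.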
To analyze the semantic similarity between a base research paper and its cited references, the system utilizes BioBERT or ClinicalBERT, both of which are domain-specific adaptations of BERT pre-trained on large biomedical and clinical text corpora. These models generate dense vector representations (embeddings) for each token in the document. Unlike TF-IDF weights or static Word2Vec embeddings, which assign each word a single context-independent representation, BioBERT and ClinicalBERT capture the contextual meaning of words, so that semantically similar words lie close together in the vector space. The embeddings are computed by a transformer-based architecture, allowing the model to account for long-range dependencies and domain-specific terminology.
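A hedged sketch of the embedding step, assuming the Hugging Face `transformers` library and the publicly available `dmis-lab/biobert-base-cased-v1.1` checkpoint (the specification names neither, and its model is described as modified). BERT-style tokenizers split rare words into WordPiece fragments, so the hypothetical helper `merge_wordpieces` averages fragment vectors back into one vector per word. That averaging is an assumption about how word-level vectors might be obtained.

```python
def merge_wordpieces(tokens, vectors, special=("[CLS]", "[SEP]", "[PAD]")):
    """Average WordPiece vectors ('card', '##io', '##logy') into one per word."""
    words, groups = [], []
    for tok, vec in zip(tokens, vectors):
        if tok in special:
            continue  # skip BERT's bookkeeping tokens
        if tok.startswith("##") and words:
            words[-1] += tok[2:]      # continuation piece: extend current word
            groups[-1].append(vec)
        else:
            words.append(tok)         # first piece of a new word
            groups.append([vec])
    merged = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return words, merged

def embed_words(text, model_name="dmis-lab/biobert-base-cased-v1.1"):
    """Return (words, contextual vectors) for `text`; needs torch + transformers."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    # BERT accepts at most 512 tokens, so longer inputs are truncated here
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (tokens, hidden)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return merge_wordpieces(tokens, hidden.tolist())
```

Swapping `model_name` for a ClinicalBERT checkpoint such as `emilyalsentzer/Bio_ClinicalBERT` would leave the rest of the sketch unchanged. Documents longer than 512 tokens would need to be split into chunks, which this sketch does not do.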
The similarity between a research paper and its references is calculated using a modified Word Mover’s Distance (WMD) approach. Cosine similarity compares a single vector per document, measuring only the angle between the two vectors, and therefore ignores how the individual words of the two documents relate to one another. WMD instead computes the minimum cumulative distance that the embedded words of one document must travel to match the embedded words of the other. This allows the system to recognize conceptually similar but lexically different terms (e.g., "physician" and "doctor") and to determine more accurately the degree of influence a citation has on the research paper. The modified WMD algorithm incorporates a semantic distance matrix to make the similarity computations more precise.
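To make the idea concrete, the sketch below computes the relaxed WMD, a well-known lower bound on WMD in which every word sends all of its weight to its nearest word in the other document. It is not the exact transport problem and not the modified algorithm claimed here, whose details the specification does not give. The two-dimensional vectors are toy values chosen for illustration, and using one vector per word type (rather than one contextual vector per occurrence) is a simplifying assumption.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nbow(tokens):
    """Normalized bag-of-words: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def one_sided_cost(src, dst, vectors):
    """Each source word moves all its weight to its nearest destination word."""
    return sum(
        weight * min(euclidean(vectors[w], vectors[x]) for x in dst)
        for w, weight in src.items()
    )

def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    if not tokens_a or not tokens_b:
        raise ValueError("both documents must contain at least one token")
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))

# Toy embeddings (assumption): 'physician' and 'doctor' are placed close together.
TOY_VECTORS = {
    "physician": (1.0, 0.0),
    "doctor": (1.0, 0.1),
    "treats": (0.0, 1.0),
    "patient": (0.5, 0.5),
    "telescope": (-1.0, -1.0),
}

if __name__ == "__main__":
    base = ["physician", "treats", "patient"]
    print(relaxed_wmd(base, ["doctor", "treats", "patient"], TOY_VECTORS))
    print(relaxed_wmd(base, ["telescope"], TOY_VECTORS))
```

With these toy vectors, the pair that differs only in "physician" versus "doctor" scores far lower (closer) than the pair involving "telescope", which is the behavior the paragraph above describes. An exact WMD would solve the full transport problem, for example with a linear-programming solver.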
Given the computational complexity of transformer-based models and WMD calculations, the invention optimizes performance using parallelized execution and high-performance computing (HPC) techniques. The system parallelizes two major computational steps: (1) word embedding generation using BioBERT/ClinicalBERT, and (2) WMD similarity calculation between document pairs. GPU acceleration is provided through PyTorch’s CUDA support, enabling efficient parallel computation of embeddings. Multi-threading and multiprocessing are also employed, so that document similarity calculations are distributed across multiple CPU cores. The system further incorporates batch processing, allowing large volumes of research papers to be processed together.
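The distribution of similarity calculations across workers might look like the following standard-library sketch. The function names are hypothetical, and `jaccard_distance` is only a cheap placeholder for the WMD computation so that the example runs on its own. A thread pool is the default here for portability. For CPU-bound pure-Python scoring, passing `ProcessPoolExecutor` as `executor_cls` is the variant that actually uses multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def batched(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def jaccard_distance(a, b):
    """Placeholder scorer (assumption) standing in for the WMD computation."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def score_references(base_tokens, references, score_fn,
                     executor_cls=ThreadPoolExecutor, max_workers=4):
    """Score every reference against the base paper, one task per reference.

    Results come back in the same order as `references`.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(partial(score_fn, base_tokens), references))

if __name__ == "__main__":
    base = ["x", "y"]
    refs = [["x", "y"], ["x", "z"], ["q"]]
    print(score_references(base, refs, jaccard_distance))
    print(list(batched(refs, 2)))
```

When a process pool is used, `score_fn` must be a picklable module-level function, and on platforms that start workers with "spawn" the calling code belongs under an `if __name__ == "__main__":` guard. Moving embedding generation to a GPU is a separate step and is not shown here.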
To enhance scalability, the system constructs a citation similarity matrix that stores pairwise similarity scores between documents. Each entry in this matrix represents the computed semantic influence of a cited reference on the research paper. The citation similarity matrix is then analyzed to identify highly influential citations, distinguishing them from background references that do not contribute significantly to the research. This feature is particularly useful for automated literature reviews, plagiarism detection, and research impact assessment in biomedical sciences.
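One way to picture the citation similarity matrix is as a nested mapping from each paper to its references. In the sketch below, the conversion `1 / (1 + distance)` and the `0.5` threshold are illustrative assumptions; the specification does not state how distances become similarity scores or where the line between influential and background citations is drawn.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a score in (0, 1]; 0 distance -> 1.0."""
    return 1.0 / (1.0 + distance)

def build_similarity_matrix(distances):
    """Convert {paper: {reference: distance}} into the same shape of similarities."""
    return {
        paper: {ref: distance_to_similarity(d) for ref, d in refs.items()}
        for paper, refs in distances.items()
    }

def split_citations(row, threshold=0.5):
    """Rank one paper's references and split them at `threshold` (assumed value)."""
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    influential = [ref for ref, score in ranked if score >= threshold]
    background = [ref for ref, score in ranked if score < threshold]
    return influential, background

if __name__ == "__main__":
    # Hypothetical distances between one paper and three of its references
    distances = {"paper": {"ref1": 0.0, "ref2": 1.0, "ref3": 3.0}}
    matrix = build_similarity_matrix(distances)
    print(matrix["paper"])
    print(split_citations(matrix["paper"]))
```

In practice the threshold would have to be calibrated on labeled examples of influential and background citations.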
Claims:
1. A parallelized citation analysis system that employs BioBERT or ClinicalBERT models to compute semantic similarity between medical research papers and their references.
2. A method for text extraction and preprocessing, including tokenization and stop-word removal, to prepare research papers for citation analysis.
3. A modified Word Mover’s Distance (WMD) algorithm for accurately measuring semantic similarity between medical documents.
4. A parallelized execution framework using GPU acceleration and multi-threading to optimize computation time.
5. A technique for differentiating influenced citations from non-influenced citations using contextual analysis.
6. A method for integrating BioBERT or ClinicalBERT embeddings with HPC-based parallel computing to improve performance.
7. A system for handling large-scale citation analysis in medical research by distributing computations across multiple processing units.
8. A framework that ensures scalability by efficiently processing thousands of medical research documents.

Documents

Application Documents

#	Name	Date
1	202541024395-FORM 1 [19-03-2025(online)].pdf	2025-03-19
2	202541024395-COMPLETE SPECIFICATION [19-03-2025(online)].pdf	2025-03-19
3	202541024395-FORM-9 [20-03-2025(online)].pdf	2025-03-20