Abstract: This disclosure relates generally to a method and system for identifying a gene expression map through a super-resolved histological image. State-of-the-art methods of gene expression profiling, such as spatial transcriptomics, require an incredibly expensive experimental procedure that limits their widespread dissemination in the healthcare and diagnostic sectors. The present disclosure addresses these problems through a method of identifying a tissue-wide gene expression map by performing super-resolution of gene expression in a tissue section, which contains a set of spots having a known gene expression profile. Pixels are extracted from the area surrounding the spots having the known gene expression. A machine learning model is trained on these spots to predict the gene expression at additional spots devoid of gene expression across the entire histological image. Finally, the gene expression map is prepared by combining the spots with known gene expression and the additional spots with predicted gene expression.
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR GENERATING TISSUE-WIDE GENE EXPRESSION MAP BY SUPER-RESOLVING GENE EXPRESSION WITHIN A TISSUE SECTION
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to gene expression identification, and, more particularly, to systems and methods for generating a gene expression map using deep learning.
BACKGROUND
[002] Deregulated expression of ribonucleic acid (RNA) can lead to many human diseases, including cancer. Over 170 chemical modifications have been identified in protein-coding and noncoding RNAs and shown to exhibit broad impacts on gene expression. Dysregulation of RNA modifications caused by aberrant expression of RNA modifiers reprograms the epitranscriptome and skews global gene expression, which in turn leads to carcinogenesis. Gene expression is the process by which the information encoded in a gene is converted into a functional gene “product”, typically a protein or an RNA molecule that performs highly specific functions. Protein-coding genes are expressed through a molecule called messenger RNA (mRNA), which is read as a sequence of triplets based on which the protein is synthesized. In most cases, the amount of mRNA produced is directly correlated with the amount of protein synthesized. RNA plays a key role in the execution of key cellular processes, and its expression is tightly regulated through internal mechanisms and feedback loops which respond to external stimuli. Aberrant expression of RNA can be a result of errors in these regulatory mechanisms and is often associated with cancer-related abnormalities. Thus, measuring RNA expression, typically done by sequencing and counting the amount of RNA produced, can provide key insights which may help in the identification of errant regulatory mechanisms and aid in proposing therapeutic pathways.
[003] Spatial transcriptomics technologies enable researchers to accurately quantify and localize messenger ribonucleic acid (mRNA) transcripts at a high resolution while preserving their spatial context. Spatial transcriptomics technologies have facilitated the profiling of genome-wide readouts and the documentation of the spatial locations of individual cells. This wealth of information on gene expressions and their spatial contexts has enabled researchers to study cancer initiation and disease progression. Breakthroughs in a number of new spatial transcriptomics (ST) technologies have made it possible to accurately locate cells. In situ hybridization (ISH)-based technologies (e.g., smFISH, MERFISH, and seqFISH) hybridize the targeted RNA sequences with pre-designed probes and use spectral barcodes or sequential imaging technologies to capture fluorescent signals for transcript identification. However, these technologies lack the capacity to discover new transcriptomes and isoforms. In situ sequencing (ISS)-based technologies (e.g., FISSEQ and STARmap) use micron- or nanometer-sized DNA balls to enhance RNA signals to achieve ISS but can only capture a limited number of genes. In addition, ISH-based and ISS-based technologies require a highly sensitive single-molecule fluorescence imaging system, complex repeated imaging, and further complex image analysis processes. Current ST technologies suffer from many challenges. Firstly, the cost of generating ST data is very high, which discourages its widespread use. Multi-omics data generated from ST technologies can be integrated into different analysis tasks, but the corresponding tools require higher analytical capabilities. Further, scoring and ranking of the features extracted from multiple omics data becomes challenging when the available evidence for the same event is inconsistent. For example, both the transcriptome and the proteome can reveal the gene expression, but they are generally not identical. Another challenge is to achieve finer detail at the single-cell level, higher resolution, higher sensitivity, and a larger field of view at the microscopic level.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method of generating a tissue-wide gene expression map by super-resolving gene expression in a tissue section is provided. The method includes extracting, via one or more hardware processors, pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image where gene expression has already been measured. An area of WxH pixels surrounding the pixel coordinate of each spot is extracted, where W and H are the width and height, respectively, of the area, selected based on the original spot density and for optimal super-resolution. Each pixel in the histological image represents a single cell, and the pixels surrounding each spot having gene expression are extracted to cover the neighboring cells influencing the gene expression. The method further includes creating, via the one or more hardware processors, a plurality of vector embeddings of the plurality of spot images using a pre-trained convolutional neural network (CNN) model. The plurality of spot images represented in the RGB format are passed through “KimiaNet”, the CNN model, to generate “spot embeddings”. The CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having NxWxHx3 dimensions and converts the spot images to embedding vectors with Nx1024 dimensions, wherein N is the number of spots, W is the width of each of the plurality of spot images, H is the height of each spot image, and 3 corresponds to the number of channels in the RGB representation. The method further includes learning, via the one or more hardware processors, by a pre-trained Graph Convolutional Network (GCN) model, to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue. First, the plurality of spot images having known gene expression are processed using the “KimiaNet” CNN to generate “embeddings”, which are used to represent nodes in a K-Nearest Neighbour (KNN) graph, constructed by connecting each node with the K other nodes closest to it in terms of Euclidean distance. The known gene expression scores of the nodes and the KNN graph are utilized in training the GCN module. The KNN graph is re-computed by processing the plurality of additional nodes and their corresponding embeddings covering the entire tissue. A gene expression score is generated for the additional nodes by the trained GCN model using the re-computed KNN graph to obtain a tissue-wide gene expression map. The method further includes combining, via the one or more hardware processors, the spot images with gene expression and a plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map. The gene expression map obtained by combining the spot images with known gene expression and the additional images for which gene expression is identified using the method of the present disclosure is utilized in estimating the mRNA count, which is indicative of cellular-level abnormalities.
[005] In another aspect, a system for generating a tissue-wide gene expression map by super-resolving gene expression in a tissue section is provided. The system includes at least one memory storing programmed instructions; one or more Input/Output (I/O) interfaces; one or more hardware processors; and a gene expression identification map model, operatively coupled to a corresponding at least one memory, wherein the system is configured to extract, via the one or more hardware processors, pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image where gene expression has already been measured. An area of WxH pixels surrounding the pixel coordinate of each spot is extracted, where W and H are the width and height, respectively, of the area, selected based on the original spot density and for optimal super-resolution. Each pixel in the histological image represents a single cell, and the pixels surrounding each spot having gene expression are extracted to cover the neighboring cells influencing the gene expression. The system is configured to create, via the one or more hardware processors, a plurality of vector embeddings of the plurality of spot images using a pre-trained convolutional neural network (CNN) model. The plurality of spot images represented in the RGB format are passed through “KimiaNet”, the CNN model, to generate “spot embeddings”. The CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having NxWxHx3 dimensions and converts the spot images to embedding vectors with Nx1024 dimensions, wherein N is the number of spots, W is the width of each of the plurality of spot images, H is the height of each spot image, and 3 corresponds to the number of channels in the RGB representation. Further, the system is configured to learn, via the one or more hardware processors, by a pre-trained Graph Convolutional Network (GCN) model, to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue. First, the plurality of spot images having known gene expression are processed using the “KimiaNet” CNN to generate “embeddings”, which are used to represent nodes in a K-Nearest Neighbour (KNN) graph, constructed by connecting each node with the K other nodes closest to it in terms of Euclidean distance. The known gene expression scores of the nodes and the KNN graph are utilized in training the GCN module. The KNN graph is re-computed by processing the plurality of additional nodes and their corresponding embeddings covering the entire tissue. A gene expression score is generated for the additional nodes by the trained GCN model using the re-computed KNN graph to obtain a tissue-wide gene expression map. Further, the system is configured to combine, via the one or more hardware processors, the spot images with gene expression and a plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map. The gene expression map obtained by combining the spot images with known gene expression and the additional images for which gene expression is identified using the method of the present disclosure is utilized in estimating the mRNA count, which is indicative of cellular-level abnormalities.
[006] In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer program for identifying a gene expression map by generating a super-resolution of a histological image comprising a tissue section is provided. The computer readable program, when executed on a computing device, causes the computing device to extract, via one or more hardware processors, pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image where gene expression has already been measured. An area of WxH pixels surrounding the pixel coordinate of each spot is extracted, where W and H are the width and height, respectively, of the area, selected based on the original spot density and for optimal super-resolution. Each pixel in the histological image represents a single cell, and the pixels surrounding each spot having gene expression are extracted to cover the neighboring cells influencing the gene expression. The computer readable program, when executed on a computing device, causes the computing device to create a plurality of vector embeddings of the plurality of spot images using a pre-trained convolutional neural network (CNN) model. The plurality of spot images represented in the RGB format are passed through “KimiaNet”, the CNN model, to generate “spot embeddings”. The CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having NxWxHx3 dimensions and converts the spot images to embedding vectors with Nx1024 dimensions, wherein N is the number of spots, W is the width of each of the plurality of spot images, H is the height of each spot image, and 3 corresponds to the number of channels in the RGB representation. The computer readable program, when executed on a computing device, causes the computing device to learn, by a pre-trained Graph Convolutional Network (GCN) model, to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue. First, the plurality of spot images having known gene expression are processed using the “KimiaNet” CNN to generate “embeddings”, which are used to represent nodes in a K-Nearest Neighbour (KNN) graph, constructed by connecting each node with the K other nodes closest to it in terms of Euclidean distance. The known gene expression scores of the nodes and the KNN graph are utilized in training the GCN module. The KNN graph is re-computed by processing the plurality of additional nodes and their corresponding embeddings covering the entire tissue. A gene expression score is generated for the additional nodes by the trained GCN model using the re-computed KNN graph to obtain a tissue-wide gene expression map. The computer readable program, when executed on a computing device, causes the computing device to combine the spot images with gene expression and a plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map. The gene expression map obtained by combining the spot images with known gene expression and the additional images for which gene expression is identified using the method of the present disclosure is utilized in estimating the mRNA count, which is indicative of cellular-level abnormalities.
BRIEF DESCRIPTION OF THE DRAWINGS
[007] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[008] FIG. 1 illustrates an exemplary block diagram of a system for generating tissue-wide gene expression map by super-resolving gene expression within a tissue section, according to some embodiments of the present disclosure.
[009] FIG. 2 is a diagram that illustrates essential components of the system 100, with sequential functionalities for generating a tissue-wide gene expression map, according to some embodiments of the present disclosure.
[010] FIG. 3 is a flow diagram of an illustrative method for generating a tissue-wide gene expression map by super-resolving gene expression within a tissue section, according to some embodiments of the present disclosure.
[011] FIG. 4 depicts a sample Haematoxylin and Eosin (H&E) stained tissue section image, taken on a slide under a microscope, that is received by the system as input to identify the corresponding gene expression map, according to some embodiments of the present disclosure.
[012] FIG. 5 depicts a super-resolution template with black spots representing the areas in the tissue where gene expression has been measured and grey spots representing the areas where the disclosed method predicts gene expression counts, according to some embodiments of the present disclosure.
[013] FIG. 6 illustrates segregation of spot images to prepare a training dataset, a test dataset, and a validation dataset, according to some embodiments of the present disclosure.
[014] FIG. 7 illustrates pixel extraction from the H&E-stained histological image, according to some embodiments of the present disclosure.
[015] FIG. 8 illustrates generation of spot embeddings using the KimiaNet convolutional neural network (CNN) model, according to some embodiments of the present disclosure.
[016] FIG. 9 illustrates k-nearest neighbor graph generation using spot embeddings, according to some embodiments of the present disclosure.
[017] FIG. 10 depicts a two-layer Graph Convolutional Network (GCN) model trained to predict gene expression of the spots, according to some embodiments of the present disclosure.
[018] FIG. 11 illustrates computation of Pearson Correlation Coefficients (PCCs) for a training set, a test set and a validation set for each gene, based on the true and predicted gene expression values, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[019] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[020] As used herein, the term “exemplary” means “serving as an example, instance, or illustration.” Any implementation described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other implementations.
[021] As used herein, the term “pixel” means a two-dimensional unit cell or elementary picture element in a display. Pixels have a set of values associated with them.
[022] As used herein, a “histological image” refers to an image showing the microscopic structure of organic tissue. A “histological feature of interest” means a feature of this microscopic structure. The feature may be of interest for diagnostic or therapeutic purposes, or for scientific research, for instance. Histological specimens are typically used to review the structure to determine a diagnosis or to attempt a prognosis. In the case where the histological images relate to pathologies, the term "histopathological image" may be used.
[023] As used herein, the term “neural network” refers to a machine learning model that can be tuned (e.g., trained) based on inputs to approximate unknown functions. In particular, the term neural network can include a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For instance, the term neural network includes one or more machine learning algorithms. In addition, a neural network can refer to an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data. To illustrate, a neural network can include a convolutional neural network, a recurrent neural network, a generative adversarial neural network, and/or a graph neural network (i.e., a neural network that comprises learned parameters for analyzing a graph topology).
[024] As used herein, the term “vector embedding” refers to a 1024-dimensional vector generated by passing the spot image through a pretrained KimiaNet encoder.
[025] As used herein, the term “graph neural network” refers to the neural network model that learns associations between the graph representation and gene expression.
[026] As used herein, the term “node” refers to the representation, in the graph, of a spot in the tissue space where gene expression is measured.
[027] As used herein, the term “edge” refers to the entity connecting neighboring nodes in the graph.
[028] Gene expression is a process by which the information encoded in a gene is converted into a functional gene “product”, typically a protein or an RNA molecule that performs highly specific functions. Protein-coding genes are expressed through a molecule called messenger RNA (mRNA), which is read as a sequence of triplets based on which the protein is synthesized. In most cases, the amount of mRNA produced is directly correlated with the amount of protein synthesized. RNA plays a key role in the execution of key cellular processes, and its expression is tightly regulated through internal mechanisms and feedback loops which respond to external stimuli. Aberrant expression of RNA can be a result of errors in these regulatory mechanisms and is often associated with cancer-related abnormalities. Thus, measuring RNA expression, typically done by sequencing and counting the amount of RNA produced, can provide key insights which may help in the identification of errant regulatory mechanisms and aid in proposing therapeutic pathways. Bulk RNA sequencing techniques can measure the average amount of RNA produced in a tissue sample. While the data generated using bulk RNA-seq has broad utility in cancer classification, biomarker discovery and more, it cannot do so at the tumor-microenvironment level, and key information may be lost. Spatial Transcriptomics (ST) is a technique which allows gene expression to be measured at the local level in a tissue and was named “Method of the Year” in 2020 by Nature Methods. ST measures gene expression at certain spots in a tissue section, where each spot represents either a single cell or a collection of cells. ST preserves the spatial information while sequencing, can unlock complex spatial structures, and aids in identifying different cell types in the microenvironment and decoding the cross-talk between tumor cells and the microenvironment. ST is necessary to understand tumorigenesis and design effective treatment strategies. The experimental procedure to generate spatial gene expression, however, is incredibly expensive to execute, thus limiting its widespread use in diagnosis, design of therapeutic strategies, and prognosis of cancer. In the recent past, various studies have attempted inference of spatial gene expression by applying machine learning-based methods to histopathological images of tissue sections. Machine learning models are trained on subsections of these images where gene expression has already been measured using ST procedures. These models are then used to impute gene expression at additional spots in the same tissue, or in a totally different tissue, by subjecting its histopathological image to the model. These images are typically obtained through a microscope at various magnification levels, after staining the tissue sample using dyes that allow visual distinction between individual cells and their corresponding nuclear and cytoplasmic matter.
[029] Therefore, in the present disclosure, a Graph Neural Network (GNN)-based method is utilized to analyse a tissue section image stained using Hematoxylin and Eosin (H&E), together with the corresponding spatial gene expression data, to accurately infer gene expression at additional spots within the same tissue section; this is termed “super-resolution” of gene expression. A publicly available dataset containing histological images of tissues and corresponding gene expression data measured at spatial coordinates within each tissue image is used to train and validate the GNN model.
[030] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 11, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
[031] FIG. 1 illustrates an exemplary block diagram of system 100 for generating tissue-wide gene expression map by super-resolving gene expression within a tissue section, according to some embodiments of the present disclosure.
[032] In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104, which are hardware processors, can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like. The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting a number of devices, such as user terminals enabling a user to communicate with the system, or enabling devices to connect with one another or to another server. The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 may include a database or repository. The memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database may be external (not shown) to the system 100 and coupled via the I/O interface 106. The memory 102 further includes a gene expression identification map model 110, which comprises an embedding computation module 110A, a K-nearest neighbor (KNN) module 110B, and a graph convolutional network (GCN) module 110C. The gene expression identification map model 110 super-resolves spatial gene expression in a tissue section through a regressive approach. The model 110 receives a histological image of the tissue section with spots depicting a known gene expression profile. The model 110 maps the spots at which gene expression profiles were measured in the tissue to the pixels in the tissue image. The model 110 further creates vector embeddings of the spots and processes the embeddings using deep neural network methodologies to predict gene expression at spots devoid of measurements throughout the remaining area of the histological image. This results in a super-resolved gene expression map of the entire histological image. The embedding computation module 110A processes the pixels of the histological image and creates vector embeddings.
The area around each spot's pixel coordinate is extracted, which constitutes the spot image corresponding to the gene expression measurement. The spot images are passed through a Convolutional Neural Network (CNN) encoder, pre-trained specifically on histopathological images, which outputs vector embeddings of the spot images. The K-nearest neighbor (KNN) module 110B utilizes the spot images with their corresponding gene expression profiles and the embeddings generated by the embedding computation module 110A as a training set to generate a nearest neighbour graph, in which the nodes are the spots (at which gene expression profiles were measured). The edges in the nearest neighbour graph are determined based on the distances between spots, i.e., each node is connected to the k nodes which are closest to it, where k is a hyperparameter. In the KNN graph, each spot (representing a cell or group of cells in the tissue image) is a node that extends edges to the k other nodes nearest to it. The graph topology thus obtained from the KNN graph, which corresponds to the gene expression, is further resolved to obtain the gene expression map. Therefore, firstly, the plurality of spot images having known gene expression are processed using the “KimiaNet” CNN to generate “embeddings”, which are used to represent nodes in the K-Nearest Neighbour (KNN) graph, constructed by connecting each node with the K other nodes closest to it in terms of Euclidean distance. The known gene expression scores of the nodes and the KNN graph are utilized in training the graph convolutional network (GCN) module 110C. The GCN module 110C processes the node-based topology obtained from the KNN graph using a regression approach, where the spot image embeddings corresponding to each node constitute the “node features”, or the independent variable in the regression problem, and the gene expression scores are the dependent variable. For prediction of gene expression profiles at additional spots within the tissue, the same procedure is extended to generate the graph covering the additional spots, which is passed through the trained GCN to generate the predicted gene expression profiles. The memory 102 further includes a plurality of modules (not shown here) comprising programs or coded instructions that supplement applications or functions performed by the system 100 for executing the different steps involved in generating the gene expression map. The plurality of modules, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules may also be used as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules can be implemented in hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. The plurality of modules can include various sub-modules (not shown).
[033] FIG. 2 is a diagram that illustrates essential components of the system 100, with sequential functionalities for generating a tissue-wide gene expression map, according to some embodiments of the present disclosure.
[034] As illustrated in FIG. 2, the system 100 comprises a gene expression identification map model 110 that performs super-resolution of the histological image by learning from the known gene expression available in the form of limited spots and then identifying gene expression for other spots existing within the entire area covered by the histological image, using deep learning techniques. The gene expression predicted at the spots devoid of measurements is combined with the spots having known gene expression to obtain a super-resolved gene map of the histological image. The gene expression identification map model 110 receives the histological image 202. Histological specimens are typically used to review the structure to determine a diagnosis or to attempt a prognosis. At the microscopic scale, many of the interesting features of cells are not naturally visible, because they are transparent and colorless. To reveal these features, specimens are commonly stained with a marker before being imaged under a microscope. The marker includes one or more colorants (dyes or pigments) that are designed to bind specifically to particular components of the cell structure, thus revealing the histological feature of interest. One commonly used staining system is called H&E (Haematoxylin and Eosin). H&E contains the two dyes haematoxylin and eosin. Eosin is an acidic dye: it is negatively charged. It stains basic (or acidophilic) structures red or pink. Haematoxylin can be considered a basic dye. It is used to stain acidic (or basophilic) structures a purplish blue. DNA (heterochromatin and the nucleolus) in the nucleus, and RNA in ribosomes and in the rough endoplasmic reticulum, are both acidic, and so haematoxylin binds to them and stains them purple. Some extracellular materials (e.g., carbohydrates in cartilage) are also basophilic. Most proteins in the cytoplasm are basic, and so eosin binds to these proteins and stains them pink. This includes cytoplasmic filaments in muscle cells, intracellular membranes, and extracellular fibers. Those skilled in the art will be aware of a number of alternative stains that may be used. Such histological images may be used in particular for evaluating tissues that may be cancerous. It is useful to be able to classify images to determine the expected outcome.
[035] Further, the pre-trained deep learning based embedding computation module 110A of the gene expression identification model 110 creates vector embeddings. Image embeddings are a numerical representation of images encoded into a lower-dimensional vector representation. Image embeddings condense the complexity of visual data into a compact form, which makes it easier for machine learning models to process the semantic and visual features of visual data. According to an embodiment of the present disclosure, from the H&E-stained image, a tissue section having gene expression data measured at various spatial locations, termed “spots”, is selected. The spots are used initially to train the network. For the training purpose, the spots are divided into three datasets: a training dataset, a test dataset and a validation dataset. According to an embodiment of the present disclosure, 40% of the total number of spots in the tissue image constitute the training set. These spots are selected using an algorithm called “farthest point sampling”, which ensures that spots that are farthest from each other are selected. Half of the remaining spots, selected at random, constitute the test set, and the remaining half constitute the validation set. The pre-trained deep learning based embedding computation module 110A processes each spot in each dataset, represented by a “spot image” in the Red-Green-Blue (RGB) format (of dimension 112 x 112 x 3), through “KimiaNet” to generate “spot embeddings”. KimiaNet is a Convolutional Neural Network (CNN) with the same architecture as that of DenseNet-121, with the final classification layer removed, and is pre-trained on histological images. To get an intuition of how effective training from scratch is on the expressiveness of the deep features, the deep features of randomly selected patches from each cancer subtype were extracted using both KimiaNet and a pre-trained DenseNet-121 and visualized after reducing their dimensionality. This visualization illustrates that, for KimiaNet, the instances of each class can easily be distinguished from the others, while for the pre-trained DenseNet the instances of almost all of the classes are mixed together. This comparison is further verification of how discriminative the features have become through training with domain-specific images. Also, four simpler networks, made up of repetitions of convolutional, batch-normalization and ReLU layers (CBR networks), were implemented and compared against KimiaNet to check whether the network design could be further simplified. The experiments demonstrated that KimiaNet features are by far better than those of the CBR networks, which validates DenseNet-121 as a good candidate for KimiaNet’s architecture. Further, using the embeddings, a K-nearest neighbour (KNN) graph 206 of the spots is generated. In an embodiment, the training and test set spots and their corresponding spot embeddings are selected to generate the KNN graph 206 of the spots, in which the spots constitute the nodes. Each spot is connected by an edge to the K other spots that are closest to it. Subsequently, a graph data object is generated which contains the set of edges (represented by indices of source and destination spots) and the spot embeddings generated by the pre-trained deep learning based embedding computation module 110A, which act as “node features”.
The division of the data into training, test and validation sets is only done during the training procedure, with the intent of validating the approach by comparing the predicted and true gene expression of spots in the test and validation sets, whose expression is not seen by the model during training. For practical purposes, the model is trained on a small set of spots with known gene expression and is then used to generate a tissue-wide spatial gene expression map. For generating the KNN graph, it is assumed that spatially close cells have stronger interactions than distant cells, whose cellular interactions are weaker. Accordingly, nearby cells are connected to each other with edges to model their interactions.
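By way of a non-limiting illustration, the “farthest point sampling” selection described above may be sketched in Python as follows; the function name, the seed spot, and the use of NumPy are illustrative assumptions rather than part of the disclosed method.
```python
import numpy as np

def farthest_point_sampling(coords: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedily select n_samples mutually far-apart spot indices.

    coords: (N, 2) array of spot pixel coordinates.
    """
    n = coords.shape[0]
    selected = [0]                    # arbitrary seed spot (illustrative choice)
    min_dist = np.full(n, np.inf)
    for _ in range(n_samples - 1):
        # distance of every spot to the most recently selected spot
        d = np.linalg.norm(coords - coords[selected[-1]], axis=1)
        min_dist = np.minimum(min_dist, d)           # distance to nearest selected spot
        selected.append(int(np.argmax(min_dist)))    # pick the farthest remaining spot
    return np.asarray(selected)

# Roughly 40% of the spots form the training set; the remainder is split
# 50/50 at random into test and validation sets, as described above:
# train_idx = farthest_point_sampling(coords, int(0.4 * len(coords)))
```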
[036] The KNN algorithm is used to build a topology. Euclidean distance between nuclei centroids in the image space is utilized to quantify cellular distances. Subsequently, each spot (node) is connected to the K other spots (nodes) with the least Euclidean distances. An edge (e_uv) is built between spots (nodes) u and v if:
d(u, v) <= d(x_k, v)
where d is the Euclidean distance, and x_k is the k-th nearest neighbour of the spot v, i.e., the spot with the k-th lowest Euclidean distance from v.
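As one non-limiting sketch, this neighbourhood rule may be realized with scikit-learn as below; whether the distances are computed on spot coordinates (as in this paragraph) or on spot embeddings (as in paragraph [032]) is an implementation choice, and the function name is illustrative.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_edges(points: np.ndarray, k: int) -> np.ndarray:
    """Connect each spot to its k nearest neighbours (Euclidean distance).

    points: (N, D) spot coordinates in image space, or spot embeddings.
    Returns a (2, N*k) array of (source, destination) node indices.
    """
    # Query k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(points)
    _, idx = nn.kneighbors(points)
    dst = np.repeat(np.arange(points.shape[0]), k)  # node v
    src = idx[:, 1:].reshape(-1)                    # its k nearest neighbours u
    return np.stack([src, dst])  # each edge e_uv satisfies d(u, v) <= d(x_k, v)
```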
[037] The graph data object thus obtained is utilized in training a Graph Convolutional Network (GCN) 208. The GCN 208 is a two-layered model wherein the first layer takes the edges and the 1024-dimensional node features as input to generate an H-dimensional representation of the nodes. This representation is passed through a dropout layer with a probability of 0.5 during training. The second graph convolutional layer takes the resultant node representation to generate the final predictions, which is a vector of G dimensions, G being the number of genes. In an embodiment, the validation set spot images are also added to generate the graph data object, which is analyzed by the GCN. The resultant graph is then passed through the trained GCN model to generate the gene expression value predictions for the spot images devoid of gene expression. Further, Pearson Correlation Coefficients (PCCs) are computed for each gene, based on the true and predicted gene expression values in the test and validation sets. The mean value of the PCCs over all the genes acts as a validation metric to test the quality of the super-resolved gene expression predictions. The trained GCN 208 thus generates the gene expression map 210 by combining spot images with known gene expression and spot images devoid of gene expression.
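Purely as an illustrative sketch, such a two-layer GCN may be expressed with PyTorch Geometric as follows; the ReLU non-linearity between the layers is an assumption for illustration, and the default values H = 256 and G = 250 follow the example embodiment described later.
```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SpotGCN(torch.nn.Module):
    """Two-layer GCN mapping 1024-dim spot embeddings to G gene scores."""

    def __init__(self, in_dim: int = 1024, hidden_dim: int = 256, n_genes: int = 250):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)   # edges + node features -> H-dim
        self.conv2 = GCNConv(hidden_dim, n_genes)  # H-dim -> G-dim predictions

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.conv1(x, edge_index))            # assumed non-linearity
        h = F.dropout(h, p=0.5, training=self.training)  # dropout with p = 0.5
        return self.conv2(h, edge_index)                 # (N, G) gene expression scores
```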
[038] FIG. 3 is a flow diagram of an illustrative method 300 for generating a tissue-wide gene expression map by super-resolving gene expression within a tissue section, according to some embodiments of the present disclosure.
[039] The steps of method 300 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 through FIG. 11. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously. The system 100 comprises the model 110 that generates a tissue-wide super-resolved gene expression map by learning associations between histological images and the known gene expression of limited spots, and identifying gene expression for other spots existing within the entire tissue area. Existing solutions, in contrast, are limited to the few spots with known gene expression in the H&E-stained tissue section image (presented in FIG. 4) received by the system 100, leaving the remaining spots in the image devoid of gene expression. Therefore, the present disclosure takes an H&E-stained tissue section image (FIG. 4) along with corresponding spatial gene expression data as input, and produces an accurate, high-resolution spatial gene expression profile within the same tissue, as demonstrated in FIG. 5. At step 302 of the method 300, the one or more hardware processors 104 are configured to extract pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image having a gene expression. The histological image is an image of a tissue section stained using Hematoxylin and Eosin (H&E) and observed under a microscope. The H&E-stained image is used by histopathologists to understand the cellular and tissue structures of tumor samples. Histopathologists manually inspect the H&E-stained images of tumor samples and routinely determine tumor grades (Grades I/II/III/IV) and other cellular phenotypes (astrocytoma, oligoastrocytoma, oligodendroglioma, etc.) of tumor biopsy samples of glioma patients. However, histopathologists cannot determine molecular subtypes of glioma tissue by manually inspecting the H&E-stained image. An area of WxH pixels surrounding the pixel coordinate of each spot in the H&E image, corresponding to locations within the tissue where gene expression is measured, is extracted, where W and H are the width and height, respectively, of the area, selected based on the original spot density and for optimal super-resolution. Each pixel in the histological image represents a single cell, and the pixels surrounding the spot having gene expression are extracted to cover the neighboring cells influencing the gene expression. At step 304 of the method 300, the one or more hardware processors 104 are configured to create a plurality of vector embeddings of the plurality of spot images using a convolutional neural network (CNN) model, called “KimiaNet”, which is pre-trained specifically on histological images. A CNN is a modified variety of deep neural network which is suitable for image-related problems. It uses randomly initialized filters at the start and modifies them in the training process.
Once training is done, the network uses these learned filters to predict and validate the result in the testing and validation process. Convolutional neural networks have achieved success in image classification problems, as the defined nature of the CNN matches the data point distribution in the image. The CNN architecture has two main types of transformation. The first is convolution, in which pixels are convolved with a filter or kernel. This step provides the dot product between an image patch and the kernel. The width (W) and height (H) of the filters can be set according to the network, and the depth of the filter is the same as the depth of the input. A second important transformation is subsampling, which can be of many types (max pooling, min pooling and average pooling), used as per requirement. The size of the pooling filter can be set by the user and is generally odd. The pooling layer is responsible for lowering the dimensionality of the data and is quite useful for reducing overfitting. After using a combination of convolution and pooling layers, the output can be fed to a fully connected layer for efficient classification.
[040] According to an embodiment, the plurality of spot images represented in the RGB format are passed through “KimiaNet” to generate “spot embeddings”. KimiaNet is a Convolutional Neural Network (CNN) with the same architecture as that of DenseNet-121, which contains several convolutional and pooling layers and one fully connected layer (or classification layer) at the end. Since the CNN is used only to generate embeddings, the fully connected layer is removed. KimiaNet is pre-trained on histological images. The output of KimiaNet is a 1024-dimensional vector. Therefore, the CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having NxWxHx3 dimensions and converts the spot images to embedding vectors with Nx1024 dimensions, wherein N is the number of spots, W is the width of each of the plurality of spot images, H is the height of each spot image, and 3 represents the number of channels in the RGB representation. Thus, the plurality of spot images are converted to a plurality of embedding vectors. At step 306 of the method 300, the one or more hardware processors 104 are configured to learn, by a pre-trained machine learning model, to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue, by processing the plurality of spot images having known gene expression and the corresponding vector embeddings in the machine learning model comprising a K-nearest neighbors (KNN) graph generation module and a graph convolutional network (GCN) module. The vector embeddings corresponding to the spot images having known gene expression are input to the KNN module 110B to generate the KNN graph. The topology of the KNN graph is such that the spots are represented by the nodes. The edges in the nearest neighbor graph are determined based on the distances between spots, i.e., each node is connected to the k nodes which are closest to it, where k is a hyperparameter. Subsequently, the GCN module 110C processes the KNN graph, modelling it as a node regression problem, where the spot image embeddings corresponding to each node constitute the “node features”, or the independent variable in the regression problem, and the gene expression scores are the dependent variable. The KNN graph is obtained by processing the plurality of spot images having gene expression and their corresponding vector embeddings in the KNN graph generation module, wherein each node of the KNN graph, representing a spot, is connected by an edge to the K other closest nodes, and wherein the K other closest nodes are determined by a Euclidean distance. The GCN is trained in the GCN module 110C using the known gene expression scores of the nodes and the KNN graph. The KNN graph is further re-computed by processing the plurality of additional nodes and their corresponding vector embeddings covering the entire tissue. Further, a gene expression score is generated for the additional nodes by the trained GCN model using the re-computed KNN graph to obtain the gene expression map of the histological image. At step 308 of the method 300, the one or more hardware processors 104 are configured to combine the spot images with gene expression and a plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map.
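By way of a non-limiting illustration, the embedding step may be sketched as follows, assuming the KimiaNet weights are available as a local checkpoint (the file name is illustrative); torchvision's DenseNet-121 supplies the architecture, and replacing its classifier with an identity layer exposes the 1024-dimensional pooled features.
```python
import torch
from torchvision.models import densenet121

def build_embedder(weights_path: str = "KimiaNetPyTorchWeights.pth") -> torch.nn.Module:
    """DenseNet-121 backbone with its classification layer removed.

    Loading KimiaNet weights into this backbone yields the 1024-dimensional
    spot embedder described above; weights_path is an illustrative name.
    """
    net = densenet121()
    state = torch.load(weights_path, map_location="cpu")
    net.load_state_dict(state, strict=False)  # tolerate a mismatched classifier head
    net.classifier = torch.nn.Identity()      # expose the pooled 1024-dim features
    return net.eval()

# spots: (N, 3, 112, 112) float tensor of RGB spot images (the N x W x H x 3
# images permuted to channels-first for PyTorch).
# with torch.no_grad():
#     embeddings = build_embedder()(spots)   # (N, 1024)
```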
Initially, a graph is constructed using the spots with known gene expression, as explained above, and is used to train the model. Subsequently, additional spots, covering the entire tissue where gene expression is not known, are sampled. The resulting graph is then passed through the trained model to generate a tissue-wide gene expression map. The gene expression maps may then be used for downstream analyses, including but not limited to cancer diagnosis, prognosis, deciding therapeutic strategies, etc.
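As one non-limiting sketch tying the steps together, the tissue-wide prediction pass may look as follows; build_knn_edges, SpotGCN and build_embedder refer to the illustrative sketches above, extract_spot_images to the sketch in the use case below, and all variable names (including the value of k) are hypothetical.
```python
import numpy as np
import torch

# coords_known: (N, 2) spots with measured expression; coords_new: additional
# spots sampled over the entire tissue (e.g., roughly 4x the training spots).
all_coords = np.concatenate([coords_known, coords_new])
patches = extract_spot_images(full_image, all_coords)                # step 302
x = torch.as_tensor(patches, dtype=torch.float32).permute(0, 3, 1, 2)
with torch.no_grad():                                                # step 304
    embeddings = build_embedder()(x)      # (N_total, 1024); normalization omitted
edge_index = torch.as_tensor(build_knn_edges(all_coords, k=6),
                             dtype=torch.long)          # re-computed KNN graph
with torch.no_grad():
    expression = model(embeddings, edge_index)          # (N_total, G), step 306
# Rows 0..N-1 retain the measured profiles; the remaining rows carry the
# predicted profiles. Together they form the super-resolved map (step 308).
```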
USE CASE: IDENTIFYING A SUPER-RESOLUTION GENE EXPRESSION MAP FROM AN H&E-STAINED TISSUE SECTION IMAGE ISOLATED FROM BREAST CANCER TISSUES
[041] An example scenario depicting the identification of a super-resolved gene expression map from the H&E-stained histological image of breast cancer tissues by the disclosed system 100 is described with reference to FIGS. 6-11. The system 100 takes an H&E-stained tissue section image along with corresponding spatial gene expression data as input, and produces an accurate, high-resolution spatial gene expression profile within the same tissue. To perform identification of the super-resolved gene expression map from the H&E-stained histological image, a spatial transcriptomic dataset available in the public domain is utilized.
Dataset 1:
• Contains H&E-stained tissue section images and corresponding spatial gene expression data.
• Data is derived from 8 patients, A-H, diagnosed with Her2-positive breast cancer.
• Contains data for 36 tissues, 6 each from patients A-D and 3 each from patients E-H.
• Data also contains annotations of tumor and non-tumor areas in tissue sections A1, B1, ..., H1. This annotation was done by expert pathologists.
• Data was obtained from a spatial transcriptomic study by Andersson et al. (Nature Communications, 2021).
[042] The method starts with a tissue section in which gene expression data is measured at various spatial locations, which are termed “spots”. The spots having known gene expression are divided into three sets: a training set, a test set and a validation set, as depicted in FIG. 6. About 40% of the total number of spots in the tissue constitute the training set. These spots are selected using an algorithm called “farthest point sampling”, which ensures that spots that are farthest from each other are selected. Half of the remaining spots, selected at random, constitute the test set, and the remaining half constitute the validation set. From the H&E-stained tissue section image, an area of WxH pixels surrounding the pixel coordinate of each spot is extracted (for all of the training set, test set and validation set), where W and H are the width and height, respectively, of the area. Based on the original spot density and for optimal super-resolution, W = H = 112 pixels are selected (shown in FIG. 7). Each pixel in the histological image represents a single cell, and the pixels surrounding the spot having gene expression are extracted to cover the neighboring cells influencing the gene expression.
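By way of a non-limiting illustration, this extraction step may be sketched as follows, assuming the whole-slide image is loaded as a NumPy array and the spot pixel coordinates are known; border handling is omitted for brevity.
```python
import numpy as np

def extract_spot_images(image: np.ndarray, spot_coords: np.ndarray,
                        w: int = 112, h: int = 112) -> np.ndarray:
    """Cut a W x H RGB patch centred on each spot's pixel coordinate.

    image: (rows, cols, 3) H&E-stained tissue section image.
    spot_coords: (N, 2) array of (x, y) pixel coordinates of the spots.
    Returns an (N, h, w, 3) array of spot images; spots too close to the
    image border would need padding, which is omitted here.
    """
    patches = []
    for x, y in spot_coords:
        top, left = int(y) - h // 2, int(x) - w // 2
        patches.append(image[top:top + h, left:left + w, :])
    return np.stack(patches)
```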
[043] Each spot of the training set, the test set and the validation set is represented by a “spot image” in the RGB format (dimension 112 x 112 x 3), which is passed through “KimiaNet” to generate “spot embeddings”. KimiaNet is a Convolutional Neural Network (CNN) with the same architecture as that of DenseNet-121, with the final classification layer removed, and is pre-trained on histological images. The output of KimiaNet contains 1024 dimensions (as shown in FIG. 8). Thus, the spot images (with dimensions N x 112 x 112 x 3, N being the number of spots) are converted to an embedding vector (with dimensions N x 1024). Further, the training set and test set spots and their corresponding spot embeddings are selected and provided as input to the KNN module. A KNN graph of the spots is generated in which the spots constitute the nodes, and each spot is connected by an edge to the K other spots that are closest to it. Subsequently, a graph data object is generated which contains the set of edges (represented by indices of source and destination spots) and the spot embeddings generated above, which act as “node features” (FIG. 9).
[044] The graph data object thus obtained is used to train a two-layer Graph Convolutional Network (GCN) model (FIG. 10). The first layer takes the edges and the 1024-dimensional node features as input to generate an H-dimensional representation of the nodes. This representation is passed through a dropout layer with a probability of 0.5 during training. The second graph convolutional layer takes the resultant node representation to generate the final predictions, which is a vector of G dimensions, G being the number of genes. The model is trained for 200 epochs using the Adam optimizer, with a learning rate of 0.001 and a weight decay of 0.0005. A value of 256 is picked for H, after trial-and-error optimization. Further, the graph data object is generated again after adding the validation set spots, by following the above steps. The resultant graph is passed through the trained GCN model to generate the gene expression value predictions for the training set, the test set and the validation set. Then, Pearson Correlation Coefficients (PCCs) are computed for each of the sets and for each gene, based on the true and predicted gene expression values. The mean of the PCCs over all the genes acts as a validation metric to test the quality of the super-resolved gene expression predictions.
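Purely as an illustrative sketch using the hyperparameters stated above, the training loop and the PCC metric may be written as follows; the mean-squared-error regression loss and the mask attributes on the graph data object are assumptions for illustration.
```python
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr

# model = SpotGCN(); "data" is a graph data object holding node features x,
# edge_index, true expression y, and boolean train/test masks (illustrative names).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    pred = model(data.x, data.edge_index)
    # regression loss on training-set spots only (MSE is an assumed choice)
    loss = F.mse_loss(pred[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# per-gene PCC between true and predicted expression, e.g., on the test set
model.eval()
with torch.no_grad():
    pred = model(data.x, data.edge_index)
pccs = [pearsonr(data.y[data.test_mask, g].numpy(),
                 pred[data.test_mask, g].numpy())[0]
        for g in range(pred.shape[1])]
mean_pcc = sum(pccs) / len(pccs)   # validation metric described above
```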
[045] Finally, a set of spots (roughly 4 times the number of training set spots) at which gene expression is to be predicted is selected, and a graph data object is generated by the above-described procedure. This graph is then passed through the trained GCN model to generate gene expression value predictions. To validate the accuracy and effectiveness of the disclosed method, the predicted expression of genes known to be overexpressed in breast cancer is compared with the tumor annotation to assess the quality of the predictions. It has been observed that the genes which are known to be overexpressed in cancer are also predicted by the model to be overexpressed in the areas which are labelled as cancer in the tumor annotation. The method of the present disclosure successfully super-resolved the expression of the 250 genes with the highest mean expression across all tissues in dataset 1, ensuring that a high signal-to-noise ratio is maintained and good-quality predictions are obtained. To demonstrate the quality of the predictions, a comparative study is performed wherein super-resolution of three genes, ERBB2, FASN and FN1, which are known to be overexpressed in Her2-positive breast cancer, is performed as per the method of the present disclosure and using a previously published approach called “DeepSpaCE”. DeepSpaCE leverages CNNs pre-trained on ImageNet data and acts as a benchmark. The genes ERBB2, FN1 and FASN are found to be overexpressed in multiple breast cancer instances, and act as markers of diagnosis and prognosis. The PCC values are computed for various tissue sections (A1, B1, ..., H1) for the above genes, wherein the disclosed method computed the PCC value for all three of the training, test and validation datasets, and DeepSpaCE computed the PCC for the training and test sets. Results of the PCC comparison of both methods are presented in Table-I for the ERBB2 gene, Table-II for the FN1 gene, and Table-III for the FASN gene. As depicted in Tables I-III, a high positive PCC (> 0.5) indicates strong correlation between true and predicted values. A high PCC in the training set and a low PCC in the test/validation sets indicates overfitting (i.e., the model learns the training set well but fails to generalize to unseen samples). The disclosed method outperforms DeepSpaCE in nearly all instances when PCC is compared. Also, DeepSpaCE appears to suffer from the lack of data points, as significant overfitting is observed in nearly all the tissues. The disclosed method indicates strong correlation between true and predicted values in most of the tissues in the breast cancer dataset, and no significant overfitting is observed despite the model being trained on a limited number of data points.
Table-I: PCC values for the ERBB2 gene (significance: gene overexpression in Her2+ breast cancer)

| Tissue Section | Present disclosure (Train) | Present disclosure (Test) | Present disclosure (Validation) | DeepSpaCE (Train) | DeepSpaCE (Test) |
|---|---|---|---|---|---|
| A1 | 0.588 | 0.237 | 0.322 | 0.822 | 0.095 |
| B1 | 0.87 | 0.888 | 0.936 | 0.917 | 0.524 |
| C1 | 0.574 | 0.664 | 0.613 | 0.83 | 0.357 |
| D1 | 0.74 | 0.645 | 0.62 | 0.855 | 0.537 |
| E1 | 0.8 | 0.787 | 0.695 | 0.821 | 0.325 |
| F1 | 0.614 | 0.369 | 0.333 | 0.597 | -0.025 |
| G1 | 0.623 | 0.73 | 0.603 | 0.855 | 0.69 |
| H1 | 0.768 | 0.802 | 0.779 | 0.878 | 0.622 |
Table-II: PCC values for the FN1 gene (significance: gene overexpression in breast cancer)

| Tissue Section | Present disclosure (Train) | Present disclosure (Test) | Present disclosure (Validation) | DeepSpaCE (Train) | DeepSpaCE (Test) |
|---|---|---|---|---|---|
| A1 | 0.523 | 0.126 | 0.31 | 0.802 | 0.094 |
| B1 | 0.831 | 0.767 | 0.776 | 0.936 | 0.602 |
| C1 | 0.451 | 0.465 | 0.164 | 0.933 | 0.131 |
| D1 | 0.669 | 0.659 | 0.58 | 0.829 | 0.418 |
| E1 | 0.501 | 0.41 | 0.479 | 0.797 | 0.127 |
| F1 | 0.414 | 0.173 | 0.261 | 0.68 | 0.08 |
| G1 | 0.61 | 0.659 | 0.571 | 0.863 | 0.345 |
| H1 | 0.511 | 0.504 | 0.352 | 0.808 | 0.16 |
Table-III: PCC values for the FASN gene (significance: gene overexpression in several cancers)

| Tissue Section | Present disclosure (Train) | Present disclosure (Test) | Present disclosure (Validation) | DeepSpaCE (Train) | DeepSpaCE (Test) |
|---|---|---|---|---|---|
| A1 | 0.568 | 0.3 | 0.462 | 0.894 | 0.295 |
| B1 | 0.882 | 0.802 | 0.932 | 0.943 | 0.525 |
| C1 | 0.705 | 0.853 | 0.77 | 0.844 | 0.466 |
| D1 | 0.674 | 0.585 | 0.548 | 0.918 | 0.352 |
| E1 | 0.634 | 0.692 | 0.618 | 0.778 | 0.208 |
| F1 | 0.53 | 0.358 | 0.326 | 0.732 | 0.106 |
| G1 | 0.538 | 0.552 | 0.488 | 0.896 | 0.396 |
| H1 | 0.664 | 0.511 | 0.691 | 0.863 | 0.437 |
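As an illustration of the validation metric described above, per-gene PCC values of the kind reported in Tables I-III can be computed with a short routine such as the following sketch; the function name and array layout are assumptions for illustration only, not taken from the actual implementation.

```python
# Hedged sketch: per-gene Pearson correlation between true and predicted
# expression, averaged over genes to give the validation metric.
import numpy as np
from scipy.stats import pearsonr

def per_gene_pcc(y_true, y_pred):
    """y_true, y_pred: (num_spots, num_genes) arrays of expression values."""
    pccs = np.array([pearsonr(y_true[:, g], y_pred[:, g])[0]
                     for g in range(y_true.shape[1])])
    return pccs, float(pccs.mean())  # per-gene PCCs and their mean

# The metric would be evaluated separately on the training, test and
# validation splits, as for the values reported in Tables I-III.
```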
[046] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[047] The embodiments of the present disclosure herein address the unresolved problem of identifying the gene expression map of the tissue section captured in the histological image. The disclosed method utilizes a GNN-based approach to obtain a super-resolved gene expression profile. From the histological image, pixels surrounding a plurality of spots having a known gene expression profile are extracted, and a corresponding KNN graph is prepared and introduced to a ML model. The ML model trains on the KNN graph and predicts the gene expression at the additional spots devoid of gene expression measurements, which cover the entire area of the histological image. The spots having known gene expression and the additional spots with predicted gene expression are combined to construct the gene expression map of the entire tissue section captured by the histological image. The present disclosure involves the use of an existing CNN model specifically designed for histopathological images to generate a latent representation of the histological image component. An additional contribution to spatial gene expression prediction involves the use of graphs constructed based on the spatial proximity of gene expression spots. The present disclosure thus combines the image processing task (carried out by the CNN) with spatial information (leveraged by the GNN), yielding the desired results by super-resolving spatial gene expression.
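To make the graph construction step concrete, the following sketch shows one way such a KNN graph could be assembled: edges connect each spot to its K spatially nearest neighbours by Euclidean distance, and the 1024-dimensional CNN embeddings are attached as node features. The helper name, the value of K and the use of scikit-learn with PyTorch Geometric are assumptions for illustration, not details of the actual implementation.

```python
# Hedged sketch of KNN graph construction over gene expression spots.
# Assumed details: k=6, helper name, scikit-learn / PyTorch Geometric tooling.
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors
from torch_geometric.data import Data

def build_knn_graph(coords, embeddings, expression=None, k=6):
    """coords: (N, 2) spot positions; embeddings: (N, 1024) CNN features;
    expression: optional (N, G) known gene expression scores."""
    nbrs = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(coords)
    _, idx = nbrs.kneighbors(coords)              # idx[:, 0] is the spot itself
    src = np.repeat(np.arange(len(coords)), k)
    dst = idx[:, 1:].reshape(-1)                  # drop the self-neighbour
    edge_index = torch.tensor(np.vstack([src, dst]), dtype=torch.long)
    y = torch.tensor(expression, dtype=torch.float) if expression is not None else None
    return Data(x=torch.tensor(embeddings, dtype=torch.float),
                edge_index=edge_index, y=y)
```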
[048] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[049] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[050] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[051] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[052] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
1. A processor-implemented method (300) of identifying a gene expression map, the method comprising:
extracting (302), via one or more hardware processors, pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image having a gene expression;
creating (304), via the one or more hardware processors, a plurality of vector embeddings of the plurality of spot images using a pre-trained convolutional neural network (CNN) model;
learning (306), via the one or more hardware processors, by a pretrained machine learning model to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue by processing the plurality of spot images having gene expression and corresponding vector embeddings, wherein the machine learning model comprises a K-nearest neighbors (KNN) graph generation module and a graph convolutional networks (GCN) module; and
combining (308), via the one or more hardware processors, the spot images with gene expression and the plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map.
2. The method as claimed in claim 1, wherein each pixel represents a single cell in the histological image, and wherein the pixels surrounding the spot having gene expression are extracted to cover the neighboring cells influencing the gene expression.
3. The method as claimed in claim 1, wherein processing of the spot images having gene expression by the machine learning model comprises:
obtaining a K-nearest neighbor (KNN) graph by processing the plurality of spot images having gene expression and their corresponding vector embeddings in the KNN graph generation module, wherein each node of the KNN graph representing a spot is connected by an edge to K other closest nodes, and wherein the K other closest nodes are determined by a Euclidean distance;
learning a graph convolutional network (GCN) in the GCN module using the known gene expression scores of the nodes and the KNN graph;
re-computing the KNN graph by processing the plurality of additional nodes and their corresponding vector embeddings covering the entire tissue;
generating gene expression scores of the additional nodes by the trained GCN model using the re-computed KNN graph to obtain the gene expression map of the histological image.
4. The method as claimed in claim 1, wherein the CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having N x W x H x 3 dimensions and converts the spot images to embedding vectors with N x 1024 dimensions, wherein N is the number of spots, W is the width of each spot image, H is the height of each spot image, and 3 represents the number of channels corresponding to RGB.
5. The method as claimed in claim 1, wherein the gene expression map is utilized in estimating the mRNA count, which is indicative of cellular-level abnormalities.
6. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
extract pixels surrounding a plurality of spots from a histological image to obtain a plurality of spot images representing spatial locations in the histological image having a gene expression;
create a plurality of vector embeddings of the plurality of spot images using a pre-trained convolutional neural network (CNN) model;
learn, by a pretrained machine learning model, to predict gene expression at a plurality of additional spots devoid of gene expression measurements covering the entire tissue by processing the plurality of spot images having gene expression and corresponding vector embeddings, wherein the machine learning model comprises a K-nearest neighbors (KNN) graph generation module and a graph convolutional networks (GCN) module; and
combine the spot images with gene expression and the plurality of additional spot images associated with the plurality of additional spots devoid of gene expression to obtain a super-resolution gene expression map.
7. The system as claimed in claim 6, wherein each pixel represents a single cell in the histological image, and wherein the pixels surrounding the spot having gene expression are extracted to cover the neighboring cells influencing the gene expression.
8. The system as claimed in claim 6, wherein processing of the spot images having gene expression by the machine learning model comprises:
obtaining a K-nearest neighbor (KNN) graph by processing the plurality of spot images having gene expression and their corresponding vector embeddings in the KNN graph generation module, wherein each node of the KNN graph representing a spot is connected by an edge to K other closest nodes, and wherein the K other closest nodes are determined by a Euclidean distance;
learning a graph convolutional network (GCN) in the GCN module using the known gene expression scores of the nodes and the KNN graph;
re-computing the KNN graph by processing the plurality of additional nodes and their corresponding vector embeddings covering the entire tissue;
generating gene expression scores of the additional nodes by the trained GCN model using the re-computed KNN graph to obtain the gene expression map of the histological image.
9. The system as claimed in claim 6, wherein the CNN model generating the embedding vectors is a pre-trained KimiaNet model that receives the plurality of spot images in RGB format having N x W x H x 3 dimensions and converts the spot images to embedding vectors with N x 1024 dimensions, wherein N is the number of spots, W is the width of each spot image, H is the height of each spot image, and 3 represents the number of channels corresponding to RGB.
10. The system as claimed in claim 6, wherein the gene expression map is utilized in estimating the mRNA count, which is indicative of cellular-level abnormalities.