Abstract: A method and system for document embedding by obtaining average word embeddings is provided. Existing approaches for computing average word embeddings first compute word embeddings for all words in a document and then average the word embeddings to compute an average word embedding for the document, making it a resource- and time-inefficient approach. The method eliminates the need to compute word embeddings for all words in the document by introducing a frequency matrix comprising normalized frequency vectors that capture, with reference to words listed in a dictionary, the frequency of each word in each document of a plurality of documents in a batch. A matrix multiplication is performed between the frequency matrix and a vocabulary embedding matrix generated for the dictionary to directly provide an average embedding matrix. Each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of documents in the batch. [To be published with FIG. 1B]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR OBTAINING AVERAGE WORD
EMBEDDINGS
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD [001] The embodiments herein generally relate to the field of document embedding and, more particularly, to a method and system for document embedding by obtaining average embeddings for words in a document by matrix multiplication.
BACKGROUND [002] Natural language processing (NLP) is an important area of research. Training a machine learning (ML) model for tasks in the NLP domain requires a huge amount of training data. Resource-efficient and time-efficient approaches are required to translate this training data into a format that can be consumed by the ML models. In the NLP domain, word embeddings have a broad range of applications. Word embeddings are real-number vectors where one vector represents one word in a vocabulary. For many applications like document classification, a sentence, a paragraph or an entire document is represented as a vector, which is referred to as a document embedding. There are many approaches to document embedding. One approach is word averaging, which averages the word embeddings of all the words that make up the document. The average word embeddings approach is specifically useful for use cases like document classification and the like. Average word embeddings are required to train Deep Neural Networks (DNNs) for applications such as document classification. The average word embeddings, interchangeably referred to as average embeddings, are conventionally obtained by first computing word embeddings, which are then averaged to derive the average word embeddings. To generate these average embeddings, there are well known techniques that use the Keras Library™. Keras™ or similar techniques require that the input data be integer encoded, so that each unique word is represented by a unique integer. The input documents to be classified are then represented as sequences of numbers. The sentences are then padded/truncated to the same length, as a batch is trained, and a matrix representing the batch is generated. However, the truncation and padding approach to generating uniform-length word embeddings has its limitations. The truncation of sentences to a specified length affects the accuracy of each document embedding. Further, to generate the word embeddings, an embedding layer is used, which needs to be initialized. The embedding layer can be initialized to random values or to pre-trained embeddings. For the pre-trained embeddings, a dictionary of unique words and their embeddings has to be generated. This dictionary is passed to the embedding layer during initialization. With reference to the dictionary, the embedding layer outputs the embeddings of all the words in the batch. The number of computations required for such word embeddings is huge and can reach a million computations, wherein the tensor size is given by (batch_size * document_length * embedding_length). The reason for this high volume of computations is the need to compute an embedding for every word. Once the word embeddings are generated, an average embedding is obtained as the average of the embeddings of all words in a document.
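For illustration only, a minimal sketch of this conventional pipeline is given below, assuming TensorFlow/Keras; the vocabulary size, embedding dimension, document length and integer-encoded documents are hypothetical placeholders and are not taken from the disclosure. The sketch merely shows where the (batch_size * document_length * embedding_length) tensor arises before averaging.

```python
# Hedged sketch of the conventional Keras-style approach (illustrative values only).
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim, max_len = 10000, 256, 500            # hypothetical sizes
pretrained = np.random.rand(vocab_size, embedding_dim).astype("float32")  # stand-in for a real embedding dictionary

# Documents are integer encoded, then padded/truncated to one common length.
encoded_docs = [[12, 7, 531, 9], [44, 3], [87, 90, 90]]          # hypothetical integer-encoded batch
padded = tf.keras.preprocessing.sequence.pad_sequences(
    encoded_docs, maxlen=max_len, padding="post", truncating="post")

# The embedding layer materializes a (batch_size, document_length, embedding_length)
# tensor of per-word embeddings, which is only then averaged per document.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained)),
    tf.keras.layers.GlobalAveragePooling1D(),
])
avg_embeddings = model(padded)   # shape: (3, 256); note the padding zeros are also embedded and averaged
print(avg_embeddings.shape)
```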
[003] Practically, the embedding layer is necessary because, based on the use case, sometimes the embeddings of all words that make up the document are required and sometimes only the average of those word embeddings is required. However, it can be understood that for a use case where only averaging of the embeddings is in focus, using a built-in embedding layer such as that of the Keras Library™ is computationally expensive.
[004] Thus, with this conventional approach to generating average embeddings, reducing the number of computations remains a challenge and limitation, and hence time efficiency is low. Further, the huge number of computations generates a large volume of intermediate results, requiring more resources to record them.
SUMMARY [005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for obtaining average word embeddings is provided.
[006] The method comprises receiving a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average word embedding for each of the plurality of rows of data.
[007] Further the method comprises preprocessing the plurality of rows of data to determine a plurality of significant words and eliminate a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents.
[008] Furthermore the method comprises determining a frequency vector for each of the processed plurality of rows to obtain a plurality of frequency vectors and generate a frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors, a) wherein the frequency vector length is equal to a number of words in a dictionary created from the preprocessed plurality of rows of data, wherein words in the dictionary are arranged in a predefined sequence; b) wherein each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence; and c) wherein a value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding row if the word from the predefined sequence of words is present in the corresponding row;
[009] Further the method comprises normalizing each of the plurality of frequency vectors by scaling the value of each of the plurality of elements based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix.
[0010] Further the method comprises creating a vocabulary embedding matrix by generating word embeddings for the words in the dictionary based on the domain of interest.
[0011] Furthermore the method comprises determining an average embedding matrix by performing matrix multiplication of the normalized frequency
matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of rows of the document. The average embedding matrix is used as training data for a machine learning model for classification task. The vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words.
[0012] In another aspect, a system for obtaining average word embeddings is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average word embedding for each of the plurality of rows of data.
[0013] Further preprocess the plurality of rows of data to determine a plurality of significant words and eliminate a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents.
[0014] Furthermore determine a frequency vector for each of the processed plurality of rows to obtain a plurality of frequency vectors and generate a frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors, a) wherein the frequency vector length is equal to the number of words in a dictionary created from the preprocessed plurality of rows of data, wherein words in the dictionary are arranged in a predefined sequence; b) wherein each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence; and c) wherein a value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding row
if the word from the predefined sequence of words is present in the corresponding row.
[0015] Further normalize each of the plurality of frequency vectors by scaling the value of each of the plurality of elements based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix.
[0016] Further create a vocabulary embedding matrix by generating word embeddings for the words in the dictionary based on the domain of interest.
[0017] Furthermore determine an average embedding matrix by performing matrix multiplication of the normalized frequency matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of rows of the document. The average embedding matrix is used as training data for a machine learning model for classification task. The vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words.
[0018] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a method for obtaining average word embeddings. The method comprises receiving a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average word embedding for each of the plurality of rows of data.
[0019] Further the method comprises preprocessing the plurality of rows of data to determine a plurality of significant words and eliminate a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents.
[0020] Furthermore the method comprises determining a frequency vector for each of the processed plurality of rows to obtain a plurality of frequency vectors and generate a frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors, a) wherein the frequency vector length is equal to a number of words in a dictionary created from the preprocessed plurality of rows of
data, wherein words in the dictionary are arranged in a predefined sequence; b) wherein each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence; and c) wherein a value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding row if the word from the predefined sequence of words is present in the corresponding row.
[0021] Further the method comprises normalizing each of the plurality of frequency vectors by scaling the value of each of the plurality of elements based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix.
[0022] Further the method comprises creating a vocabulary embedding matrix by generating word embeddings for the words in the dictionary based on the domain of interest.
[0023] Furthermore the method comprises determining an average embedding matrix by performing matrix multiplication of the normalized frequency matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of rows of the document. The average embedding matrix is used as training data for a machine learning model for classification task. The vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words.
[0024] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[0026] FIG. 1A is a functional block diagram of a system for obtaining average word embeddings based on matrix multiplication approach, in accordance with some embodiments of the present disclosure.
[0027] FIG. 1B is an architectural overview of the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0028] FIG. 2A and FIG. 2B are flow diagrams illustrating a method for obtaining average word embeddings based on matrix multiplication approach, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0029] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS [0030] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[0031] Embodiments herein provide a method and system for obtaining average word embeddings based on matrix multiplication approach. Existing approaches for computing average word embeddings first compute word embeddings for all words in a document and then perform averaging of the word embeddings to compute an average word embedding, interchangeably referred to hereinafter as average embedding, for the document. The conventional approach is resource and time inefficient, as it requires a higher number of computations and consumes more resources in terms of memory size and computational power. The method disclosed herein eliminates the need to compute word embeddings for all words in the document by introducing a frequency matrix. The frequency matrix comprises normalized frequency vectors corresponding to each document among a plurality of documents in a batch to be processed. The normalized frequency vectors capture a) information on the presence or absence of a word in each document with reference to words listed in a dictionary (vocabulary) and b) the frequency of the word appearing in each corresponding document. Computation of the frequency vectors is a low-complexity, low-computation task. Further, a matrix multiplication is performed between the frequency matrix and a vocabulary embedding matrix generated for the dictionary using word embeddings for the words listed in the dictionary. Unlike the vocabulary or dictionary used by conventional techniques, which is more like a look-up table, the method disclosed generates the vocabulary embedding matrix, which serves as a trainable matrix. The resultant matrix directly provides the average embedding matrix. Each row of the average embedding matrix represents the average word embedding corresponding to each document among the plurality of documents in the batch. Thus, the method disclosed provides a time-efficient solution for computing average embeddings, wherein a considerable time reduction is obtained using the approach of matrix multiplication that directly provides average embeddings for the documents. Further, the method disclosed also provides a resource-efficient approach as, unlike the existing method, it does not require large memory to store a word embedding generated for every word of the document. The average embedding so obtained can then be used for any specific use case requiring training data in the form of average embeddings to train Machine Learning models. One of the example applications of the method disclosed is to generate average embeddings of patent documents to train an ML model for automated Cooperative Patent Classification (CPC) of patents.
[0032] Consider the example use case of generating training data for patent document classification into CPC codes. While training a machine learning model for classifying patent and non-patent literature into CPC codes, large volumes of text have to be analyzed. Combine that with the patent descriptions available for training, where there are very many rows for a given code. Many codes multiplied by many patents per code multiplied by many words in the patent description results in too much demand on the computing power. Thus, to resolve this technical problem, either expensive hardware is required, or the architecture needs to be modified to provide higher computational efficiency, as provided by the method disclosed herein.
[0033] Referring now to the drawings, and more particularly to FIGS. 1A through 2B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0034] FIG. 1A is a functional block diagram of a system 100 for obtaining average word embeddings based on matrix multiplication approach, in accordance with some embodiments of the present disclosure.
[0035] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
[0036] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors,
central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 is configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, a network cloud and the like.
[0037] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, a touch user interface (TUI) and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.
[0038] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0039] Further, the memory 102 may include a database 108, which may store the dictionary, the vocabulary embedding matrix, the frequency vectors, the frequency matrix and the like. Thus, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106. Functions of the components of system 100 are explained in conjunction with architecture overview of the system 100 in FIG. 1B and flow diagram of FIGS. 2A and 2B for document embedding to obtain average embeddings for documents.
[0040] FIG. 1B is an architectural overview of the system of FIG. 1, in accordance with some embodiments of the present disclosure. As depicted in the architecture, the system 100 eliminates the need to compute word embeddings for all words in the document by introducing the frequency vector that captures information on presence and absence of a word in a document based on words listed in a dictionary (vocabulary) and also captures frequency of the word appearing in the document.
[0041] In an example implementation, the system 100 utilizes a pre-trained Word2Vec™ model for the patent document classification use case referred to herein. Thus, the system 100 preprocesses the text as per the intricacies of the domain of the use case.
[0042] The frequency matrix is generated post preprocessing of the documents. The preprocessing determines a plurality of significant words and eliminates a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents.
[0043] As can be understood, computation of frequency vectors is a low-complexity, low-computation task, thereby reducing the major computational task of obtaining word embeddings for all words present in the document. A plurality of frequency vectors is computed for a plurality of documents in a batch under consideration. All the frequency vectors corresponding to all the documents in the batch are normalized and represented as a frequency matrix.
[0044] Simultaneously, the dictionary is created comprising all words from all the documents, and the vocabulary embedding matrix is generated for the dictionary words by computing a word embedding for every word in the dictionary based on the domain of interest. Creating the vocabulary embedding matrix is an essential step for generating embeddings, and it serves as the trainable matrix for the average embedding to be generated using the frequency matrix. Unlike the existing Keras™ embedding approach, where generating word embeddings for sentences of unequal lengths requires padding/truncating them to a uniform length for a matrix representation of the documents in a batch, the method disclosed herein uses the same vector length, namely the dictionary length, irrespective of the length of the sentence, eliminating the need for padding/truncating and the inaccuracies they introduce.
[0045] With the information of each document captured in the frequency matrix and the vocabulary embedding matrix capturing the information of the dictionary, the method provides a matrix multiplication based technique that directly computes average embeddings for all the documents without the need for generating word embeddings for all words in the document. A matrix multiplication is performed on the frequency matrix comprising the frequency vectors and the vocabulary embedding matrix to directly provide an average embedding matrix, wherein each row of the average embedding matrix represents the average embedding corresponding to each of the plurality of rows of the document. Thus, the method disclosed improves the resource efficiency and time efficiency of computing average embeddings by considerably saving computation time and resources. The generated average embedding can then be used for any specific use case that requires average embeddings as training data to train Machine Learning models. One of the example applications of the method disclosed is to generate average embeddings for patent applications to train an ML model for automated patent classification.
[0046] FIG. 2A and FIG. 2B are flow diagrams illustrating a method 200 for obtaining average word embeddings based on matrix multiplication approach, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0047] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of flow diagram as depicted in FIG. 2A and FIG. 2B. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and
techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[0048] Referring to the steps of the method 200, at step 202, the one or more hardware processors 104 are configured to receive a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average embedding for each of the plurality of rows of data.
[0049] Table 1 below provides practical example documents in a batch for generating average word embeddings for patent documents P1, P2, P3 …Pn associated with respective CPC classifications, as depicted in Table 1 below. Each row corresponds to a document in the batch. The content of each row may be referred to as the description associated with each patent.
TABLE 1
Patent number | Patent classification | Patent text
P1 | C1, C2 | ….
P2 | C2 | …
P3 | C2, C3, C4 | ….
… | … | …
[0050] At step 204, the one or more hardware processors 104 are configured to preprocess the plurality of rows of data to determine a plurality of significant words and eliminate the non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarizes each of the plurality of
documents. An illustrative example of documents and the corresponding preprocessed documents is provided below. Preprocessing comprises:
1. Convert the text to lowercase.
2. Tokenize the text (break large text into individual words).
3. Remove stop words (examples: that, having, being, after).
4. Lemmatize the tokens to convert every word to its base form (for example, rocks to rock, better to good, and corpora to corpus).
[0051] Example 1 below refers to a paragraph, where each line is to be considered a separate document received in the batch:
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep
[0052] After preprocessing, the documents are represented as below: [{'description': ['wood', 'lovely', 'dark', 'deep']}, {'description': ['promise', 'keep']}, {'description': ['mile', 'sleep']}, {'description': ['mile', 'sleep']}]
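A minimal preprocessing sketch that yields approximately the structure shown above is given below; it assumes NLTK for tokenization, stop-word removal and lemmatization, which is only one possible choice and not necessarily the toolkit used by the disclosed system, and NLTK's default English stop-word list does not drop every word removed in the example (e.g., 'go').

```python
# Hedged preprocessing sketch (assumes NLTK; the actual stop-word list/lemmatizer may differ).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time resource downloads (newer NLTK versions may also need 'punkt_tab' / 'omw-1.4').
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(document: str) -> dict:
    """Lowercase, tokenize, keep significant alphabetic tokens, and lemmatize to base form."""
    tokens = word_tokenize(document.lower())
    significant = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return {"description": [LEMMATIZER.lemmatize(t) for t in significant]}

batch = [
    "The woods are lovely, dark and deep,",
    "But I have promises to keep,",
    "And miles to go before I sleep,",
    "And miles to go before I sleep",
]
preprocessed = [preprocess(doc) for doc in batch]
print(preprocessed)   # close to the representation shown above; 'go' survives NLTK's default stop-word list
```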
[0053] Once the documents are preprocessed, at step 206, the one or more hardware processors 104 are configured to determine a frequency vector for each of the processed plurality of rows, alternatively referred to as processed documents, to obtain a plurality of frequency vectors. Further, the frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors is generated.
A) The frequency vector length is equal to a number of words in a dictionary created from the preprocessed plurality of rows of data, wherein words in the dictionary are arranged in a predefined sequence. The illustrative dictionary for Example 1 is provided below.
[0054] Dictionary or Vocabulary with predefined sequence of the words for example 1: ['wood', 'promise', 'keep', 'deep', 'dark', 'sleep', 'mile', 'lovely’]
B) Each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence.
C) A value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding row if the word from the predefined sequence of words is present in the corresponding row.
[0055] The frequency vector for the first preprocessed document ['wood', 'lovely', 'dark', 'deep'] is [1 0 0 1 1 0 0 1], wherein the number of non-zero values is ‘4’.
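Step 206 can be sketched as below for Example 1; the variable names and the explicit dictionary ordering are illustrative and simply mirror the predefined sequence given in paragraph [0054].

```python
# Hedged sketch: build the dictionary and the (num_docs x dictionary_length) frequency matrix.
import numpy as np

preprocessed = [
    {"description": ["wood", "lovely", "dark", "deep"]},
    {"description": ["promise", "keep"]},
    {"description": ["mile", "sleep"]},
    {"description": ["mile", "sleep"]},
]

# Dictionary of unique words in a fixed, predefined sequence (order here mirrors the example).
dictionary = ["wood", "promise", "keep", "deep", "dark", "sleep", "mile", "lovely"]
word_to_index = {word: i for i, word in enumerate(dictionary)}

# Element j of a row counts how often dictionary word j occurs in that document (0 if absent).
frequency_matrix = np.zeros((len(preprocessed), len(dictionary)), dtype=np.float32)
for row, doc in enumerate(preprocessed):
    for word in doc["description"]:
        frequency_matrix[row, word_to_index[word]] += 1.0

print(frequency_matrix[0])   # [1. 0. 0. 1. 1. 0. 0. 1.]
```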
[0056] At step 208, the one or more hardware processors 104 are configured to normalize each of the plurality of frequency vectors by scaling the value of each of the plurality of elements of the frequency vector based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix. The normalization provides the averaging effect. As can be understood from the illustrative example, (x1 * 0.25 + x2 * 0.25 + x3 * 0.25 + x4 * 0.25) is the average of x1, x2, x3 and x4.
[0057] An illustrative normalized frequency vector for Example 1 above is provided below, wherein the values of the elements of the first frequency vector are scaled down by the number of non-zero elements, ‘4’. For the subsequent rows, the values are scaled down by 2, since there are only two words in each row after preprocessing.
Normalized Frequency Vectors:
[{'description': array([0.25, 0. , 0. , 0.25, 0.25, 0. , 0. , 0.25], dtype=float32)} {'description': array([0. , 0.5, 0.5, 0. , 0. , 0. , 0. , 0. ], dtype=float32)}, {'description': array([0. , 0. , 0. , 0. , 0. , 0.5, 0.5, 0. ], dtype=float32)}, {'description': array([0. , 0. , 0. , 0. , 0. , 0.5, 0.5, 0. ], dtype=float32)}]
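Continuing the same sketch, the normalization of step 208 divides each row by its count of non-zero elements, which reproduces the arrays shown above:

```python
# Hedged sketch of step 208: scale each row by its number of non-zero elements.
nonzero_counts = np.count_nonzero(frequency_matrix, axis=1, keepdims=True).astype(np.float32)  # [[4.], [2.], [2.], [2.]]
normalized_frequency_matrix = frequency_matrix / nonzero_counts

print(normalized_frequency_matrix[0])   # [0.25 0.   0.   0.25 0.25 0.   0.   0.25]
```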
[0058] At step 210, the one or more hardware processors 104 are configured to create a vocabulary embedding matrix by generating word embeddings for the words in the dictionary. The vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words based on the domain of interest, such as the patent documents example referred to herein.
[0059] For example, the embedding for ‘wood’ is as follows:
array([-4.08935547e-02, 1.43737793e-02, -2.92358398e-02, 2.27203369e-02,
5.73425293e-02, 1.09024048e-02, 1.69658661e-03, 9.86328125e-02,
5.63964844e-02, -6.32324219e-02, 2.41088867e-02, -9.28955078e-02,
-2.07519531e-01, 8.76617432e-03, -1.20162964e-02, -1.40502930e-01,
3.58276367e-02, 2.87475586e-02, 8.09326172e-02, -4.74243164e-02,
-2.46429443e-02, -2.48718262e-02, 5.33294678e-03, -3.68347168e-02,
2.46582031e-02, 5.31921387e-02, 3.85856628e-03, -2.06604004e-02,
-3.92761230e-02, 1.49536133e-02, -5.71594238e-02, -6.95800781e-02,
-1.15478516e-01, -6.25000000e-02, -7.91015625e-02, -2.12524414e-01,
1.66473389e-02, -6.35986328e-02, -1.22909546e-02, -8.06274414e-02,
6.19888306e-03, -3.44276428e-03, -1.25885010e-02, 7.25097656e-02,
1.31347656e-01, -5.24139404e-03, -4.31213379e-02, 4.56237793e-02,
-6.16149902e-02, 7.81860352e-02, 1.96075439e-02, 1.00479126e-02,
-5.98754883e-02, -3.15856934e-03, -6.71386719e-02, 3.03497314e-02,
-5.83190918e-02, 2.02369690e-03, -6.28280640e-03, -6.81152344e-02,
2.92510986e-02, 1.78813934e-03, -7.56835938e-03, -5.40161133e-02,
-1.57012939e-02, -1.97143555e-02, -4.00695801e-02, 6.01501465e-02,
8.16650391e-02, 9.99755859e-02, -1.20056152e-01, 5.53588867e-02,
2.44445801e-02, -4.28390503e-03, -1.35009766e-01, 5.05065918e-02,
-4.49218750e-02, -3.91845703e-02, -7.27539062e-02, -2.06604004e-02,
-3.45764160e-02, 4.42123413e-03, 2.42156982e-02, 1.11267090e-01,
1.67388916e-02, 1.71508789e-02, 2.97851562e-02, -2.04467773e-02,
1.65557861e-02, -1.35009766e-01, -5.23071289e-02, 1.36566162e-02,
1.86767578e-02, -6.83593750e-02, 1.90734863e-02, 1.51977539e-02,
1.61285400e-02, -1.23977661e-02, -2.25830078e-02, 6.49414062e-02,
1.15783691e-01, 7.65991211e-02, -8.64868164e-02, 1.08886719e-01,
-4.12597656e-02, 2.56958008e-02, 1.54418945e-01, 7.29560852e-04,
2.45971680e-02, -8.87451172e-02, 4.31442261e-03, -1.04919434e-01,
1.16348267e-02, -3.28979492e-02, 7.78961182e-03, 7.37380981e-03,
2.01873779e-02, -6.84814453e-02, -3.81469727e-02, 8.21533203e-02,
-7.27539062e-02, 6.51245117e-02, 3.11584473e-02, -2.76336670e-02,
-7.19070435e-04, -6.18896484e-02, -3.18527222e-03, -2.15301514e-02,
5.46264648e-02, -8.45336914e-03, 1.94244385e-02, 2.26135254e-02,
-1.33972168e-02, -5.67016602e-02, 9.79003906e-02, -2.08892822e-02,
-2.13623047e-02, -6.98852539e-02, -3.99780273e-02, 2.22969055e-03,
-6.11572266e-02, -9.42382812e-02, 1.13067627e-02, 4.43267822e-03,
-4.44641113e-02, -3.40881348e-02, 1.38549805e-01, 3.52783203e-02,
4.51965332e-02, -1.92260742e-02, -5.45654297e-02, 6.20117188e-02,
-1.68823242e-01, -1.03637695e-01, -8.82720947e-03, -2.07672119e-02,
1.87993050e-04, -1.07543945e-01, -5.13916016e-02, -2.09960938e-02,
8.16345215e-03, 1.45416260e-02, 1.06811523e-02, -2.33154297e-02,
1.59759521e-02, -9.11865234e-02, 3.73229980e-02, 6.50634766e-02,
-8.83178711e-02, -2.77557373e-02, 5.65185547e-02, 3.84521484e-02,
-7.64770508e-02, 7.39669800e-03, 3.88183594e-02, -1.47338867e-01,
-6.76269531e-02, -9.74121094e-02, 1.22192383e-01, -8.01391602e-02,
4.10156250e-02, -9.56420898e-02, 1.29776001e-02, 4.23889160e-02,
-8.64868164e-02, 1.58538818e-02, -9.62066650e-03, -4.30908203e-02,
1.18865967e-02, -1.19934082e-01, 7.37304688e-02, -1.68151855e-02,
7.28149414e-02, -3.04718018e-02, 1.87072754e-02, 1.01501465e-01,
-1.27105713e-02, -1.09481812e-02, 3.61022949e-02, 6.79931641e-02,
-6.03027344e-02, 5.79833984e-02, -2.93273926e-02, 1.03637695e-01,
2.09197998e-02, 4.23278809e-02, -1.17370605e-01, 2.68859863e-02,
-1.55639648e-02, -1.65863037e-02, 1.46728516e-01, -5.10025024e-03,
7.74383545e-03, -1.19201660e-01, 6.65893555e-02, 1.29394531e-01,
1.10855103e-02, -1.66168213e-02, 5.88989258e-02, 7.59887695e-02,
4.10156250e-02, -8.63647461e-02, 7.65380859e-02, -4.69360352e-02,
7.54394531e-02, -4.81872559e-02, 3.97949219e-02, 4.48913574e-02,
-5.15441895e-02, 3.66821289e-02, -2.05802917e-03, 1.25244141e-01,
-1.06750488e-01, 4.90417480e-02, 8.94165039e-03, -3.00598145e-02,
2.20184326e-02, -8.11767578e-02, -1.81732178e-02, 2.26135254e-02,
2.86254883e-02, -8.65173340e-03, 1.74865723e-02, -1.17919922e-01,
-8.89282227e-02, 8.91113281e-02, 2.58483887e-02, -5.70373535e-02,
2.64282227e-02, 1.41677856e-02, 1.67083740e-03, 1.96380615e-02,
-1.58843994e-02, -3.11584473e-02, -7.91015625e-02, -7.35473633e-02],
dtype=float32)
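One way to assemble the vocabulary embedding matrix of step 210 is sketched below, continuing the dictionary from the earlier sketch; it assumes gensim's KeyedVectors with a pre-trained Word2Vec file, consistent with the Word2Vec™ model mentioned in paragraph [0041], though the file path and the handling of out-of-vocabulary words are placeholders.

```python
# Hedged sketch of step 210: one row of pre-trained embedding per dictionary word.
import numpy as np
from gensim.models import KeyedVectors

# The path and the pre-trained model are placeholders for whatever domain model is used.
word_vectors = KeyedVectors.load_word2vec_format("pretrained_word2vec.bin", binary=True)
embedding_dim = word_vectors.vector_size

vocabulary_embedding_matrix = np.zeros((len(dictionary), embedding_dim), dtype=np.float32)
for i, word in enumerate(dictionary):
    if word in word_vectors:                       # out-of-vocabulary words keep a zero row here
        vocabulary_embedding_matrix[i] = word_vectors[word]
```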
[0060] At step 212, the one or more hardware processors 104 are configured to determine an average embedding matrix by performing matrix multiplication of the normalized frequency matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average embedding corresponding to each of the plurality of documents in the batch.
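A minimal NumPy sketch of step 212 is given below, continuing the earlier variables; the final check illustrates that, for this example where every word occurs once, each row of the result equals the plain average of the embeddings of the words in that document.

```python
# Hedged sketch of step 212: a single matrix multiplication yields all average embeddings.
# Shapes: (num_docs x dict_len) @ (dict_len x embedding_dim) -> (num_docs x embedding_dim)
average_embedding_matrix = normalized_frequency_matrix @ vocabulary_embedding_matrix

# Sanity check for the first document ['wood', 'lovely', 'dark', 'deep'] (each word occurs once):
words = ["wood", "lovely", "dark", "deep"]
expected = np.mean([vocabulary_embedding_matrix[word_to_index[w]] for w in words], axis=0)
assert np.allclose(average_embedding_matrix[0], expected, atol=1e-6)
```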
[0061] Once the average embedding matrix is obtained, it is used as training data for a machine learning model for classification tasks such as classification of patent documents into CPC classes. In an embodiment, the method disclosed herein is implemented using TensorFlow graphs™, without using Keras™.
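Since the paragraph above mentions an implementation using TensorFlow graphs without Keras, a minimal hedged sketch of that variant follows, continuing the earlier variables; making the vocabulary embedding matrix a tf.Variable is what renders it trainable, while the normalized frequency matrix stays constant for a given batch. The names are illustrative and not taken from the disclosure.

```python
# Hedged sketch: trainable vocabulary embedding matrix inside a TensorFlow graph (no Keras layers).
import tensorflow as tf

freq = tf.constant(normalized_frequency_matrix)        # fixed for a given batch
vocab_emb = tf.Variable(vocabulary_embedding_matrix)   # trainable; updated by the optimizer during classification

@tf.function
def average_embeddings(frequency_matrix):
    # (num_docs x dict_len) @ (dict_len x embedding_dim) -> one average embedding per document
    return tf.matmul(frequency_matrix, vocab_emb)

doc_embeddings = average_embeddings(freq)              # feeds the downstream classifier
```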
[0062] TEST RESULTS:
TABLE 2
# | Method | Embedding Trained | Vocabulary | Time per epoch | F-measure
1 | Matrix multiplication | No | Full | 1.5 seconds | 71.6
2 | Keras Embedding | No | Full | 1 hour 7 minutes | incomplete
3 | Matrix multiplication | Yes | Full | 2.7 minutes | 77.9
4 | Keras Embedding | Yes | Full | 1 hour 9 minutes | incomplete
5 | Keras Embedding | Yes | 500 | 6 minutes | 43.9
6 | Keras Embedding | No | 500 | 37 seconds | 65.1
7 | Keras Embedding | No | 50000 | 12 minutes | 56.5
[0063] In TABLE 2, for the column ‘Method’, ‘Keras Embedding’ refers to the state-of-the-art method of using the embedding layer that comes with Keras™, and ‘Matrix multiplication’ refers to the method disclosed herein. The ‘Embedding Trained’ column indicates whether the pre-trained word embeddings are further trained during classification. For the ‘Vocabulary’ column, ‘Full’ indicates that the word length of the longest document was considered for Keras Embedding, and if a number is indicated, for example ‘500’, then only the first 500 words of every document are considered for embedding and training. The method disclosed herein, which utilizes the matrix multiplication approach, takes the full document. In the ‘F-measure’ column, ‘incomplete’ indicates that the experiment could not be completed to tabulate the F-score because the limits of the hardware resources used for experimentation were reached.
[0064] It can be observed from the test results that the method disclosed herein is 2680 times faster than the state of the art when the embeddings were not trained, and 26 times faster than the state of the art when the embeddings were trained. The F-measures for the method disclosed (Matrix multiplication) are also better than those of the state of the art within the limits of the existing hardware. The increase in speed obtained using the method disclosed can be attributed to fewer computations and a lower demand on memory, and the better F-measure can be attributed to the ability to consider more data because of the increased speed.
[0065] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0066] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means
like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0067] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0068] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be
noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0069] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0070] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method (200) for obtaining average word embeddings, the method comprising:
receiving (202), by one or more hardware processors, a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average word embedding for each of the plurality of rows of data;
preprocessing (204), by the one or more hardware processors, the plurality of rows of data to determine a plurality of significant words and eliminate a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents;
determining (206), by the one or more hardware processors, a frequency vector for each of the processed plurality of rows to obtain a plurality of frequency vectors and generate a frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors,
wherein the frequency vector length is equal to a number of words in a dictionary created from the preprocessed plurality of rows of data, wherein words in the dictionary are arranged in a predefined sequence,
wherein each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence, and
wherein a value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding
row if the word from the predefined sequence of words is present in
the corresponding row;
normalizing (208), by the one or more hardware processors, each of the plurality of frequency vectors by scaling the value of each of the plurality of elements based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix;
creating (210), by the one or more hardware processors, a vocabulary embedding matrix by generating word embeddings for the words in the dictionary for a domain of interest; and
determining (212), by the one or more hardware processors, an average embedding matrix by performing matrix multiplication of the normalized frequency matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of rows.
2. The method as claimed in claim 1, wherein the average embedding matrix is used to train a machine learning model for classification task.
3. The method as claimed in claim 1, wherein the vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words.
4. A system (100) for obtaining average word embeddings, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the
one or more I/O interfaces (106), wherein the one or more hardware
processors (104) are configured by the instructions to:
receive a plurality of rows of data corresponding to a plurality of documents to be embedded for obtaining an average word embedding for each of the plurality of rows of data;
preprocess the plurality of rows of data to determine a plurality of significant words and eliminate a plurality of non-significant words in each of the plurality of rows of data, wherein the plurality of significant words summarize each of the plurality of documents;
determine a frequency vector for each of the processed plurality of rows to obtain a plurality of frequency vectors and generate a frequency matrix comprising matrix rows equal to the plurality of rows of data and matrix columns equal to a frequency vector length of each of the plurality of frequency vectors,
wherein the frequency vector length is equal to a number of words in a dictionary created from the preprocessed plurality of rows of data, wherein words in the dictionary are arranged in a predefined sequence;
wherein each of a plurality of elements in each of the plurality of frequency vectors corresponds to a word in the dictionary having one to one mapping with the predefined sequence; and
wherein a value of each of the plurality of elements of each of the plurality of frequency vectors is a) ‘0’ if a corresponding word from the predefined sequence of words is missing from a corresponding row among the plurality of rows of data; and b) is equal to number of times the word has occurred in the corresponding row if the word from the predefined sequence of words is present in the corresponding row;
normalize each of the plurality of frequency vectors by scaling the value of each of the plurality of elements based on a total number of non-zero elements present in a corresponding row of the frequency matrix to generate a normalized frequency vector matrix;
create a vocabulary embedding matrix by generating word embeddings for the words in the dictionary for a domain of interest; and
determine an average embedding matrix by performing matrix multiplication of the normalized frequency matrix and the vocabulary embedding matrix, wherein each row of the average embedding matrix represents the average word embedding corresponding to each of the plurality of rows.
5. The system as claimed in claim 4, wherein the average embedding matrix is used to train a machine learning model for classification task.
6. The system as claimed in claim 4, wherein the vocabulary embedding matrix is generated using pretrained embeddings for each of the plurality of unique words.