
System And Method For Retrofitting Synthetic Words From Multilingual Data For Knowledge Base Generation

Abstract: The present disclosure provides a system and a method for retrofitting synthetic words from multilingual data for knowledge base generation. The system utilizes context-level modelling, character level modelling, and language model retrofitting to generate one or more trained embeddings. The trained embeddings are used to determine out-of-vocabulary (OOV) synthetic bilingual tokens in their vector space to enhance knowledge entities. Further, the trained embeddings are utilized for sentiment analysis, and query understanding for words obtained from real world scenarios.


Patent Information

Application #:
Filing Date: 29 April 2022
Publication Number: 44/2023
Publication Type: INA
Invention Field: PHYSICS
Status:
Parent Application:

Applicants

JIO PLATFORMS LIMITED
Office-101, Saffron, Nr. Centre Point, Panchwati 5 Rasta, Ambawadi, Ahmedabad - 380006, Gujarat, India.

Inventors

1. JOSHI, Prasad Pradip
Bungalow #34 & 35, ‘Pratisaad’, Meadow gate CHS, Lodha Heaven, Palava, Dombivli - 421204, Maharashtra, India.
2. GUPTA, Naman
64-A, Panchwati Colony, Airport Road, Bhopal - 462030, Madhya Pradesh, India.
3. CHEMUDUPATI, Rajiv
189 Skylite Vesta, NH207, Sarjapur, Anekal Taluk, Bangalore - 562125, Karnataka, India.

Specification

DESC:RESERVATION OF RIGHTS
[0001] A portion of the disclosure of this patent document contains material that is subject to intellectual property rights, such as, but not limited to, copyright, design, trademark, integrated circuit (IC) layout design, and/or trade dress protection, belonging to Jio Platforms Limited (JPL) or its affiliates (hereinafter referred to as the owner). The owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights whatsoever. All rights to such intellectual property are fully reserved by the owner.

FIELD OF INVENTION
[0002] The embodiments of the present disclosure generally relate to systems and methods for natural language understanding (NLU) and language models. More particularly, the present disclosure relates to a system and a method for retrofitting synthetic words from multilingual data for a knowledge base to enable the generation of a robust natural language processing (NLP) pipeline.

BACKGROUND OF INVENTION
[0003] The following description of the related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section is used only to enhance the understanding of the reader with respect to the present disclosure, and not as an admission of prior art.
[0004] Indian languages have rich sub-word structures, where newly coined words are constantly used for communication. For example, many such words follow a pattern of a Hindi prefix (root form) and an English suffix (verb form), drawn from a gamut of Hinglish words. Multiple social media platforms involve the use of Hinglish words and other newly coined words by diverse users. Once detected, extraction or association of such words with a proper English, Hindi, or Hinglish word vector requires both contextual and sub-word-level understanding.
[0005] With the advent of Internet technology, social media platform engagement has increased manifold. The auto-discovery of characters segregated by language into root/verb forms has become a challenging task for various language models. Further, language models are unable to process out-of-vocabulary (OOV) tokens associated with such rich sub-word structures.
[0006] There is, therefore, a need in the art to provide a system and a method that can mitigate the problems associated with the prior art.

OBJECTS OF THE INVENTION
[0007] Some of the objects of the present disclosure, which at least one embodiment herein satisfies are listed herein below.
[0008] It is an object of the present disclosure to provide a system and a method that facilitates identification of a synthetic mixture of word vectors from two or more languages obtained via customer reviews, tweets, and posts on social media.
[0009] It is an object of the present disclosure to provide a system and a method that utilizes vocabulary vectors along with a combination of character level, phonetic vectors, and synthetic words to identify an equivalent English vocable.
[0010] It is an object of the present disclosure to provide a system and a method that generates a weighted score over character level embeddings to generate more intuitive character context level embeddings for a better understanding of a phrase.
[0011] It is an object of the present disclosure to provide a system and a method that enables vocabulary token segregation, token mapping, and token disambiguation to provide a better character/phonetic level understanding.
[0012] It is an object of the present disclosure to provide a system and a method that merges synthetic tokens with universal vocabulary tokens.
[0013] It is an object of the present disclosure to provide a system and a method that retrieves textual content which could otherwise be treated as an out-of-vocabulary (OOV) word, thereby retaining useful information.
[0014] It is an object of the present disclosure to provide a system and a method that facilitates efficient context mapping based on word disambiguation.
[0015] It is an object of the present disclosure to provide a system and a method that facilitates behavioural pattern identification for users and identification of personalized vocabulary.

SUMMARY
[0016] This section is provided to introduce certain objects and aspects of the present disclosure in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
[0017] In an aspect, the present disclosure relates to a system for generating one or more trained embeddings. The system may include a processor operatively coupled with a memory that stores instructions to be executed by the processor. The processor may receive one or more data parameters from one or more computing devices via a network. A user may operate the one or more computing devices. The received one or more data parameters may be based on one or more data sources. The processor may extract the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources. The processor may encode the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters. The processor may simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters. The processor may retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.
[0018] In an embodiment, the one or more data sources may include at least one of a social media, a customer review, and a user sentiment.
[0019] In an embodiment, the knowledge base may include one or more universal tokens based on an urban vocabulary.
[0020] In an embodiment, the processor may be configured to extract the one or more data parameters via at least one of a lexical processing, a syntactic processing, and a semantic processing.
[0021] In an embodiment, the processor may be configured to utilize an application programming interface (API) for the encoding of the generated one or more tokens.
[0022] In an embodiment, the plurality of predefined parameters may include at least one of a character level model and a concept level model.
[0023] In an embodiment, the processor may be configured to utilize the API and the knowledge base to train the at least translated word and the at least transliterated word.
[0024] In an embodiment, the pre-trained model may include a language model to enable the generation of the one or more trained embeddings.
[0025] In an embodiment, the one or more trained embeddings may include at least an out-of-vocabulary (OOV) synthetic bilingual token in an associated vector space.
[0026] In an aspect, the present disclosure relates to a method for generating one or more trained embeddings. The method may include receiving, by a processor, one or more data parameters from a user via one or more computing devices. The received one or more data parameters may be based on one or more data sources. The method may include extracting, by the processor, the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources. The method may include encoding, by the processor, the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters. The method may include simultaneously training, by the processor, the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters. The method may include retrofitting, by the processor, the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.
[0027] In an embodiment, the method may include extracting, by the processor, the one or more data parameters via at least one of a lexical processing, a syntactic processing, and a semantic processing.
[0028] In an embodiment, the method may include utilizing, by the processor, an application programming interface (API) for the encoding of the generated one or more tokens.
[0029] In an embodiment, the plurality of predefined parameters may include at least one of a character level model and a concept level model.
[0030] In an embodiment, the method may include utilizing, by the processor, the API and the knowledge base to train the at least translated word and the at least transliterated word.
[0031] In an embodiment, the pre-trained model may include a language model to enable the generation of the one or more trained embeddings.
[0032] In an embodiment, the one or more trained embeddings may include at least an out-of-vocabulary (OOV) synthetic bilingual token in an associated vector space.
[0033] In an aspect, the present disclosure relates to a user equipment (UE) for generating one or more trained embeddings. The UE may include one or more processors communicatively coupled to a processor in a system. The one or more processors may be coupled with a memory. The memory may store instructions to be executed by the one or more processors and may cause the one or more processors to transmit one or more data parameters to the processor via a network. The processor may be configured to receive the one or more data parameters from the UE. The received one or more data parameters may be based on one or more data sources. The processor may extract the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources. The processor may encode the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters. The processor may simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters. The processor may retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.

BRIEF DESCRIPTION OF DRAWINGS
[0034] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems, in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes the disclosure of electrical components, electronic components, or circuitry commonly used to implement such components.
[0035] FIG. 1 illustrates an exemplary network architecture (100) of a proposed system (110), in accordance with an embodiment of the present disclosure.
[0036] FIG. 2 illustrates an exemplary block diagram (200) of a proposed system (110), in accordance with an embodiment of the present disclosure.
[0037] FIGs. 3A-3B illustrate an exemplary process (300) of a retrofit multilingual synthetic knowledge-aware pipeline, in accordance with embodiments of the present disclosure.
[0038] FIG. 4 illustrates an exemplary block diagram (400) of an application programming interface (API) with translation and transliteration, in accordance with an embodiment of the present disclosure.
[0039] FIG. 5 illustrates an exemplary computer system (500) in which or with which embodiments of the present disclosure may be implemented.
[0040] The foregoing shall be more apparent from the following more detailed description of the disclosure.

BRIEF DESCRIPTION OF THE INVENTION
[0041] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0042] The ensuing description provides exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosure as set forth.
[0043] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments.
[0044] Also, it is noted that individual embodiments may be described as a process that is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
[0045] The word “exemplary” and/or “demonstrative” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” and/or “demonstrative” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements.
[0046] Reference throughout this specification to “one embodiment” or “an embodiment” or “an instance” or “one instance” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
[0047] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0048] The various embodiments throughout the disclosure will be explained in more detail with reference to FIGs. 1-5.
[0049] FIG. 1 illustrates an exemplary network architecture (100) of a proposed system (110), in accordance with an embodiment of the present disclosure.
[0050] As illustrated in FIG. 1, the network architecture (100) may include a system (110). The system (110) may be connected to one or more computing devices (104-1, 104-2…104-N) via a network (106). The one or more computing devices (104-1, 104-2…104-N) may be interchangeably specified as a user equipment (UE) (104) and be operated by one or more users (102-1, 102-2...102-N). Further, the one or more users (102-1, 102-2…102-N) may be interchangeably referred to as a user (102) or users (102). In an embodiment, the system (110) may generate one or more trained embeddings based on one or more data parameters provided by the user (102).
[0051] In an embodiment, the computing devices (104) may include, but not be limited to, a mobile, a laptop, etc. Further, the computing devices (104) may include a smartphone, virtual reality (VR) devices, augmented reality (AR) devices, a general-purpose computer, desktop, personal digital assistant, tablet computer, and a mainframe computer. Additionally, input devices for receiving input from a user (102) such as a touch pad, touch-enabled screen, electronic pen, and the like may be used. A person of ordinary skill in the art will appreciate that the computing devices (104) may not be restricted to the mentioned devices and various other devices may be used.
[0052] In an embodiment, the network (106) may include, by way of example but not limitation, at least a portion of one or more networks having one or more nodes that transmit, receive, forward, generate, buffer, store, route, switch, process, or a combination thereof, etc. one or more messages, packets, signals, waves, voltage or current levels, some combination thereof, or so forth. The network (106) may also include, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof.
[0053] In an embodiment, the system (110) may receive the one or more data parameters from the user (102). The data parameters may be based on one or more data sources such as but not limited to a social media, a customer review, and a user sentiment.
[0054] In an embodiment, the system (110) may combine context-level modelling, character-level modelling, and language model retrofitting to generate the one or more trained embeddings. The generated one or more trained embeddings may be utilized by multiple business use cases such as sentiment analysis and query understanding for determining a weighted score over character level embeddings. Further, the system (110) may generate more intuitive “character context” level embeddings for a better understanding of a phrase from the one or more data sources.
[0055] In an exemplary embodiment, Hinglish words such as “padhing” (studying) or “kheloing” (playing) are commonly found in tweets and posts on social media platforms and blogs. These words have no meaning on their own but are formed by combining a Hindi root word, such as “khelo”, with the English suffix (verb form) “-ing” to generate synthetic out-of-vocabulary (OOV) words.
[0056] In an exemplary embodiment, words like “study” and “padhai” may be extracted from the Hinglish-English concept-net model and retrofitted over a pre-trained language model to bring the words into a close vector plane. However, the “-ing” context in “kheloing” may require a deep understanding of character-level context and vowel mixing on bilingual corpora. Thus, character-level models such as fastText may bring “khelo” and “kheloing” into the same hyperplane with the root word “khelo”, which may then be mapped to the English equivalent extracted from the concept-net model (e.g., “study” for “padhai”).
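The sub-word intuition above can be illustrated with a minimal sketch: a character n-gram decomposition in the style of subword models such as fastText. The n-gram sizes and the words compared below are illustrative assumptions, not the actual model used by the system.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return the set of character n-grams of a word, with boundary
    markers, in the style of subword embedding models."""
    w = f"<{word}>"
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

# "khelo" and "kheloing" share most of their n-grams, so a subword
# model places them in the same neighbourhood of the vector space,
# while an unrelated word such as "study" shares almost none.
print(ngram_overlap("khelo", "kheloing") > ngram_overlap("study", "kheloing"))  # True
```

Because the synthetic token inherits most of its n-grams from its Hindi root, a character-level model can anchor it near that root even though the full token never appeared in training data.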
[0057] In an embodiment, the system (110) may extract the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources. The knowledge base may include one or more universal tokens based on an urban vocabulary. The urban vocabulary may include, but not be limited to, semantic representations of Hindi and English words used for communication purposes. Further, the system (110) may sanitize the one or more data parameters via a lexical processing method, a syntactic processing method, or a semantic processing method. Sanitization may include data cleaning such as stop-word removal, tokenization, and phonetic character (vowel) extraction.
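A minimal sketch of such a sanitization step, assuming an illustrative stop-word list and treating vowels as the phonetic characters of interest (both are placeholders, not the system's actual resources):

```python
import re

# Illustrative placeholders for the knowledge-base resources.
STOP_WORDS = {"i", "am", "the", "is", "a", "this"}
VOWELS = set("aeiou")

def sanitize(text):
    """Tokenize, drop stop words, and extract vowels per token."""
    tokens = re.findall(r"[a-z]+", text.lower())            # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    vowels = [[c for c in t if c in VOWELS] for t in tokens]  # phonetic extraction
    return tokens, vowels

tokens, vowels = sanitize("I am jhelofying this match")
print(tokens)  # ['jhelofying', 'match']
```

The surviving tokens (here the synthetic token “jhelofying”) are what the downstream encoding and retrofitting stages operate on.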
[0058] In an embodiment, a translator application programming interface (API) may be utilized by the system (110) to encode the one or more tokens into a Hindi form (translated form) and a Hinglish word (transliterated form).
[0059] In an embodiment, aggregated features from the API and the knowledge base may be combined and fed into a bi-model (incorporating character-level and concept-level models) to train character-phonetic (vowel) level and context-level embeddings. The trained bi-model may be retrofitted (i.e., newly discovered embeddings are trained jointly with universally recognized embeddings) with a pre-trained language model. Hence, synonym and antonym pairs, and local word embeddings, may be generated as part of the retrofitting pipeline. The retrofitted model may understand all synthetic tokens and modulate the synthetic tokens into recognized universal tokens.
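As a hedged illustration of the retrofitting step, the sketch below implements a simple graph-based retrofitting update in the spirit of widely used retrofitting methods: each embedding is iteratively pulled toward its knowledge-base neighbours (e.g., concept-net synonym links) while staying anchored to its pre-trained value. The toy vectors, the lexicon, and the weighting parameter `alpha` are assumptions for illustration only.

```python
def retrofit(vectors, lexicon, alpha=1.0, iterations=10):
    """Pull each word's vector toward its lexicon neighbours while
    keeping it anchored to its original (pre-trained) value."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if not nbrs:
                continue
            for d in range(len(new[word])):
                # weighted average of the original vector and neighbours
                total = alpha * vectors[word][d] + sum(new[n][d] for n in nbrs)
                new[word][d] = total / (alpha + len(nbrs))
    return new

vecs = {"study": [1.0, 0.0], "padhai": [0.0, 1.0]}
lex = {"padhai": ["study"]}  # concept-net style synonym link
out = retrofit(vecs, lex)
# "padhai" moves toward "study" while "study" keeps its pre-trained value
print(out["padhai"])  # [0.5, 0.5]
```

In this toy run the synonym link drags “padhai” halfway toward “study”, which is the qualitative effect the pipeline relies on to place synthetic bilingual tokens near their universal equivalents.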
[0060] In an embodiment, the system (110) may encode the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters. Further, the system (110) may simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters. The plurality of predefined parameters may include at least one of a character level model and a concept level model. Word-description pairs (including the at least translated word and the at least transliterated word) may be trained simultaneously using word concepts and character n-gram models to obtain a better understanding at the lexical and syntactic levels (from the character model) and at the semantic and linguistic levels (from the concept-based model).
[0061] In an embodiment, the system (110) may retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens. The pre-trained model utilized by the system (110) may include a language model (comprising Hindi and English language) to enable the generation of the one or more trained embeddings. The one or more trained embeddings may be used to determine OOV synthetic bilingual tokens in their vector space by various knowledge entities.
[0062] In an exemplary embodiment, the system (110) may accurately retrofit the extracted newly coined and certain OOV words into the right vector space and the neighbourhood of other valid vocabulary words. For example, some newly coined words may follow a pattern of a Hindi prefix (root form) and an English suffix (verb form), drawn from a gamut of Hinglish words. Once detected, extraction or association of such words with a proper English, Hindi, or Hinglish word vector may involve both contextual and sub-word-level understanding.
[0063] In an embodiment, the system (110) may extract a plurality of abstract tokens from an urban dictionary of Hindi/English words, and segregate Hindi-English contexts from the whole word form. Finally, a bootstrapped embedding with the one or more trained embeddings may generate natural language processing (NLP) pipelines that work robustly in real-world scenarios.
[0064] In an embodiment, huge streams of data may be obtained, and business use cases may leverage Hinglish vocabulary-based retrofitting pipelines. Business use cases may include, but not be limited to, sentiment analysis, query understanding, and multi-lingual phrase understanding, which are explained hereinafter.
[0065] Sentiment Analysis: Social platforms may receive huge loads of user data filled with customer reviews of products or services, comments on user profiles, and the like. Such user data may require end-to-end analysis. For example, the phrase “I am jhelofying this match” may have a negative sentiment associated with it. The word “jhelofying” comes from “jhelo” (a word in the Hindi context), which means to endure (a negative impact), and may be easily extracted using the retrofitting pipelines.
[0066] Query Understanding: Internet technologies have grown immensely over the years. Businesses are digitized, and demand for information retrieval and search engines has grown exponentially. As most people speak diversified dialects, there is a need for search engines that understand mixed character codes of various languages in a well-differentiated manner. For example, searching “gulabish top” on an electronic commerce platform may retrieve results for pink, cherry, or similar-coloured tops (apparel). The above pipeline may leverage retrofitted embeddings to appropriately identify the context and results associated with multi-lingual search queries.
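A minimal sketch of how retrofitted embeddings could serve such a query, assuming toy two-dimensional vectors in which “gulabish” has already been retrofitted near “pink” (“gulabi” being Hindi for pink); the vectors and vocabulary are illustrative placeholders:

```python
import math

# Toy retrofitted embedding table (placeholder values).
embeddings = {
    "gulabish": [0.9, 0.1],  # retrofitted near "pink"
    "pink":     [1.0, 0.0],
    "blue":     [0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query, candidates):
    """Return the candidate whose embedding is closest to the query token."""
    return max(candidates, key=lambda c: cosine(embeddings[query], embeddings[c]))

print(nearest("gulabish", ["pink", "blue"]))  # pink
```

The mixed-code token thus resolves to an in-vocabulary colour term, letting the search engine retrieve pink apparel for the query “gulabish top”.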
[0067] Multi-lingual Phrase Understanding: Diversified dialects and the word-vowel intermixing associated with various dialects may be complex in nature. Words and phrases such as “vach” in Marathi (meaning “read”), “comedy wala show”, “action wali movie”, and “embroidery wali dress” (embroidered dress) may be some classic anecdotal cases where retrieval of information may be difficult.
[0068] Although FIG. 1 shows exemplary components of the network architecture (100), in other embodiments, the network architecture (100) may include fewer components, different components, differently arranged components, or additional functional components than depicted in FIG. 1. Additionally, or alternatively, one or more components of the network architecture (100) may perform functions described as being performed by one or more other components of the network architecture (100).
[0069] FIG. 2 illustrates an exemplary block diagram (200) of a proposed system (110), in accordance with an embodiment of the present disclosure.
[0070] Referring to FIG. 2, the system (110) may comprise one or more processor(s) (202) that may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the one or more processor(s) (202) may be configured to fetch and execute computer-readable instructions stored in a memory (204) of the system (110). The memory (204) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory (204) may comprise any non-transitory storage device including, for example, volatile memory such as random-access memory (RAM), or non-volatile memory such as erasable programmable read only memory (EPROM), flash memory, and the like.
[0071] In an embodiment, the system (110) may include an interface(s) (206). The interface(s) (206) may comprise a variety of interfaces, for example, interfaces for data input and output (I/O) devices, storage devices, and the like. The interface(s) (206) may also provide a communication pathway for one or more components of the system (110). Examples of such components include, but are not limited to, processing engine(s) (208), a database (210), a data ingestion engine (212), a context engine (214), a character level engine (216), and a retrofitting engine (218).
[0072] The processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (208) may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208). In such examples, the system (110) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system (110) and the processing resource. In other examples, the processing engine(s) (208) may be implemented by electronic circuitry.
[0073] In an embodiment, the processor (202) may receive one or more data parameters from one or more computing devices (104) via a network (106). The processor (202) may store the received one or more data parameters in the database (210). In an embodiment, the received one or more data parameters may be based on one or more data sources. The one or more data sources may include, but not be limited to, a social media, a customer review, and a user sentiment.
[0074] In an embodiment, the processor (202) may extract the received one or more data parameters via the data ingestion engine (212) to generate one or more tokens based on a knowledge base associated with the one or more data sources. The knowledge base may be configured in the character level engine (216) and include one or more universal tokens based on an urban vocabulary. The processor (202) may extract one or more data parameters via at least one of a lexical processing, a syntactic processing, and a semantic processing configured in the context engine (214).
[0075] In an embodiment, the processor (202) may encode the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters. The processor (202) may simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters. The plurality of predefined parameters may include, but not be limited to, a character level model and a concept level model.
[0076] In an embodiment, the processor (202) may retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens. The processor (202) may be configured with the retrofitting engine (218) to retrofit the trained at least translated word and the trained at least transliterated word. The pre-trained model may include a language model to enable the generation of the one or more trained embeddings. The one or more trained embeddings may include at least an OOV synthetic bilingual token in an associated vector space.
[0077] In an embodiment, the processor (202) may be configured to utilize an API for the encoding of the generated one or more tokens. Further, the processor (202) may be configured to utilize the API and the knowledge base to generate the trained at least translated word and the trained at least transliterated word.
[0078] FIGs. 3A-3B illustrate an exemplary process (300) of a retrofit multilingual synthetic knowledge aware pipeline, in accordance with embodiments of the present disclosure.
[0079] As illustrated in FIGs. 3A and 3B, an ingestion service (304, 306, 308) may receive data from various data sources (302-1, 302-2…302-N). Data sources may include, but not be limited to, social media, customer reviews, user sentiments, etc. Data may be stored in a centralized knowledge base (312) for processing. Data may be further passed to a data cleaning and data transformation module (310) for lexical, syntactic, and semantic processing, where data cleaning steps such as stop-word removal, tokenization, and phonetic character (vowel) extraction may be performed. The sanitized data/generated tokens may be stored in the data transformation module (310).
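For illustration, the data cleaning steps described above (tokenization, stop-word removal, and character n-gram generation) can be sketched as follows. This is a minimal Python sketch; the stop-word list and function names are illustrative only and not part of the disclosed system:

```python
import re

# Illustrative stop-word list; a production system would use a fuller,
# language-specific set.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}

def tokenize(text: str) -> list:
    """Lowercase the text and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def remove_stop_words(tokens: list) -> list:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def char_ngrams(token: str, n: int = 3) -> list:
    """Character n-grams with boundary markers, as used by subword models."""
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

tokens = remove_stop_words(tokenize("The khushbu of the perfume is amazing"))
# tokens == ["khushbu", "perfume", "amazing"]
```

The boundary markers (`<` and `>`) let the character model distinguish prefixes and suffixes from word-internal n-grams.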
[0080] The generated tokens may be detected by a knowledge concept stream (314) that may include universal tokens based on an urban vocabulary. The synset definitions of these tokens may be retrieved by a concept model (316) in various contexts, as the concept model (316) may handle multiple word meanings when preparing data for language disambiguation tasks. The concept model (316) may include concept model ensembles, bi-directional encoder representations from transformers (BERT), T5, synset definitions, word meanings, and antonyms/synonyms, etc.
[0081] In an embodiment, the aggregated features from the concept-net API (316) and the knowledge concept stream (314) may be integrated to form a bi-model. The bi-model may include a concept model (318) and a character model (320). The concept model (318) may include synset definitions and concept vectors. Further, the concept model (318) may include language models, BERT, long short-term memory (LSTM), T5, and ensembles. The concept model (318) may include model evaluation with edit distance, mean absolute error (MAE), mean squared error (MSE), and log-loss. The character model (320) may include language models, BERT, LSTM, T5, and ensembles. Further, the character model (320) may include Soundex features, a phonetic score, and a Hamming distance.
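The character-level features named above (Soundex codes and Hamming distance) can be illustrated with a simplified sketch. This is a standard-style Soundex encoding, not the disclosure's specific implementation; words are assumed to be non-empty and Latin-script:

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first, prev = word[0].upper(), codes.get(word[0], "")
    digits = []
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels and repeated codes
            digits.append(code)
        if ch not in "hw":          # h/w do not reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Phonetically equivalent transliterations collide on Soundex:
# soundex("khushbu") == soundex("khushboo") == "K210"
```

Such phonetic collisions are what allow spelling variants of the same transliterated word to be recognized as one token.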
[0082] The trained bi-model may be sent to an attract-repel queuing for retrofitting pipeline module (324) via a token stream (322). Here, the trained bi-model may be retrofitted with a pre-trained language model, whereby synonym pairs, antonym pairs, and local word embeddings may be generated as part of the retrofitting pipeline. The retrofitted model may understand all synthetic tokens and generate well-recognized universal tokens.
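An attract-repel style update over synonym and antonym pairs can be sketched as follows. This is a minimal illustration under assumed hyperparameters (learning rate, margin, regularization strength are arbitrary), not the disclosed implementation:

```python
import numpy as np

def attract_repel(emb, synonyms, antonyms, lr=0.1, margin=0.6,
                  reg=0.05, epochs=50):
    """Pull synonym pairs together, push antonym pairs apart, and
    regularize each vector toward its original position."""
    orig = {w: v.copy() for w, v in emb.items()}
    for _ in range(epochs):
        for a, b in synonyms:                 # attract: reduce distance
            diff = emb[a] - emb[b]
            emb[a] = emb[a] - lr * diff
            emb[b] = emb[b] + lr * diff
        for a, b in antonyms:                 # repel: enforce a margin
            diff = emb[a] - emb[b]
            if np.linalg.norm(diff) < margin:
                emb[a] = emb[a] + lr * diff
                emb[b] = emb[b] - lr * diff
        for w in emb:                         # stay close to original space
            emb[w] = emb[w] - reg * (emb[w] - orig[w])
    return emb

emb = {"fragrance": np.array([1.0, 0.0]),
       "khushbu":   np.array([0.0, 1.0]),
       "stench":    np.array([1.0, 0.1])}
d_before = np.linalg.norm(emb["fragrance"] - emb["khushbu"])
emb = attract_repel(emb, synonyms=[("fragrance", "khushbu")],
                    antonyms=[("fragrance", "stench")])
d_after = np.linalg.norm(emb["fragrance"] - emb["khushbu"])
# the synonym pair ends up closer together: d_after < d_before
```

The regularization term is what makes this "retrofitting" rather than retraining: the vectors move only as far from the pre-trained space as the lexical constraints require.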
[0083] Further, output from the attract-repel queuing for retrofitting pipeline module (324) may be sent to a global training pipeline (326), where synthetic bilingual word vectors may be generated. The bilingual word vectors may be utilized for sentiment analysis, query understanding, etc. The global training pipeline (326) may include the pre-trained model with a GloVe model, BERT, word2vec, ensembles, and T5. The global training pipeline (326) may also include nugget retrofitting with lambda tuning, batch size, and epochs. The global training pipeline (326) may include model specification with token translation, an attention layer, and character phonetics. Additionally, the global training pipeline (326) may include evaluation metrics such as MAE and log-loss. Further, feedback from the global training pipeline (326) may be sent to the concept model (318) and the character model (320) for further analysis.
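As a sketch of how the bilingual word vectors might serve query understanding, a cosine-similarity nearest-neighbour lookup over the retrofitted vocabulary could look like the following. The toy vectors and the `nearest` helper are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def nearest(query: str, emb: dict, k: int = 2) -> list:
    """Rank the other vocabulary words by cosine similarity to the query."""
    q = emb[query]
    def cos(v):
        return float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
    return sorted((w for w in emb if w != query),
                  key=lambda w: -cos(emb[w]))[:k]

# Toy retrofitted vectors: the Hindi transliteration sits near its
# English translation in the shared space.
vocab = {"khushbu":   np.array([1.0, 0.0]),
         "fragrance": np.array([0.9, 0.1]),
         "network":   np.array([0.0, 1.0])}
# nearest("khushbu", vocab)[0] == "fragrance"
```

A query containing the synthetic token "khushbu" can thus be resolved to the universal token "fragrance" for downstream sentiment analysis.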
[0084] In an embodiment, the following steps may be utilized by the system (110) for generating the one or more trained embeddings.
[0085] In an embodiment, the ingestion service (304, 306, 308) may consume data from all data sources such as social media, customer reviews, user sentiments, etc.
[0086] Further, the data cleaning layer and data transformation (310) may receive data streams to tokenize, remove stopwords, generate n-grams, and handle missing data. The cleaned tokens may be mapped to their translated and transliterated forms using the concept model (316) or in-house word mappings. The tokens from all possible translated/transliterated forms may be used to retrieve brief descriptions of the words from the concept model (316). These word-description pairs may be simultaneously trained using word concept and character n-gram models to obtain a better understanding at the lexical and syntactic levels (from the character model) and at the semantic and linguistic levels (from the concept-based model). The generated models may be retrofitted using pre-trained embeddings trained on large universal corpora. Further, the trained embeddings may be used to generate OOV bilingual tokens in their vector space to enhance knowledge entities. The trained embeddings may be stored in a database for multiple business use cases such as sentiment analysis, query understanding, etc.
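One common way to place an OOV token in an existing vector space, consistent with the character n-gram modelling above, is to compose its vector from the vectors of its known character n-grams (a fastText-style composition). This is a hedged sketch; the n-gram table and function name are illustrative:

```python
import numpy as np

def oov_vector(word: str, ngram_emb: dict, n: int = 3) -> np.ndarray:
    """Mean of the vectors of the word's known character n-grams."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    known = [ngram_emb[g] for g in grams if g in ngram_emb]
    if not known:
        return np.zeros_like(next(iter(ngram_emb.values())))
    return np.mean(known, axis=0)

# Toy n-gram table: only two of "khushbu"'s trigrams are known.
ngram_emb = {"<kh": np.array([1.0, 0.0]),
             "khu": np.array([0.0, 1.0])}
vec = oov_vector("khushbu", ngram_emb)
```

Because the composition uses subword units, any spelling variant sharing n-grams with known tokens (e.g. "khushboo" vs. "khushbu") lands near them in the vector space.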
[0087] FIG. 4 illustrates an exemplary block diagram (400) of an API with translation and transliteration, in accordance with an embodiment of the present disclosure.
[0088] As illustrated in FIG. 4, a translator API (404) may be utilized to translate a word (402) into a translated form (406). The translated word (406) may then serve as the input word (408) to a transliterator API (410), which may be used to generate a transliterated word (412). For example, the word (402) “fragrance” may be translated by the translator API (404) into its translated Hindi form “khushbu” (406). Further, the word (408) “khushbu” may be transliterated via the transliterator API (410) into the transliterated word (412) “khushboo” or “khushbu”.
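The two-stage flow of FIG. 4 can be sketched with dictionary-backed stubs standing in for the translator and transliterator APIs. The lookup tables and function names are hypothetical; a real deployment would call external translation/transliteration services:

```python
# Hypothetical stand-ins for the translator API (404) and
# transliterator API (410) of FIG. 4.
TRANSLATIONS = {"fragrance": "khushbu"}                   # English -> Hindi (romanized)
TRANSLITERATIONS = {"khushbu": ["khushboo", "khushbu"]}   # spelling variants

def translate(word: str) -> str:
    """Return the translated form, or the word itself if unknown."""
    return TRANSLATIONS.get(word, word)

def transliterate(word: str) -> list:
    """Return all transliterated spelling variants of the word."""
    return TRANSLITERATIONS.get(word, [word])

variants = transliterate(translate("fragrance"))
# variants == ["khushboo", "khushbu"], matching the example in FIG. 4
```

Chaining the two calls yields every surface form under which the concept may appear in user-generated text.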
[0089] FIG. 5 illustrates an exemplary computer system (500) in which or with which the proposed system may be implemented, in accordance with an embodiment of the present disclosure.
[0090] As shown in FIG. 5, the computer system (500) may include an external storage device (510), a bus (520), a main memory (530), a read-only memory (540), a mass storage device (550), a communication port(s) (560), and a processor (570). A person skilled in the art will appreciate that the computer system (500) may include more than one processor and communication ports. The processor (570) may include various modules associated with embodiments of the present disclosure. The communication port(s) (560) may be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. The communication port(s) (560) may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system (500) connects.
[0091] In an embodiment, the main memory (530) may be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. The read-only memory (540) may be any static storage device(s) e.g., but not limited to, a Programmable Read Only Memory (PROM) chip for storing static information e.g., start-up or basic input/output system (BIOS) instructions for the processor (570). The mass storage device (550) may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces).
[0092] In an embodiment, the bus (520) may communicatively couple the processor (570) with the other memory, storage, and communication blocks. The bus (520) may be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), universal serial bus (USB), or the like, for connecting expansion cards, drives, and other subsystems, as well as other buses, such as a front-side bus (FSB), which connects the processor (570) to the computer system (500).
[0093] In another embodiment, operator and administrative interfaces, e.g., a display, keyboard, and cursor control device may also be coupled to the bus (520) to support direct operator interaction with the computer system (500). Other operator and administrative interfaces can be provided through network connections connected through the communication port(s) (560). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system (500) limit the scope of the present disclosure.
[0094] While considerable emphasis has been placed herein on the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the disclosure. These and other changes in the preferred embodiments of the disclosure will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be implemented merely as illustrative of the disclosure and not as a limitation.

ADVANTAGES OF THE INVENTION
[0095] The present disclosure provides a system and a method that facilitates identification of a synthetic mixture of word vectors from two or more languages obtained via customer reviews, tweets, and posts on social media.
[0096] The present disclosure provides a system and a method that utilizes vocabulary vectors along with a combination of character level, phonetic vectors, and synthetic words to identify an equivalent English vocabulary.
[0097] The present disclosure provides a system and a method that generates a weighted score over character level embeddings to generate more intuitive character context level embeddings for a better understanding of a phrase.
[0098] The present disclosure provides a system and a method that enables vocabulary token segregation, token mapping, and token disambiguation to provide a better character/phonetic level understanding.
[0099] The present disclosure provides a system and a method that merges synthetic tokens with universal vocabulary tokens.
[00100] The present disclosure provides a system and a method that retrieves textual content which could otherwise be treated as an out-of-vocabulary (OOV) word, thereby retaining useful information.
CLAIMS:
1. A system (110) for generating one or more trained embeddings, the system (110) comprising:
a processor (202); and
a memory (204) operatively coupled with the processor (202), wherein said memory (204) stores instructions which when executed by the processor (202) causes the processor (202) to:
receive one or more data parameters from one or more computing devices (104) via a network (106), wherein a user (102) operates the one or more computing devices (104), and wherein the received one or more data parameters are based on one or more data sources;
extract the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources;
encode the generated one or more tokens to generate at least a translated word and at least a transliterated word;
simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters; and
retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.
2. The system (110) as claimed in claim 1, wherein the one or more data sources comprise at least one of: a social media, a customer review, and a user sentiment.
3. The system (110) as claimed in claim 1, wherein the knowledge base comprises one or more universal tokens based on an urban vocabulary.

4. The system (110) as claimed in claim 1, wherein the processor (202) is configured to extract the one or more data parameters via at least one of: a lexical processing, a syntactic processing, and a semantic processing.
5. The system (110) as claimed in claim 1, wherein the processor (202) is configured to utilize an application programming interface (API) for the encoding of the generated one or more tokens.
6. The system (110) as claimed in claim 1, wherein the plurality of predefined parameters comprises at least one of: a character level model and a concept level model.
7. The system (110) as claimed in claim 5, wherein the processor (202) is configured to utilize the API and the knowledge base to train the at least translated word and the at least transliterated word.
8. The system (110) as claimed in claim 1, wherein the pre-trained model comprises a language model to enable the generation of the one or more trained embeddings.
9. The system (110) as claimed in claim 1, wherein the one or more trained embeddings comprise at least an out-of-vocabulary (OOV) synthetic bilingual token in an associated vector space.
10. A method for generating one or more trained embeddings, the method comprising:
receiving, by a processor (202), one or more data parameters from a user (102) via one or more computing devices (104), wherein the received one or more data parameters are based on one or more data sources;
extracting, by the processor (202), the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources;
encoding, by the processor (202), the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters;
simultaneously training, by the processor (202), the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters; and
retrofitting, by the processor (202), the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.
11. The method as claimed in claim 10, comprising extracting, by the processor (202), the one or more data parameters via at least one of: a lexical processing, a syntactic processing, and a semantic processing.
12. The method as claimed in claim 10, comprising utilizing, by the processor (202), an application programming interface (API) for the encoding of the generated one or more tokens.
13. The method as claimed in claim 10, wherein the plurality of predefined parameters comprises at least one of: a character level model and a concept level model.
14. The method as claimed in claim 12, comprising utilizing, by the processor (202), the API and the knowledge base to train the at least translated word and the at least transliterated word.
15. The method as claimed in claim 10, wherein the pre-trained model comprises a language model to enable the generation of the one or more trained embeddings.
16. The method as claimed in claim 10, wherein the one or more trained embeddings comprise at least an out-of-vocabulary (OOV) synthetic bilingual token in an associated vector space.
17. A user equipment (UE) (104) for generating one or more trained embeddings, the UE (104) comprising:
one or more processors communicatively coupled to a processor (202) in a system (110), wherein the one or more processors are coupled with a memory, and wherein said memory stores instructions which when executed by the one or more processors causes the one or more processors to:
transmit one or more data parameters to the processor (202) via a network (106),
wherein the processor (202) is configured to:
receive the one or more data parameters from the UE (104), wherein the received one or more data parameters are based on one or more data sources;
extract the received one or more data parameters to generate one or more tokens based on a knowledge base associated with the one or more data sources;
encode the generated one or more tokens to generate at least a translated word and at least a transliterated word based on the extracted one or more data parameters;
simultaneously train the at least translated word and the at least transliterated word based on the encoded one or more tokens and a plurality of predefined parameters; and
retrofit the trained at least translated word and the trained at least transliterated word with a pre-trained model to enable the generation of the one or more trained embeddings associated with the encoded one or more tokens.
