Abstract: A domain-specific Neural Machine Translation (NMT) model can provide improved performance. However, a domain-specific parallel corpus is not always accessible. Iterative Back-Translation can be used to fine-tune an NMT model for a domain even when only a monolingual domain corpus is available. The quality of the synthetic parallel corpora, in terms of closeness to in-domain sentences, can play an important role in the performance of the translation model. Recent works involve filtering at different stages of back-translation and weighting the sentences; however, these may lack consistent performance. Embodiments of the present disclosure provide a system and method that implement translation of sentences and a filtering approach based on a domain classifier to curate synthetic parallel data. The synthetic parallel data is then used to fine-tune trained NMT models, thereby improving translation accuracy for domain adaptation of NMT models. [To be published with FIG. 3]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
CLASSIFIER AUGMENTED FILTERED ITERATIVE BACK
TRANSLATION FOR DOMAIN ADAPTATION OF NEURAL
MACHINE TRANSLATION MODELS
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD [001] The disclosure herein generally relates to domain adaptation for neural machine translation (NMT) models, and, more particularly, to classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models.
BACKGROUND [002] Neural Machine Translation (NMT) is a method for machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. Conventionally, NMT systems have relied heavily on the availability of parallel corpora to produce good quality translations. However, even for high resource language pairs, in-domain parallel corpora are scarce. One application of NMT is a service desk automation system in which tickets are raised. These tickets are raised in different languages, while the service desk automation system works only for the English language. Though attempts have conventionally been made to build NMT models, these are not capable of translating sensitive words. For instance, publicly available NMT models, or NMT models trained on publicly available corpora, are not able to translate sensitive words which are domain specific.
SUMMARY [003] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models. The method comprises obtaining, via one or more hardware processors, an input comprising (i) an out-of-domain parallel corpora, (ii) an in-domain monolingual corpora corresponding to a source language, and (iii) an in-domain monolingual corpora corresponding to a target language; training, via the one or more hardware processors, a first Neural Machine Translation (NMT) model
using the out-of-domain parallel corpora to obtain a first trained NMT model, wherein the first trained NMT model is based on translation of a first set of sentences from the source language to the target language; training, via the one or more hardware processors, a second Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a second trained NMT model, wherein the second trained NMT model is based on translation of a second set of sentences from the target language to the source language; iteratively performing, until a convergence of the first trained NMT model and the second trained NMT model, is obtained: translating, via the one or more hardware processors, one or more sentences from the in-domain monolingual corpora in the target language and in-domain monolingual corpora in the source language using the second trained NMT model and the first trained NMT model respectively to obtain (i) one or more source based translated sentences, and (ii) one or more target based translated sentences; applying, via the one or more hardware processors, (i) a first classifier-based filtering model on the one or more source based translated sentences, and (ii) a second classifier-based filtering model on the one or more target based translated sentences to obtain a set of source based filtered sentences and a set of target based filtered sentences respectively; generating, via the one or more hardware processors, (i) a first synthetic parallel data based on the set of source based filtered sentences and the in-domain monolingual corpora corresponding to the target language, and (ii) a second synthetic parallel data based on the set of target based filtered sentences and the in-domain monolingual corpora corresponding to the source language; and fine-tuning, via the one or more hardware processors, (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data.
[004] In an embodiment, the convergence of the first trained NMT model and the second trained NMT model, is obtained based on a comparison of (i) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration.
[005] In an embodiment, the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network.
[006] In an embodiment, each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier. In an embodiment, the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences.
[007] In an embodiment, the method further comprises predicting, via the first classifier-based filtering model and the second classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences of being an in-domain sentence or an out-of-domain sentence; performing a comparison of (a) the probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence and (b) a pre-defined threshold; identifying (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as the in-domain sentence or the out-of-domain sentence based on the comparison.
[008] In an embodiment, the pre-defined threshold is based on the in-domain monolingual corpora corresponding to the source language, the in-domain monolingual corpora corresponding to the target language, an out-of-domain corpora corresponding to the source language and the target language, respectively.
[009] In another aspect, there is provided a system for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: obtain an input comprising (i) an out-of-domain parallel corpora, (ii) an in-domain monolingual
corpora corresponding to a source language, and (iii) an in-domain monolingual corpora corresponding to a target language; train a first Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a first trained NMT model, wherein the first trained NMT model is based on translation of a first set of sentences from the source language to the target language; train a second Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a second trained NMT model, wherein the second trained NMT model is based on translation of a second set of sentences from the target language to the source language; iteratively perform, until a convergence of the first trained NMT model and the second trained NMT model, is obtained: translating one or more sentences from the in-domain monolingual corpora in the target language and in-domain monolingual corpora in the source language using the second trained NMT model and the first trained NMT model respectively to obtain (i) one or more source based translated sentences, and (ii) one or more target based translated sentences; apply (i) a first classifier-based filtering model on the one or more source based translated sentences, and (ii) a second classifier-based filtering model on the one or more target based translated sentences to obtain a set of source based filtered sentences and a set of target based filtered sentences respectively; generate (i) a first synthetic parallel data based on the set of source based filtered sentences and the in-domain monolingual corpora corresponding to the target language, and (ii) a second synthetic parallel data based on the set of target based filtered sentences and the in-domain monolingual corpora corresponding to the source language; and fine-tune (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data.
[010] In an embodiment, the convergence of the first trained NMT model and the second trained NMT model, is obtained based on a comparison of (i) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration.
[011] In an embodiment, the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network.
[012] In an embodiment, each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier. In an embodiment, the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences.
[013] In an embodiment, the one or more hardware processors are further configured by the instructions to predict, via the first classifier-based filtering model and the second classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences of being an in-domain sentence or an out-of-domain sentence; perform a comparison of (a) the probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence and (b) a pre-defined threshold; and identify (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as the in-domain sentence or the out-of-domain sentence based on the comparison.
[014] In an embodiment, the pre-defined threshold is based on the in-domain monolingual corpora corresponding to the source language, the in-domain monolingual corpora corresponding to the target language, an out-of-domain corpora corresponding to the source language and the target language respectively.
[015] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause a method for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models by: obtaining, via the one or more hardware processors, an input comprising (i) an out-of-domain parallel corpora, (ii) an in-domain monolingual corpora corresponding to a source language, and (iii) an
in-domain monolingual corpora corresponding to a target language; training, via the one or more hardware processors, a first Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a first trained NMT model, wherein the first trained NMT model is based on translation of a first set of sentences from the source language to the target language; training, via the one or more hardware processors, a second Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a second trained NMT model, wherein the second trained NMT model is based on translation of a second set of sentences from the target language to the source language; iteratively performing, until a convergence of the first trained NMT model and the second trained NMT model, is obtained: translating, via the one or more hardware processors, one or more sentences from the in-domain monolingual corpora in the target language and in-domain monolingual corpora in the source language using the second trained NMT model and the first trained NMT model respectively to obtain (i) one or more source based translated sentences, and (ii) one or more target based translated sentences; applying, via the one or more hardware processors, (i) a first classifier-based filtering model on the one or more source based translated sentences, and (ii) a second classifier-based filtering model on the one or more target based translated sentences to obtain a set of source based filtered sentences and a set of target based filtered sentences respectively; generating, via the one or more hardware processors, (i) a first synthetic parallel data based on the set of source based filtered sentences and the in-domain monolingual corpora corresponding to the target language, and (ii) a second synthetic parallel data based on the set of target based filtered sentences and the in-domain monolingual corpora corresponding to a source language; and fine-tuning, via the one or more hardware processors, (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data.
[016] In an embodiment, the convergence of the first trained NMT model and the second trained NMT model, is obtained based on a comparison of (i) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output
associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration.
[017] In an embodiment, the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network.
[018] In an embodiment, each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier. In an embodiment, the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences.
[019] In an embodiment, the method further comprises predicting, via the first classifier-based filtering model and the second classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences of being an in-domain sentence or an out-of-domain sentence; performing a comparison of (a) the probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence and (b) a pre-defined threshold; identifying (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as the in-domain sentence or the out-of-domain sentence based on the comparison.
[020] In an embodiment, the pre-defined threshold is based on the in-domain monolingual corpora corresponding to the source language, the in-domain monolingual corpora corresponding to the target language, an out-of-domain corpora corresponding to the source language and the target language, respectively.
[021] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[022] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[023] FIG. 1 depicts an exemplary system for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models, in accordance with an embodiment of the present disclosure.
[024] FIG. 2A depicts an exemplary block diagram of the system illustrating training of a base neural machine translation (NMT) model as implemented by the system of FIG. 1, in accordance with an embodiment of the present disclosure.
[025] FIG. 2B depicts an exemplary block diagram of the system illustrating training of a classifier-based filtering model as implemented by the system of FIG. 1, in accordance with an embodiment of the present disclosure.
[026] FIG. 2C depicts an exemplary block diagram of the system illustrating classifier augmented filtered iterative back-translation for domain adaptation of NMT models, in accordance with an embodiment of the present disclosure.
[027] FIG. 3 depicts an exemplary flow chart illustrating a method for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models, using the system of FIGS. 1 and 2, in accordance with an embodiment of the present disclosure.
[028] FIGS. 4A and 4B depict graphical representations illustrating selection of a pre-defined threshold for a specific domain, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS [029] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are
described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[030] Neural Machine Translation (NMT) is a method for machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. Conventionally, NMT systems have relied heavily on the availability of parallel corpora to produce good quality translations. However, even for high resource language pairs, in-domain parallel corpora are scarce.
[031] In the present disclosure, a system and method are provided that are built on top of existing data centric approaches for domain adaptation (Chu and Wang, 2018 – refer “A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319, Santa Fe, New Mexico, USA. Association for Computational Linguistics.”), i.e., Back-Translation (BT) (Sennrich et al., 2016a – refer “Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics”) and Iterative Back-Translation (IBT) (Hoang et al., 2018 – refer “Iterative backtranslation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24, Melbourne, Australia. Association for Computational Linguistics.”). IBT is a variant of BT which leverages both source-side and target-side monolingual corpora along with the out-of-domain parallel corpora and trains NMTs→t and NMTt→s in an alternating fashion until convergence, where NMTs→t generates the synthetic parallel corpora for NMTt→s and vice versa.
[032] The performance of the NMT is influenced by the quality of the synthetic parallel corpora, as noted in the literature (e.g., refer Poncelas et al. (2018); Fadaee and Monz (2018)). Hence, for the domain adaptation task, the literature (e.g., refer “Dynamic data selection and weighting for iterative back-translation”) has proposed a curriculum-based approach (DDSWIBT) for sentence selection from the in-domain monolingual corpora and uses Junczys-Dowmunt (2018) (e.g., refer “Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Belgium, Brussels. Association for Computational Linguistics”) for weight assignment to the synthetic parallel corpora. In initial iterations of IBT, DDSWIBT prefers simple sentences over representative in-domain sentences; in later iterations, it uses more representative sentences as compared to simple sentences. Meanwhile, Imankulova et al. (2017) use an in-domain language model (sent-LM) and a “Round-Trip BLEU” score for filtering the synthetic parallel corpora.
[033] In the present disclosure, embodiments herein provide a system and method that implement “classifier augmented filtered iterative back-translation” (CFIBT) for the domain adaptation task. Two Convolutional Neural Network (CNN) based binary classifiers are trained, one in the source language and the other in the target language, on a combination of in-domain and out-of-domain corpora. IBT is used for synthetic parallel corpora generation, and classifier-based filtering is applied to remove sentence pairs in which the synthetic sentence does not belong to the domain. Neither sentence selection over the monolingual corpora nor a weighting mechanism for the synthetic corpora is employed, and no “Round-Trip” criterion is used for scoring the synthetic parallel corpora.
[034] More specifically, the present disclosure provides domain adaptation results for the German (de) - English (en) language pair on three different domains - Medical, Law and IT - under low and high resource scenarios. In the low resource scenario, the proposed method CFIBT outperforms all the baselines in every domain. In the high resource scenario, CFIBT outperforms the baselines in most of the settings, and performs competitively with the best baseline results in the remaining settings.
[035] Moreover, the objective of the present disclosure is to improve the performance of the NMT model on in-domain sentences given out-of-domain parallel corpora and in-domain monolingual corpora in both the source and target languages, which is known as domain adaptation for NMT. Conventionally, approaches for domain adaptation (e.g., Chu and Wang, 2018) have been categorized into data centric and model centric. Data centric approaches for domain adaptation focus on the use of in-domain monolingual corpora (Zhang and Zong, 2016; Cheng et al., 2016), synthetic corpora (Sennrich et al., 2016a; Hoang et al., 2018; Hu et al., 2019), or parallel corpora (Luong and Manning, 2015; Chu et al., 2017) along with the out-of-domain parallel corpora. On the other hand, model centric approaches modify the NMT architecture to include domain information, i.e., domain tags (Britz et al., 2017) or domain embedding with word embeddings (Kobus et al., 2017), or assign higher weights to in-domain sentences as compared to out-of-domain sentences (Wang et al., 2017). In the present disclosure, the data centric approach is used for domain adaptation to generate synthetic parallel data via Iterative Back-Translation.
[036] Some works in the literature have proposed different instance weighting based approaches for domain adaptation in NLP, where in-domain instances are assigned more weight as compared to out-of-domain instances. Conventionally, in NMT, it is observed that noisy sentences in the synthetic parallel corpora can affect the performance of the translation model. To address this, a Round-Trip BLEU score between the authentic and synthetic versions of the same sentence has been used to filter noisy synthetic corpora. Further, a language model trained on monolingual data has been used to filter the noisy sentences, while other approaches have used a semantic similarity technique based on the sentence embeddings of the source and synthetic corpus to filter out noisy pairs. Instead of filtering the noisy sentences from the training data, a few other works (e.g., He et al. (2016); Zhang et al. (2018); Wang et al. (2019)) have assigned lower weights to them during model training, while others (e.g., Dou et al. (2020)) have used a variant of Moore and Lewis (2010) (e.g., refer “Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics.”) for data selection from in-domain monolingual corpora and use Junczys-Dowmunt (2018) for weight assignment to synthetic corpora generated by IBT. However, in the present disclosure, the whole in-domain monolingual data is used, and the noisy synthetic corpora are filtered with the help of a simple binary classifier which is trained on in-domain and out-of-domain corpora. The above lines of the present disclosure are better understood by way of the following description. More specifically, classifier augmented filtered iterative back-translation (CFIBT) is implemented by the present disclosure, wherein the system has access to out-of-domain parallel corpora and in-domain monolingual corpora in both source and target languages. Firstly, NMTs→t and NMTt→s are trained using the out-of-domain parallel corpora Dp. The trained NMT models, NMTs→t and NMTt→s, are used for translating in-domain sentences of the source language to the target language and vice versa. A classifier-based filtering model is then applied to these translated sentences to obtain source-based and target-based filtered sentences. The source-based and target-based filtered sentences, along with their corresponding in-domain monolingual sentences, are used to curate synthetic parallel data. Thereafter, NMTs→t and NMTt→s are fine-tuned on this synthetic parallel data. This entire process repeats until convergence. The NMT models NMTs→t and NMTt→s are considered to have converged when there is no improvement in either model when compared to its preceding iteration.
[037] Referring now to the drawings, and more particularly to FIG. 1 through 4B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[038] FIG. 1 depicts an exemplary system for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries,
and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[039] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[040] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic-random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises input received by the system 100. The input, for example, comprises, but is not limited to (i) an out-of-domain parallel corpora, (ii) an in-domain monolingual corpora corresponding to a source language, and (iii) an in-domain monolingual corpora corresponding to a target language.
[041] The database 108 further comprises various techniques (e.g., classifier-based filtering model(s)) which when executed perform one or more methodologies described herein by the present disclosure. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
[042] FIGS. 2A through 2C, with reference to FIG. 1, depict an exemplary block diagram of the system 100 for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models, in accordance with an embodiment of the present disclosure. More specifically, FIG. 2A, with reference to FIG. 1, depicts an exemplary block diagram of the system 100 illustrating training of a base neural machine translation (NMT) model as implemented by the system 100, in accordance with an embodiment of the present disclosure. FIG. 2B, with reference to FIG. 1 and 2A, depicts an exemplary block diagram of the system 100 illustrating training of a classifier-based filtering model as implemented by the system 100, in accordance with an embodiment of the present disclosure. FIG. 2C, with reference to FIG. 1 through 2B, depicts an exemplary block diagram of the system 100 illustrating classifier augmented filtered iterative back-translation for domain adaptation of NMT models, in accordance with an embodiment of the present disclosure.
[043] FIG. 3, with reference to FIGS. 1 through 2C, depicts an exemplary flow chart illustrating a method for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models, using the system of FIG. 1 and 2A through 2C, in accordance with an embodiment of the present disclosure. In an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, and the block diagram of the system 100 depicted in FIGS. 2A through 2C. In an embodiment, at step 202 of the present disclosure, the one or more hardware processors 104 obtain an input comprising (i) an out-of-domain parallel corpora (ODPC) Dp, (ii) an in-domain monolingual corpora corresponding to a source language (IDMCSL) Ms, and (iii) an in-domain monolingual corpora corresponding to a target language (IDMCTL) Mt. In the present disclosure, it is assumed that the system 100 has access to out-of-domain parallel corpora and in-domain monolingual corpora in both source and target languages. For better
understanding of the embodiments of the present disclosure, the Medical domain has been considered as an example. Other examples of domains may include Law, IT, and the like. It is to be understood by a person having ordinary skill in the art or a person skilled in the art that examples of such domains shall not be construed as limiting the scope of the present disclosure. Referring to step 202, an example of the out-of-domain parallel corpora Dp is illustrated in below Table 1:
Table 1
EN DE
i declare resumed the session of the european parliament adjourned on friday 17 december 1999, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period. ich erkläre die am freitag, dem 17. dezember 1999 unterbrochene sitzungsperiode des europäischen parlaments für wiederaufgenommen, wünsche ihnen nochmals alles gute zum jahreswechsel und hoffe, daß sie schöne ferien hatten.
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam, who had visited the european parliament just a few months ago. zu den attentatsopfern, die es in jüngster zeit in sri lanka zu beklagen gab, zählt auch herr kumar ponnambalam, der dem europäischen parlament erst vor wenigen monaten einen besuch abgestattet hatte.
[044] Example of in-domain monolingual corpora corresponding to a source language Ms is illustrated in below Table 2:
Table 2
Monolingual sentences (DE)
lesen sie die gesamte packungsbeilage sorgfältig durch , bevor sie mit der einnahme dieses arzneimittels beginnen.
abilify gehört zu einer gruppe von arzneimitteln , die antipsychotika genannt werden.
[045] Example of in-domain monolingual corpora corresponding to a target language Mt is illustrated in below Table 3:
Table 3:
Monolingual sentences (EN)
people with this condition may also feel depressed, guilty, anxious or tense.
if you notice you are gaining weight, experience any difficulty in swallowing or allergic symptoms, please tell your doctor.
[046] Referring to the steps of FIG. 3, in an embodiment, at step 204 of the present disclosure, the one or more hardware processors 104 train a first Neural Machine Translation (NMT) model using the out-of-domain parallel corpora Dp to obtain a first trained NMT model (FTNMTM). The first trained NMT model (FTNMTM) is based on translation of a first set of sentences from the source language to the target language. An example of translation of the first set of sentences from the source language to the target language by the first trained NMT model (FTNMTM) is illustrated in below Table 4:
Table 4
Original sentences (DE) Translated sentences (EN)
das ist wirklich ein muss für unser land. this is really a true one for our country.
was sind die bedingungen für die entstehung von planeten und leben? what are the conditions for the planet and lives?
[047] Similarly, in an embodiment, at step 206 of the present disclosure, the one or more hardware processors 104 train a second Neural Machine Translation (NMT) model using the out-of-domain parallel corpora Dp to obtain a second trained NMT model (STNMTM). The second trained NMT model (STNMTM) is based on translation of a second set of sentences from the target language to the source language. It is to be understood by a person having ordinary skill in the art or person skilled in the art that the steps of training the first NMT model and the second NMT model can either be performed sequentially or in parallel, in one embodiment of the present disclosure. It is to be understood by a person having ordinary skill in the art or person skilled in the art that the second set shall not be construed with a literal meaning and shall refer to a set of sentences like any other set of sentences. An example of translation of the second set of sentences from the target language to the source language by the second trained NMT model (STNMTM) is illustrated in below Table 5:
Table 5
Original sentences (EN) Translated sentences (DE)
a black box in your car? ein schwarzen in ihrem auto?
americans don't buy as much gas as they used to. die amerikaner kaufen nicht so viel gas wie früher.
[048] Referring to steps 204 and 206, the first NMT model (NMTs→t) and the second NMT model (NMTt→s) are trained using the out-of-domain parallel corpora Dp. In an embodiment, at step 208 of the present disclosure, the one or more hardware processors 104 are configured to perform one or more steps iteratively until a convergence of the first trained NMT model (FTNMTM) and the second trained NMT model (STNMTM) is obtained. For instance, at step 208a of the present disclosure, the one or more hardware processors 104 translate one or more sentences from the in-domain monolingual corpora in the target language (IDMCTL) and the in-domain monolingual corpora in the source language (IDMCSL) using the second trained NMT model (STNMTM) and the first trained NMT model (FTNMTM) respectively to obtain (i) one or more source based translated sentences (SBTS), and (ii) one or more target based translated sentences (TBTS). In other words, one or more sentences from the in-domain monolingual corpora in the target language (IDMCTL) are translated using the second trained NMT model (STNMTM), and one or more sentences from the in-domain monolingual corpora in the source language (IDMCSL) are translated using the first trained NMT model (FTNMTM). The trained NMT models, NMTs→t and NMTt→s, are used for translating in-domain sentences of the source language to the target language and vice versa, in one example embodiment. Examples of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences are illustrated in below Table 6:
Table 6
Direction Monolingual sentences Translated sentences
EN->DE concurrent administration of potentially nephrotoxic drugs should be avoided as there might be an increased risk of renal toxicity. die gegenwärtige regierung von potenziell nebulösen medikamenten sollte verhindert werden , dass es zu einem höheren risiko der stadt käme.
EN->DE experience from clinical studies. erfahrungen aus klinischen studien.
DE->EN nach Öffnen und Rekonstitution zur sofortigen Anwendung und zum einmaligen Gebrauch bestimmt . after open and reasserting itself, as well as to open use and to the one-time use of nations.
DE->EN bei einigen Patienten treten schwerwiegendere Symptome auf, die s die Stimmung oder die Fähigkeit , klar zu denken beeinträchtigen können. in a few patients, more serious symptoms on seeing the mood or the ability to think.
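By way of illustration, the back-translation in step 208a may be performed, for example, with the FairSeq toolkit referred to later in this disclosure, using its hub interface for loading a trained checkpoint. The sketch below is only one possible way of doing so; the checkpoint directory, binarized data path, BPE code file and monolingual file names are hypothetical assumptions, while the from_pretrained and translate calls follow fairseq's documented interface.

# Sketch: back-translating in-domain monolingual sentences (step 208a) with a
# previously trained fairseq transformer checkpoint. All paths are hypothetical.
from fairseq.models.transformer import TransformerModel

# Load the trained target-to-source model (e.g., EN->DE) from its checkpoint.
nmt_t2s = TransformerModel.from_pretrained(
    "checkpoints/base_en_de",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/news",
    bpe="subword_nmt",
    bpe_codes="codes.bpe",
)

with open("monolingual.en", encoding="utf-8") as f:
    monolingual_target = [line.strip() for line in f]

# Synthetic source-side sentences, one per in-domain target-language sentence.
synthetic_source = [nmt_t2s.translate(sentence) for sentence in monolingual_target]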
[049] At step 208b of the present disclosure, the one or more hardware processors 104 apply (i) a first classifier-based filtering model (FCBDM) on the one or more source based translated sentences (SBTS), and (ii) a second classifier-based filtering model (SCBDM) on the one or more target based translated sentences (TBTS) to obtain a set of source based filtered sentences (SBFS) and a set of target based filtered sentences (TBFS) respectively. In other words, a classifier-based filtering model is then applied to these translated sentences to obtain source-based and target-based filtered sentences. In an embodiment, the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network (e.g., neural network as known in the art). Examples of the set of source based filtered sentences and the set of target based filtered sentences are illustrated in below Table 7:
Table 7
Filtered sentences (DE) Filtered sentences (EN)
erfahrungen aus klinischen studien. in a few patients, more serious symptoms on seeing the mood or the ability to think.
patienten mit hepatitika sollten sorgfältig für anzeichen des fketzes vorsichtig sein. it is important that you take olanzapine teva tablets for as long as your doctor recommends.
[050] At step 208c of the present disclosure, the one or more hardware processors 104 generate (i) a first synthetic parallel data (FSPD) based on the set of source based filtered sentences (SBFS) and the in-domain monolingual corpora corresponding to the target language (IDMCTL), and (ii) a second synthetic parallel data (SSPD) based on the set of target based filtered sentences (TBFS) and the in-domain monolingual corpora corresponding to the source language (IDMCSL). The source-based and target-based filtered sentences along with their corresponding in-domain monolingual sentences are used to curate synthetic parallel data. Examples of the first synthetic parallel data and the second synthetic parallel data are illustrated in below Table 8:
Table 8
DE→EN
erfahrungen aus klinischen studien. experience from clinical studies.
patienten mit hepatitika sollten sorgfältig für anzeichen des fketzes vorsichtig sein. patients with hepatic impairment should be observed carefully for signs of fentanyl toxicity.
EN→ DE
in a few patients, more serious symptoms on seeing the mood or the ability to think. bei einigen Patienten treten schwerwiegendere Symptome auf, die s die Stimmung oder die Fähigkeit , klar zu denken beeinträchtigen können.
it is important that you take olanzapine teva tablets for as long as your doctor recommends. es ist wichtig , dass Sie Olanzapin Teva Tabletten so lange einnehmen , wie Ihr Arzt es Ihnen empfiehlt.
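A minimal sketch of the filtering and pairing performed in steps 208b and 208c is given below. The in_domain_probability callable stands in for the classifier-based filtering model (its interface is an assumption, not a fixed API), and the 0.6 threshold is the Medical-domain value mentioned later in this disclosure.

# Sketch of steps 208b and 208c: keep only synthetic sentences that the domain
# classifier scores as in-domain, and pair each kept sentence with its authentic
# monolingual counterpart to curate synthetic parallel data.
from typing import Callable, List, Tuple

def curate_synthetic_parallel_data(
    synthetic: List[str],                     # classifier is applied to these
    authentic: List[str],                     # corresponding in-domain monolingual sentences
    in_domain_probability: Callable[[str], float],
    threshold: float = 0.6,                   # Medical-domain threshold in this disclosure
) -> List[Tuple[str, str]]:
    pairs = []
    for synth_sent, auth_sent in zip(synthetic, authentic):
        if in_domain_probability(synth_sent) >= threshold:
            pairs.append((synth_sent, auth_sent))
    return pairs

# Toy usage with a stand-in classifier: only the first (in-domain) pair is kept.
demo_probability = lambda s: 0.9 if "klinischen studien" in s else 0.1
print(curate_synthetic_parallel_data(
    ["erfahrungen aus klinischen studien.", "sie konnten die schäden teilweise begrenzen."],
    ["experience from clinical studies.", "authentic monolingual sentence 2"],
    demo_probability))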
[051] At step 208d of the present disclosure, the one or more hardware processors 104 fine-tune (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data. The steps 208a through 208d are iteratively performed to determine improvement of the first trained NMT model and the second trained NMT model, in terms of accuracy of domain adaptation for NMT models, over the iterations for reaching the convergence. The iterations may result in a first fine-tuned trained NMT model (FFTTNMTM) and a second fine-tuned trained NMT model (SFTTNMTM), in one embodiment of the present disclosure. In other words, NMTs→t and NMTt→s are fine-tuned on this synthetic parallel data. Examples of iterations of translations and filtering output are depicted in below Table 9:
Table 9
Source (DE) warnhinweis, dass das arzneimittel für kinder unerreichbar und nicht sichtbar aufzubewahren ist
Target (EN) (Ground truth) special warning that the medicinal product must be stored out of the reach and sight of children
FTNMTM translation warning that drugs for children is unvisible and not visible.
1st Iteration translation warning that the medicinal product is being unabsorbed and has not been visible.
2nd Iteration translation warning that the medicines for children are not being able and not visible.
3rd iteration translation warning that the medicinal product must be stored out of the reach and sight of children.
[052] In an embodiment of the present disclosure, the convergence of the first trained NMT model and the second trained NMT model is obtained based on a comparison of (i) a fine-tuned output (e.g., BLEU score) associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output (e.g., BLEU score) associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration. This entire process repeats until convergence. The NMT models, NMTs→t and NMTt→s, are considered to have converged when there is no improvement in either model when compared to its preceding iteration. Examples of various iterations are illustrated below in Table 10, wherein a comparison is depicted between a current iteration and a previous iteration, for instance, a fine-tuned output comparison of the 1st iteration versus the 2nd iteration, of the 2nd iteration versus the 3rd iteration, of the 3rd iteration versus the 4th iteration, and the like.
Table 10
Source (DE) warnhinweis, dass das arzneimittel für kinder unerreichbar und nicht sichtbar aufzubewahren ist
Target (EN) (Ground truth) special warning that the medicinal product must be stored out of the reach and sight of children
FTNMTM translation warning that drugs for children is unvisible and not visible.
1st Iteration translation warning that the medicinal product is being unabsorbed and has not been visible.
2nd Iteration translation warning that the medicines for children are not being able and not visible.
3rd iteration translation warning that the medicinal product must be stored out of the reach and sight of children.
4th iteration translation warning that the medicinal product must be stored out of the reach and sight of children.
[053] As can be observed from the above Table 10, there is not much change between the 3rd iteration and the 4th iteration. This is when the first trained NMT model and the second trained NMT model can be assumed to be converging. Further, each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier, wherein the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences. Examples of in-domain
sentences and out-of-domain sentences being distinguished by the pre-trained binary classifier of the system 100, are illustrated in below Table 11:
Table 11
In-domain Out-of-domain
instructions for proper use cerezyme is given through a drip into a vein (by intravenous infusion). we propose a temporary commission composed of independent persons who are not themselves tainted.
concurrent administration of potentially nephrotoxic drugs should be avoided as there might be an increased risk of renal toxicity. there are also wardrobes with good storage space, a safety deposit box and iron.
wasser hat keinen einfluss auf die bioverfügbarkeit von neoclarityn schmelztabletten . alcudia liegt zwischen zwei buchten, mit einigen der schönsten strände des mittelmeeres.
andere mechanismen könnten ebenfalls zur zytotoxischen wirkung von nelarabin beitragen. pro jahr zählt der dresdner airport über 15,000 führungsteilnehmer.
[054] Below is an exemplary pseudo code/algorithm for classifier augmented filtered iterative back-translation for domain adaptation of neural machine translation models as implemented by the system 100 of the present disclosure:
Require: Dp – Out-of-domain parallel Corpora
Require: Ms – in-domain monolingual Corpora in source language
Require: Mt – in-domain monolingual Corpora in target language
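The pseudo code above lists only the required inputs. For illustration, a minimal Python-style sketch of the CFIBT loop described in paragraphs [046] through [052] is given below. The helper callables (train_nmt, fine_tune_nmt, evaluate_bleu), the classifier functions clf_src and clf_tgt, and the translate method on the model objects are hypothetical placeholders supplied by the caller rather than part of any specific toolkit, and the threshold of 0.6 is the Medical-domain value used in this disclosure.

# Minimal sketch of Classifier augmented Filtered Iterative Back-Translation (CFIBT).
# All helper callables and the model interface are hypothetical placeholders.
def cfibt(Dp, Ms, Mt, train_nmt, fine_tune_nmt, clf_src, clf_tgt, evaluate_bleu,
          threshold=0.6, max_iters=10):
    # Steps 204 and 206: train the base models on the out-of-domain parallel corpora Dp.
    nmt_s2t = train_nmt(Dp, direction="s2t")
    nmt_t2s = train_nmt(Dp, direction="t2s")

    prev_bleu_s2t = prev_bleu_t2s = 0.0
    for _ in range(max_iters):
        # Step 208a: back-translate the in-domain monolingual corpora.
        synthetic_src = [nmt_t2s.translate(sent) for sent in Mt]  # source based translated sentences
        synthetic_tgt = [nmt_s2t.translate(sent) for sent in Ms]  # target based translated sentences

        # Steps 208b and 208c: keep pairs whose synthetic side is classified as
        # in-domain and pair them with their authentic monolingual counterparts.
        data_s2t = [(syn, auth) for syn, auth in zip(synthetic_src, Mt) if clf_src(syn) >= threshold]
        data_t2s = [(syn, auth) for syn, auth in zip(synthetic_tgt, Ms) if clf_tgt(syn) >= threshold]

        # Step 208d: fine-tune each model on its curated synthetic parallel data.
        nmt_s2t = fine_tune_nmt(nmt_s2t, data_s2t)
        nmt_t2s = fine_tune_nmt(nmt_t2s, data_t2s)

        # Convergence: stop when neither model improves over the previous iteration.
        bleu_s2t, bleu_t2s = evaluate_bleu(nmt_s2t), evaluate_bleu(nmt_t2s)
        if bleu_s2t <= prev_bleu_s2t and bleu_t2s <= prev_bleu_t2s:
            break
        prev_bleu_s2t, prev_bleu_t2s = bleu_s2t, bleu_t2s

    return nmt_s2t, nmt_t2s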
[055] Furthermore, the one or more hardware processors 104 are configured to predict, via the first classifier-based filtering model and the second classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences being an in-domain sentence or an out-of-domain sentence, as mentioned above. Examples of the one or more source based translated sentences and the one or more target based translated sentences being predicted as in-domain or out-of-domain sentences, along with their corresponding probabilities, are illustrated in below Table 12:
Table 12
DE Probability EN Probability
maternaltoxische effekte traten in dem dosisbereich auf, in dem auch toxische effekte auf die intrauterine entwicklung beobachtet worden waren. 0.981 blood and the lymphatic system disorders. 0.988
überempfindlichkeit gegen den wirkstoff oder einen der sonstigen bestandteile. 0.999 metabolism and nutrition disorders: 0.944
sie konnten die schäden teilweise begrenzen. 0.032 a republican strategy to counter the re-election of obama. 0.017
sie konnten die schäden teilweise begrenzen. 0.142 the republican authorities were quick to extend this practice to other states. 0.068
[056] Once the above probability is predicted, the one or more hardware processors 104 perform a comparison of (a) the probability of at least one of (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences and (b) a pre-defined threshold. The pre-defined threshold is 0.6 for the medical domain considered as an example by the present disclosure, in one embodiment. Examples of the probability comparison with the pre-defined threshold are illustrated for the one or more source based translated sentences and the one or more target based translated sentences in below Table 13:
Table 13
DE Probability Threshold
maternaltoxische effekte traten in dem dosisbereich auf, in dem auch toxische effekte auf die intrauterine entwicklung beobachtet worden waren. 0.981 0.6
überempfindlichkeit gegen den wirkstoff oder einen der sonstigen bestandteile. 0.999 0.6
sie konnten die schäden teilweise begrenzen. 0.032 0.6
sie konnten die schäden teilweise begrenzen. 0.142 0.6
EN Probability Threshold
blood and the lymphatic system disorders. 0.988 0.6
metabolism and nutrition disorders: 0.944 0.6
a republican strategy to counter the re-election of obama. 0.017 0.6
the republican authorities were quick to extend this practice to other states. 0.068 0.6
[057] Upon comparison, the one or more source based translated sentences, and the one or more target based translated sentences are identified (or classified) as the in-domain sentence or the out-of-domain sentence based on the comparison. In the present disclosure, as mentioned above, “classifier-based filtering model” using a Convolutional Neural Network is implemented by the system 100. The classifier-based filtering model consists of the binary classifier which is trained to distinguish between in-domain and out-of-domain sentences as described and depicted above. For each domain, two such models are trained - one for source and other for the target language. The binary classifier is trained only once at the beginning, and the same pre-trained binary classifier is utilized in each successive iteration. Given a translated sentence as input, the classifier predicts the probability of this sentence as being in-domain or out-of-domain. All translated sentences having a probability greater than or equal to a certain threshold (e.g., 0.6 as shown above) are considered in-domain sentences. Else, if the probability is less than the pre-defined threshold, then the translated sentences are identified as out-of-domain sentences. Examples of the one or more source based translated sentences, and the one or more target based translated sentences with the probability comparison with the pre-defined threshold to identify as in-domain sentence or out-of-domain sentence are illustrated in Tables 14 and 15, respectively.
Table 14
DE Sentences Probability condition Threshold Class
maternaltoxische effekte traten in dem dosisbereich auf, in dem auch toxische effekte auf die intrauterine entwicklung beobachtet worden waren. 0.981 > 0.6 In-domain
überempfindlichkeit gegen den wirkstoff oder einen der sonstigen bestandteile. 0.999 > 0.6 In-domain
sie konnten die schäden teilweise begrenzen. 0.032 < 0.6 Out-of-domain
sie konnten die schäden teilweise begrenzen. 0.142 < 0.6 Out-of-domain
Table 15
EN Sentences Probability condition Threshold Class
blood and the lymphatic system disorders. 0.988 > 0.6 In-domain
metabolism and nutrition disorders: 0.944 > 0.6 In-domain
a republican strategy to counter the re-election of obama. 0.017 < 0.6 Out-of-domain
the republican authorities were quick to extend this practice to other states. 0.068 < 0.6 Out-of-domain
[058] In an embodiment of the present disclosure, the pre-defined threshold is based on (i) the in-domain monolingual corpora corresponding to the source language, (ii) the in-domain monolingual corpora corresponding to the target language, and (iii) an out-of-domain corpora corresponding to the source language and the target language, respectively. FIGS. 4A and 4B, with reference to FIGS. 1 through 3, depict graphical representations illustrating selection of a pre-defined threshold for a specific domain, in accordance with an embodiment of the present disclosure. More specifically, FIGS. 4A and 4B depict graphical representations illustrating selection of the pre-defined threshold for the medical domain. In the present disclosure, for choosing the threshold, the development datasets of the in-domain and out-of-domain corpora were used. A histogram was plotted by the system 100 of the probability range versus the number of in-domain and out-of-domain sentences in the development dataset having a probability in the given range. From FIGS. 4A and 4B, it can be observed that from the range 0.6-0.7 the number of in-domain sentences starts increasing and the number of out-of-domain sentences starts decreasing. Therefore, the system 100 has chosen 0.6 as the threshold (e.g., the pre-defined threshold) in both cases.
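For illustration, a sketch of this threshold-selection procedure is given below. The in_domain_probability callable and the development-set sentence lists are hypothetical placeholders; matplotlib is used only to reproduce the kind of histogram shown in FIGS. 4A and 4B.

# Sketch: choose the classifier-probability threshold by plotting histograms of
# in-domain vs. out-of-domain development sentences (cf. FIGS. 4A and 4B).
import matplotlib.pyplot as plt

def plot_probability_histograms(dev_in_domain, dev_out_domain, in_domain_probability):
    p_in = [in_domain_probability(s) for s in dev_in_domain]
    p_out = [in_domain_probability(s) for s in dev_out_domain]
    bins = [i / 10 for i in range(11)]        # probability ranges 0.0-0.1, ..., 0.9-1.0
    plt.hist(p_in, bins=bins, alpha=0.6, label="in-domain")
    plt.hist(p_out, bins=bins, alpha=0.6, label="out-of-domain")
    plt.xlabel("classifier probability range")
    plt.ylabel("number of sentences")
    plt.legend()
    plt.show()
    # The threshold is chosen where in-domain counts start rising and
    # out-of-domain counts start falling (0.6 for the Medical domain here).

RESULTS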
[059] Here, datasets and the training details of experiments conducted by the present disclosure are described. DATASET DESCRIPTION
[060] Experiments have been performed in the present disclosure on the German-English (de-en) language pair. In both low and high resource scenarios, the same out-of-domain News dataset as used by Dou et al. (2020) was used, which is described in Table 16.
Table 16
Dataset Train Development Testing
High 4.5M 3K 3K
Low 100K 3K 3K
[061] For the in-domain data, as described in Table 17 below, the same development and testing datasets as used by Dou et al. (2020) for Medical (EMEA) and Law (Acquis) were used in the present disclosure, along with the same number of monolingual sentences during domain adaptation. In addition to Medical and Law, results on the IT domain are also reported in the present disclosure, for which the dataset described in Tiedemann (2012) was used.
Table 17
Dataset Monolingual Development Testing
Medical 400K 2K 2K
Law 500K 2K 2K
IT 240K 2.5K 1.8K
[062] The out-of-domain sentence pairs as well as the in-domain sentences were tokenized using Moses, and byte-pair encoding with 37K merge operations was applied.
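As an illustration of this preprocessing step, the following sketch uses the sacremoses and subword-nmt packages; the file names (codes.bpe) and the example sentence are hypothetical, and it is assumed that the BPE codes were already learned (for example with the subword-nmt learn-bpe command using 37K merge operations).

# Sketch of Moses tokenization followed by byte-pair encoding (BPE).
# Assumes BPE codes were already learned, e.g. with:
#   subword-nmt learn-bpe -s 37000 < corpus.tok.de > codes.bpe
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

tokenizer = MosesTokenizer(lang="de")
with open("codes.bpe", encoding="utf-8") as codes:    # hypothetical codes file
    bpe = BPE(codes)

def preprocess(line: str) -> str:
    """Tokenize a raw sentence and split it into BPE sub-word units."""
    tokenized = tokenizer.tokenize(line, return_str=True)
    return bpe.process_line(tokenized)

print(preprocess("Lesen Sie die gesamte Packungsbeilage sorgfältig durch."))

TRAINING DETAILS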
[063] Two types of models were trained, i.e., filtering models and NMTs.
[064] Filtering Models: The filtering models were trained in the English and German languages for each domain. A language model and a classifier were used as the filtering models for sent-LM and CFIBT, respectively. For the language model, a one-layer Long Short-Term Memory (LSTM) network with an embedding size of 512 and a sequence length of 50 was used, trained using the TF-LM toolkit. The model was trained until convergence with a patience of three. The models were trained on the tokenized monolingual in-domain dataset for each domain in English and German, with vocabulary sizes of 60K and 80K, respectively. For training the binary classifier, the present disclosure used sub-sampled out-of-domain data as one class and in-domain data as the other. The tokenized sentences were used with a vocabulary size of 50K. For the filtering models, optimal threshold values based on the development set have been obtained, where the objective is to maximize the true positives (i.e., in-domain sentences) and minimize the false positives (i.e., out-of-domain sentences) in the synthetic parallel corpus. The overall intuition is that the classifier should help to select the in-domain sentences, which could then be utilized to further train the NMT models. Use of the development dataset to obtain optimal thresholds (i.e., in-sample bias) may slightly inflate the final BLEU scores. For CFIBT, the present disclosure used 0.6 for Medical and 0.5 for Law (not shown in FIGS.) and IT (not shown in FIGS.) as the threshold over the classifier probability for filtering sentences in English and German.
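To make the filtering step of paragraph [064] concrete, the sketch below uses a TF-IDF plus logistic regression model as a deliberately lightweight stand-in for the binary domain classifier (the disclosure itself uses a neural classifier); the toy sentences are placeholders, and only the threshold-based decision mirrors the description above.

```python
# Sketch of classifier-based filtering of synthetic sentence pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: in-domain (label 1) vs sub-sampled out-of-domain (label 0).
in_domain = ["the patient was administered 50 mg of the drug twice daily",
             "adverse reactions include nausea and headache"]
out_of_domain = ["the republican authorities were quick to extend this practice",
                 "the match ended in a goalless draw"]
texts = in_domain + out_of_domain
labels = [1] * len(in_domain) + [0] * len(out_of_domain)

# Lightweight stand-in for the trained binary domain classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

def filter_synthetic_pairs(pairs, threshold=0.6):
    """Keep only (translated sentence, monolingual sentence) pairs whose translated
    side scores above the in-domain probability threshold."""
    translations = [translated for translated, _ in pairs]
    probs = classifier.predict_proba(translations)[:, 1]   # P(in-domain)
    return [pair for pair, p in zip(pairs, probs) if p >= threshold]

# Placeholder synthetic pairs produced by back-translation.
synthetic_pairs = [("the drug is administered once daily",
                    "das arzneimittel wird einmal taeglich verabreicht"),
                   ("the election results were announced",
                    "die wahlergebnisse wurden bekannt gegeben")]
print(filter_synthetic_pairs(synthetic_pairs))
```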
[065] NMT: The present disclosure used the Base-Transformer as known in the art for the experiments. FairSeq (Ott et al., 2019) was used for training all the NMT models. The out-of-domain parallel corpora was used to train the initial NMT model, i.e., BASE, in both the low and high resource scenarios. The NMT models obtained from BASE in CFIBT were fine-tuned with the synthetic parallel in-domain dataset, curated with the respective approaches. A patience value of five was used for all approaches.
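For the back-translation step that produces the synthetic sentences to be filtered, the sketch below assumes a FairSeq Transformer checkpoint trained as described in paragraph [065] and uses FairSeq's standard hub interface (from_pretrained/translate); the directory, checkpoint, binarized-data, and BPE file names are illustrative placeholders.

```python
# Sketch: use one trained NMT model to back-translate in-domain monolingual sentences,
# producing candidate synthetic pairs that would then pass through the classifier filter.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "checkpoints/base_en_de",                  # placeholder checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/news_en_de",   # placeholder binarized data directory
    tokenizer="moses",
    bpe="subword_nmt",
    bpe_codes="bpe.codes",                     # placeholder BPE codes file
)

# Placeholder in-domain monolingual sentences in one language.
monolingual_en = ["the patient was administered 50 mg of the drug twice daily"]

# Translate them with the model trained in the opposite direction.
back_translations = model.translate(monolingual_en)

# Pair each back-translation with its original monolingual sentence; these pairs are
# the synthetic parallel candidates to be filtered before fine-tuning.
synthetic_pairs = list(zip(back_translations, monolingual_en))
print(synthetic_pairs)
```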
RESULTS AND ANALYSIS
[066] Here, the results of the method of the present disclosure are compared with other baseline methods, as depicted in Table 18 below.
Table 18
Domain Medical Law IT
LP de-en en-de de-en en-de de-en en-de
High Resource
BASE 33.61 24.98 33.07 23.33 21.93 16.27
BT 41.05 36.32 38.27 28.32 35.31 24.80
sent-LM 47.44 37.85 40.82 30.35 39.24 30.11
IBT 47.71 38.01 39.46 29.04 38.93 29.37
DDSWIBT 45.46 36.45 39.11 29.04 - -
CFIBT (present disclosure) 47.59 37.61 40.99 30.38 40.06 29.93
Low Resource
BASE 10.05 6.53 8.52 6.84 5.05 3.70
BT 22.64 14.02 18.47 10.53 13.07 10.79
sent-LM 36.21 29.35 25.36 17.28 28.80 24.32
IBT 33.14 24.31 22.96 14.31 28.06 24.16
DDSWIBT 31.22 28.12 22.06 13.28 - -
CFIBT (present disclosure) 37.61 30.63 27.18 18.88 29.56 25.82
[067] As shown in Table 18, BLEU scores are compared in two different scenarios, viz., high resource and low resource, on three different domains, i.e., Medical, Law and IT, in both directions for the German-English (de-en) language pair on the in-domain test set. With monolingual data only in both the source and target language, performance gains of 27.56, 18.66 and 24.51 are obtained in terms of BLEU score for Medical, Law and IT in one direction (de-en), and 24.1, 12.04 and 22.12 in the other direction (en-de), in the low resource scenario over the BASE. In the low resource scenario, CFIBT outperforms sent-LM in both directions and all the domains. CFIBT also outperforms sent-LM in the high resource scenario except in one direction for the IT domain. In the low resource scenario, the filtering based approaches perform better than IBT, and CFIBT outperforms in all the cases. Results of the present disclosure show that CFIBT is efficient when the base model is not adequately trained. In the high resource scenario, results of CFIBT are comparable with the other baselines; CFIBT outperforms in both directions for the Law domain and in one direction for the IT domain.
[068] It was observed through experiments that IBT trained with all synthetic bilingual sentences without filtering may hurt the performance of IBT in subsequent iterations, because the current model is used to generate the data for the next iteration. But in CFIBT, filtering prevents training of the NMT model on out-of-domain sentence pairs, which leads to a better domain model in subsequent iterations. Hence, CFIBT as implemented by the system and method of the present disclosure has proven to outperform other conventionally known approaches.
[069] In the context of domain adaptation for NMT, embodiments of the present disclosure implement a system which executes a method that is a simple and effective approach for filtering the synthetic parallel corpus and is as good as more involved approaches for the same task. In the low resource scenario, the method of the present disclosure outperforms all the existing baselines, whereas results similar to the baselines are obtained in the high resource scenario.
[070] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[071] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software
means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[072] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[073] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[074] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-
readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[075] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method, comprising:
obtaining, via one or more hardware processors, an input comprising (i) an
out-of-domain parallel corpora, (ii) an in-domain monolingual corpora
corresponding to a source language, and (iii) an in-domain monolingual corpora
corresponding to a target language (202);
training, via the one or more hardware processors, a first Neural Machine
Translation (NMT) model using the out-of-domain parallel corpora to obtain a first
trained NMT model, wherein the first trained NMT model is based on translation
of a first set of sentences from the source language to the target language (204);
training, via the one or more hardware processors, a second Neural Machine
Translation (NMT) model using the out-of-domain parallel corpora to obtain a
second trained NMT model, wherein the second trained NMT model is based on
translation of a second set of sentences from the target language to the source
language (206);
iteratively performing, until a convergence of the first trained NMT model
and the second trained NMT model is obtained (208):
translating, via the one or more hardware processors, one or more
sentences from the in-domain monolingual corpora in the target language
and in-domain monolingual corpora in the source language using the second
trained NMT model and the first trained NMT model respectively to obtain
(i) one or more source based translated sentences, and (ii) one or more target
based translated sentences (208a);
applying, via the one or more hardware processors, (i) a first
classifier-based filtering model on the one or more source based translated
sentences, and (ii) a second classifier-based filtering model on the one or
more target based translated sentences to obtain a set of source based filtered
sentences and a set of target based filtered sentences respectively (208b);
generating, via the one or more hardware processors, (i) a first
synthetic parallel data based on the set of source based filtered sentences
and the in-domain monolingual corpora corresponding to the target
language, and (ii) a second synthetic parallel data based on the set of target based filtered sentences and the in-domain monolingual corpora corresponding to the source language (208c); and
fine-tuning, via the one or more hardware processors, (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data (208d).
2. The processor implemented method of claim 1, wherein the convergence of the first trained NMT model and the second trained NMT model is obtained based on a comparison of (i) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration.
3. The processor implemented method of claim 1, wherein the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network.
4. The processor implemented method of claim 1, wherein each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier, and wherein the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences.
5. The processor implemented method of claim 1, further comprising:
predicting, via the first classifier-based filtering model and the second
classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences;
performing a comparison of (a) the probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence and (b) a pre-defined threshold; and
identifying (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence based on the comparison.
6. The processor implemented method of claim 5, wherein the pre-defined threshold is based on the in-domain monolingual corpora corresponding to the source language, the in-domain monolingual corpora corresponding to the target language, and an out-of-domain corpora corresponding to the source language and the target language, respectively.
7. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
obtain an input comprising (i) an out-of-domain parallel corpora, (ii) an in-domain monolingual corpora corresponding to a source language, and (iii) an in-domain monolingual corpora corresponding to a target language (202);
train a first Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a first trained NMT model, wherein the first trained NMT model is based on translation of a first set of sentences from the source language to the target language;
train a second Neural Machine Translation (NMT) model using the out-of-domain parallel corpora to obtain a second trained NMT model, wherein the second trained NMT model is based on translation of a second set of sentences from the target language to the source language;
iteratively perform, until a convergence of the first trained NMT model and the second trained NMT model is obtained:
translating one or more sentences from the in-domain monolingual corpora in the target language and in-domain monolingual corpora in the source language using the second trained NMT model and the first trained NMT model respectively to obtain (i) one or more source based translated sentences, and (ii) one or more target based translated sentences;
applying (i) a first classifier-based filtering model on the one or more source based translated sentences, and (ii) a second classifier-based filtering model on the one or more target based translated sentences to obtain a set of source based filtered sentences and a set of target based filtered sentences respectively;
generating, via the one or more hardware processors, (i) a first synthetic parallel data based on the set of source based filtered sentences and the in-domain monolingual corpora corresponding to the target language, and (ii) a second synthetic parallel data based on the set of target based filtered sentences and the in-domain monolingual corpora corresponding to the source language; and
fine-tuning (i) the first trained NMT model using the first synthetic parallel data and (ii) the second trained NMT model using the second synthetic parallel data.
8. The system of claim 7, wherein the convergence of the first trained NMT model and the second trained NMT model is obtained based on a comparison of (i) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a current iteration, and (ii) a fine-tuned output associated with the first trained NMT model and the second trained NMT model corresponding to a previous iteration.
9. The system of claim 7, wherein the first classifier-based filtering model and the second classifier-based filtering model are applied on (i) the one or more source
based translated sentences, and (ii) the one or more target based translated sentences respectively using a Convolutional Neural Network.
10. The system of claim 7, wherein each of the first classifier-based filtering model and the second classifier-based filtering model comprises a pre-trained binary classifier, and wherein the pre-trained binary classifier is configured to distinguish between one or more in-domain sentences and one or more out-of-domain sentences.
11. The system of claim 7, wherein the one or more hardware processors are further configured by the instructions to:
predict, via the first classifier-based filtering model and the second classifier-based filtering model, a probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences;
perform a comparison of (a) the probability of at least one of (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence and (b) a pre-defined threshold; and
identify (i) one or more source based translated sentences, and (ii) one or more target based translated sentences as an in-domain sentence or an out-of-domain sentence based on the comparison.
12. The system of claim 11, wherein the pre-defined threshold is based on the in-domain monolingual corpora corresponding to the source language, the in-domain monolingual corpora corresponding to the target language, and an out-of-domain corpora corresponding to the source language and the target language, respectively.