System And Method For Speech Recognition

Abstract: A system and a method for speech recognition are provided. The method comprises generating at least one dictionary containing dialectically variated pronunciations of a predefined set of words. The method further comprises storing the generated at least one dictionary in a memory of a user device. The method further comprises analysing user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device. The method further comprises determining dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words. The method further comprises deleting other of the dialectically variated pronunciations than the determined dialectically variated pronunciations to provide at least one optimized dictionary. The method further comprises implementing the at least one optimized dictionary for speech recognition on the user device. FIG. 5

Patent Information

Application #:
Filing Date: 31 December 2019
Publication Number: 27/2021
Publication Type: INA
Invention Field: ELECTRONICS
Status:
Email: sujit@jupiterlawpartners.com
Parent Application:

Applicants

Mihup Communications Private Limited
Module No- 3A & 3B, Tower-II, Millennium City IT Park, DN-62, DN Block, Sector-V, Salt Lake, Kolkata-700 091, India

Inventors

1. Sandipan Mandal
c/o Mr. H. N. Mandal, near Vivekananda Eye Hospital, Tinamtala, P.O. Maslandapur, P.S. Habra, North 24 Parganas, West Bengal - 743289, India
2. Soumya Sankar Ghosh
64/5/1 Suren Sarkar Road, Beliaghata, Kolkata, West Bengal - 700010, India

Specification

WE CLAIM: -

1. A method for speech recognition, comprising:
generating at least one dictionary containing dialectically variated pronunciations of a predefined set of words;
storing the generated at least one dictionary in a memory of a user device;
analysing user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device;
determining dialectically variated pronunciations, from the at least one dictionary, corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words;
deleting other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for the one or more of the predefined set of words from the at least one dictionary to provide at least one optimized dictionary, stored in the memory of the user device; and
implementing the at least one optimized dictionary for speech recognition on the user device.

2. The method of claim 1 further comprises generating multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words.

3. The method of claim 2 further comprises deleting other of the multiple dictionaries than the one or more of the multiple dictionaries associated with the determined dialectically variated pronunciations for the one or more of the predefined set of words.

4. The method of claim 1 further comprises:
recognizing user speech pattern based on the analysis of the user speech;
determining dialectically variated pronunciations, from the at least one dictionary, corresponding to the recognized user speech pattern for the one or more of the predefined set of words;
deleting other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the recognized user speech pattern for the one or more of the predefined set of words from the at least one dictionary to provide at least one further optimized dictionary, stored in the memory of the user device; and
implementing the at least one further optimized dictionary for speech recognition on the user device.

5. The method of claim 1, wherein analysis of the user speech comprises comparing the identified dialectically variated phonemes for the one or more of the predefined set of words with phonemes making up each of the dialectically variated pronunciations of the corresponding one or more of the predefined set of words contained in the at least one dictionary.

6. The method of claim 1, wherein the at least one dictionary is generated utilizing grapheme-to-phoneme (G2P) conversion technique implemented via an attention-based sequence-to-sequence (Seq2Seq) model on a deep neural network.

7. The method of claim 6, wherein the deep neural network includes a multi-layered recurrent neural network based encoder and a Long short-term memory (LSTM) Cell based decoder.

8. The method of claim 1 further comprising utilizing morphological segmentation technique involving:
segmenting a compound word, from the predefined set of words, into two or more strings, with each of the two or more strings being one of commonly known prefixes, commonly known suffixes and a valid word;
determining pronunciations of each of the two or more strings; and
concatenating the determined pronunciations of each of the two or more strings of the segmented compound word, to generate pronunciation of the targeted compound word.

9. The method of claim 1, wherein the predefined set of words includes out-of-vocabulary (OOV) words.

10. The method of claim 9, wherein the predefined set of words further includes commonly accepted abbreviations of the OOV words.

11. A system for speech recognition, comprising:
a server configured to train a model to generate at least one dictionary containing dialectically variated pronunciations of a predefined set of words; and
a user device comprising:
an audio-recording unit;
a memory having stored the generated at least one dictionary therein; and
a processor,
wherein the memory further having stored therein program instructions which when accessed by the processor cause the processor to:
record user speech, via the audio-recording unit, over a period of time;
analyse the user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device;
determine dialectically variated pronunciations, from the at least one dictionary, corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words;
delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for the one or more of the predefined set of words from the at least one dictionary to provide at least one optimized dictionary, stored in the memory of the user device; and
implement the at least one optimized dictionary for speech recognition on the user device.

12. The system of claim 11, wherein the processor of the user device is caused to generate the at least one dictionary utilizing the trained model.

13. The system of claim 11, wherein one or more of the server and the user device is configured to generate multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words.

14. The system of claim 13, wherein the memory having further stored the generated multiple dictionaries therein, and wherein the processor is further caused to delete other of the multiple dictionaries than the one or more of the multiple dictionaries associated with the determined dialectically variated pronunciations for the one or more of the predefined set of words.

15. The system of claim 11, wherein the processor is further caused to:
recognize user speech pattern based on the analysis of the user speech;
determine dialectically variated pronunciations, from the at least one dictionary, corresponding to the recognized user speech pattern for the one or more of the predefined set of words;
delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the recognized user speech pattern for the one or more of the predefined set of words from the at least one dictionary to provide at least one further optimized dictionary, stored in the memory of the user device; and
implement the at least one further optimized dictionary for speech recognition on the user device.

16. The system of claim 11, wherein, for analysis of the user speech, the processor is further caused to compare the identified dialectically variated phonemes for the one or more of the predefined set of words with phonemes making up each of the dialectically variated pronunciations of the corresponding one or more of the predefined set of words contained in the at least one dictionary.

17. The system of claim 11, wherein the server is implemented as a deep neural network and is caused to train the model to generate the at least one dictionary by utilizing grapheme-to-phoneme (G2P) conversion technique implemented via an attention-based sequence-to-sequence (Seq2Seq) model thereon.

18. The system of claim 17, wherein the server comprises a multi-layered recurrent neural network based encoder and a Long Short-term Memory (LSTM) Cell based decoder.

19. The system of claim 11, wherein the server is further caused to utilize morphological segmentation technique involving:
segmenting a compound word, from the predefined set of words, into two or more strings, with each of the two or more strings being one of commonly known prefixes, commonly known suffixes and a valid word;
determining pronunciations of each of the two or more strings; and
concatenating the determined pronunciations of each of the two or more strings of the segmented compound word, to generate pronunciation of the targeted compound word.

20. The system of claim 11, wherein the predefined set of words includes out-of-vocabulary (OOV) words and commonly accepted abbreviations of the OOV words.

Description

SYSTEM AND METHOD FOR SPEECH RECOGNITION

FIELD OF THE PRESENT DISCLOSURE

[0001] The present disclosure relates to speech recognition and, more particularly, to a system and a method for handling dialect and dynamic vocabulary in offline speech recognition via a user device, such as a user smartphone.

BACKGROUND
[0002] A speech recognition system analyses a user's speech to determine what the user said. Speech recognition systems are usually implemented in one of two forms: online over the Internet, or offline, embedded on a user's device (e.g., a smartphone or other suitable computing device). In other words, speech recognition involving mobile devices, such as mobile phones, is conventionally carried out either by sending the speech waveform over the network to a server or entirely locally on the processor fitted in the device/phone itself. Online speech recognition has the advantage that it can benefit from significant processing resources on a large server (the cloud), and it provides a data feed to the service provider that makes improvements and customization possible. However, online processing requires continuous network connectivity, which cannot be guaranteed in all locations or is undesirable in some instances due to network lag, data costs or privacy/security concerns. In particular, with online systems, when the network connection is poor, undesirable delays (latencies) are introduced into the speech recognition response, which can be frustrating for the user. As an alternative deployment, speech recognition systems can be delivered as software running embedded locally on the smartphone itself, which has a comparatively low footprint. Such offline embedded speech recognition capability is the preferred deployment for many, if not most, practical situations, as networks may be unavailable, intermittent or too expensive.
[0003] As speech recognition systems are deployed in ever increasing numbers and situations, they need to be sufficiently flexible to cope with a wide range of user responses and behaviour. Speech recognition systems also have to face situations where users speak out-of-vocabulary (OOV) words, i.e. words which do not belong to the speech recognition system's vocabulary, for example, a contact's name, the title of a song, the name of a place, or similar words which may be specific to a professional field, or words mumbled or spoken with an accent in an audio recording. Most conventional speech recognition systems, especially offline systems, are not designed to handle OOV words. To address this problem, a speech recognition mechanism may be configured to search for OOV words by expanding its dictionary. However, creating such an expanded dictionary is a problem in itself. Moreover, expanding the dictionary may increase the memory resources required by the speech recognition system, which is not always possible with offline systems because of limited resources, and may further make the search cumbersome, for example, by increasing search time.
[0004] Therefore, in light of the foregoing discussion, there exists a need to overcome the problems associated with conventional speech recognition systems, and particularly with offline speech recognition systems, which need to be able to handle OOV words.

SUMMARY
[0005] In an aspect, a method for speech recognition is provided. The method comprises generating at least one dictionary containing dialectically variated pronunciations of a predefined set of words. The method further comprises storing the generated at least one dictionary in a memory of a user device. The method further comprises analysing user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device. The method further comprises determining dialectically variated pronunciations, from the at least one dictionary, corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words. The method further comprises deleting other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for the one or more of the predefined set of words from the at least one dictionary to provide at least one optimized dictionary, stored in the memory of the user device. The method further comprises implementing the at least one optimized dictionary for speech recognition on the user device.
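For illustration only, the dictionary-pruning step described above may be pictured as the following minimal Python sketch: pronunciation variants whose phonemes do not match the dialectally variated phonemes identified in the user's speech are deleted, leaving an optimized dictionary. The function name, the phoneme symbols and the fallback behaviour are hypothetical assumptions, not the claimed implementation.

    def prune_dictionary(dictionary, user_phonemes):
        # dictionary: {word: [phoneme sequence, ...]} holding dialectal variants.
        # user_phonemes: set of dialectally variated phonemes identified from the
        # user's recorded speech.
        optimized = {}
        for word, variants in dictionary.items():
            matching = [v for v in variants if set(v) & user_phonemes]
            # Keep only the matching variants; fall back to all variants if none
            # match, so the word never disappears from the lexicon (an assumption).
            optimized[word] = matching if matching else variants
        return optimized

    # Example: two dialectal variants of one word; the user's speech exhibits the
    # (illustrative) phoneme "dx", so only the variant containing it is retained.
    lexicon = {"sandipan": [["s", "aa", "n", "d", "ii", "p", "aa", "n"],
                            ["s", "aa", "n", "dx", "i", "p", "a", "n"]]}
    print(prune_dictionary(lexicon, {"dx"}))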
[0006] In one or more embodiments, the method comprises generating multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words. The method further comprises deleting other of the multiple dictionaries than the one or more of the multiple dictionaries associated with the determined dialectically variated pronunciations for the one or more of the predefined set of words.
[0007] In one or more embodiments, the method further comprises recognizing user speech pattern based on the analysis of the user speech. The method further comprises determining dialectically variated pronunciations, from the at least one dictionary, corresponding to the recognized user speech pattern for the one or more of the predefined set of words. The method further comprises deleting other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the recognized user speech pattern for the one or more of the predefined set of words from the at least one dictionary to provide at least one further optimized dictionary, stored in the memory of the user device. The method further comprises implementing the at least one further optimized dictionary for speech recognition on the user device.
[0008] In one or more embodiments, analysis of the user speech comprises comparing the identified dialectically variated phonemes for the one or more of the predefined set of words with phonemes making up each of the dialectically variated pronunciations of the corresponding one or more of the predefined set of words contained in the at least one dictionary.
[0009] In one or more embodiments, the at least one dictionary is generated utilizing grapheme-to-phoneme (G2P) conversion technique implemented via an attention-based sequence-to-sequence (Seq2Seq) model on a deep neural network. Herein, the deep neural network includes a multi-layered recurrent neural network based encoder and a Long short-term memory (LSTM) Cell based decoder.
[0010] In one or more embodiments, the method further comprises utilizing morphological segmentation technique involving segmenting a compound word, from the predefined set of words, into two or more strings, with each of the two or more strings being one of commonly known prefixes, commonly known suffixes and a valid word; determining pronunciations of each of the two or more strings; and concatenating the determined pronunciations of each of the two or more strings of the segmented compound word, to generate pronunciation of the targeted compound word.
[0011] In one or more embodiments, the predefined set of words includes out-of-vocabulary (OOV) words. Herein, the predefined set of words further includes commonly accepted abbreviations of the OOV words.
[0012] In another aspect, a system for speech recognition is provided. The system comprises a server configured to train a model to generate at least one dictionary containing dialectically variated pronunciations of a predefined set of words. The system also comprises a user device. The user device comprises an audio-recording unit, a memory having stored the generated at least one dictionary therein, and a processor. Herein, the memory further has stored therein program instructions which, when accessed by the processor, cause the processor to: record user speech, via the audio-recording unit, over a period of time; analyse the user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device; determine dialectically variated pronunciations, from the at least one dictionary, corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words; delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for the one or more of the predefined set of words from the at least one dictionary to provide at least one optimized dictionary, stored in the memory of the user device; and implement the at least one optimized dictionary for speech recognition on the user device.
[0013] In one or more embodiments, the processor of the user device is caused to generate the at least one dictionary utilizing the trained model.
[0014] In one or more embodiments, one or more of the server and the user device is configured to generate multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words. The memory stores the generated multiple dictionaries therein, and the processor is further caused to delete other of the multiple dictionaries than the one or more of the multiple dictionaries associated with the determined dialectically variated pronunciations for the one or more of the predefined set of words.
[0015] In one or more embodiments, the processor is further caused to: recognize user speech pattern based on the analysis of the user speech; determine dialectically variated pronunciations, from the at least one dictionary, corresponding to the recognized user speech pattern for the one or more of the predefined set of words; delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the recognized user speech pattern for the one or more of the predefined set of words from the at least one dictionary to provide at least one further optimized dictionary, stored in the memory of the user device; and implement the at least one further optimized dictionary for speech recognition on the user device.
[0016] In one or more embodiments, for analysis of the user speech, the processor is further caused to compare the identified dialectically variated phonemes for the one or more of the predefined set of words with phonemes making up each of the dialectically variated pronunciations of the corresponding one or more of the predefined set of words contained in the at least one dictionary.
[0017] In one or more embodiments, the server is implemented as a deep neural network and is caused to train the model to generate the at least one dictionary by utilizing grapheme-to-phoneme (G2P) conversion technique implemented via an attention-based sequence-to-sequence (Seq2Seq) model thereon. Herein, the server comprises a multi-layered recurrent neural network based encoder and a Long short-term memory (LSTM) Cell based decoder.
[0018] In one or more embodiments, the server is further caused to utilize morphological segmentation technique involving: segmenting a compound word, from the predefined set of words, into two or more strings, with each of the two or more strings being one of commonly known prefixes, commonly known suffixes and a valid word; determining pronunciations of each of the two or more strings; and concatenating the determined pronunciations of each of the two or more strings of the segmented compound word, to generate pronunciation of the targeted compound word.
[0019] In one or more embodiments, the predefined set of words includes out-of-vocabulary (OOV) words. Herein, the predefined set of words further includes commonly accepted abbreviations of the OOV words.
[0020] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

BRIEF DESCRIPTION OF THE FIGURES
[0021] For a more complete understanding of example embodiments of the present disclosure, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
[0022] FIG. 1 illustrates a system that may reside on and may be executed by a computer, which may be connected to a network, in accordance with one or more exemplary embodiments of the present disclosure;
[0023] FIG. 2 illustrates a diagrammatic view of a user device, in accordance with one or more exemplary embodiments of the present disclosure;
[0024] FIG. 3 illustrates a block diagram of a system for speech recognition, in accordance with one or more exemplary embodiments of the present disclosure;
[0025] FIG. 4 illustrates a block diagram of a deep neural network implemented for generating dictionaries for speech recognition, in accordance with one or more exemplary embodiments of the present disclosure; and
[0026] FIG. 5 illustrates a flowchart depicting steps involved in a method for speech recognition, in accordance with one or more exemplary embodiments of the present disclosure.

DETAILED DESCRIPTION
[0027] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure is not limited to these specific details.
[0028] Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearance of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
[0029] The present disclosure relates to speech recognition, more specifically, to a system and a method for speech recognition. In disclosed implementations, the present disclosure may be embodied as a system, method, or computer program product. Accordingly, in some implementations, the present disclosure may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, resident software, micro-code, etc.) or an implementation combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, in some implementations, the present disclosure may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in the medium.
[0030] In some implementations, any suitable computer usable or computer readable medium (or media) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-usable, or computer-readable, storage medium (including a storage device associated with a computing device) may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable medium may include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a digital versatile disk (DVD), a static random access memory (SRAM), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, a media such as those supporting the internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be a suitable medium upon which the program is stored, scanned, compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of the present disclosure, a computer-usable or computer-readable, storage medium may be any tangible medium that can contain or store a program for use by or in connection with the instruction execution system, apparatus, or device.
[0031] In some implementations, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. In some implementations, such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. In some implementations, the computer readable program code may be transmitted using any appropriate medium, including but not limited to the internet, wireline, optical fibre cable, RF, etc. In some implementations, a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
[0032] In some implementations, computer program code for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java®, Smalltalk, C++ or the like. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. However, the computer program code for carrying out operations of the present disclosure may also be written in conventional procedural programming languages, such as the "C" programming language, PASCAL, or similar programming languages, as well as in scripting languages such as JavaScript, PERL, or Python. In the present implementations, the languages and tools used for training may include Python, TensorFlow™, Bazel, C and C++. Further, the decoder in the user device (as will be discussed) may use C, C++ or any processor-specific ISA. Furthermore, assembly code inside C/C++ may be utilized for specific operations. Also, the ASR (automatic speech recognition) and G2P decoders, along with the entire user system, can be run on embedded Linux (any distribution), Android, iOS, Windows, or the like, without any limitations. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the internet using an Internet Service Provider). In some implementations, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs) or other hardware accelerators, micro-controller units (MCUs), or programmable logic arrays (PLAs) may execute the computer readable program instructions/code by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0033] In some implementations, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus (systems), methods and computer program products according to various implementations of the present disclosure. Each block in the flowchart and/or block diagrams, and combinations of blocks in the flowchart and/or block diagrams, may represent a module, segment, or portion of code, which comprises one or more executable computer program instructions for implementing the specified logical function(s)/act(s). These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer program instructions, which may execute via the processor of the computer or other programmable data processing apparatus, create the ability to implement one or more of the functions/acts specified in the flowchart and/or block diagram block or blocks or combinations thereof. It should be noted that, in some implementations, the functions noted in the block(s) may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
[0034] In some implementations, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0035] In some implementations, the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed (not necessarily in a particular order) on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts (not necessarily in a particular order) specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0036] Referring now to the example implementation of FIG. 1, there is shown a system 100 that may reside on and may be executed by a computer (e.g., computer 12), which may be connected to a network (e.g., network 14) (e.g., the internet or a local area network). Examples of computer 12 may include, but are not limited to, a personal computer(s), a laptop computer(s), mobile computing device(s), a server computer, a series of server computers, a mainframe computer(s), or a computing cloud(s). In some implementations, each of the aforementioned may be generally described as a computing device. In certain implementations, a computing device may be a physical or virtual device. In many implementations, a computing device may be any device capable of performing operations, such as a dedicated processor, a portion of a processor, a virtual processor, a portion of a virtual processor, portion of a virtual device, or a virtual device. In some implementations, a processor may be a physical processor or a virtual processor. In some implementations, a virtual processor may correspond to one or more parts of one or more physical processors. In some implementations, the instructions/logic may be distributed and executed across one or more processors, virtual or physical, to execute the instructions/logic. Computer 12 may execute an operating system, for example, but not limited to, Microsoft® Windows®; Mac® OS X®; Red Hat® Linux®, or a custom operating system. (Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States, other countries or both; Mac and OS X are registered trademarks of Apple Inc. in the United States, other countries or both; Red Hat is a registered trademark of Red Hat Corporation in the United States, other countries or both; and Linux is a registered trademark of Linus Torvalds in the United States, other countries or both).
[0037] In some implementations, the instruction sets and subroutines of system 100, which may be stored on storage device, such as storage device 16, coupled to computer 12, may be executed by one or more processors (not shown) and one or more memory architectures included within computer 12. In some implementations, storage device 16 may include but is not limited to: a hard disk drive; a flash drive, a tape drive; an optical drive; a RAID array (or other array); a random access memory (RAM); and a read-only memory (ROM).
[0038] In some implementations, network 14 may be connected to one or more secondary networks (e.g., network 18), examples of which may include but are not limited to: a local area network; a wide area network; or an intranet, for example.
[0039] In some implementations, computer 12 may include a data store, such as a database (e.g., relational database, object-oriented database, triplestore database, etc.) and may be located within any suitable memory location, such as storage device 16 coupled to computer 12. In some implementations, data, metadata, information, etc. described throughout the present disclosure may be stored in the data store. In some implementations, computer 12 may utilize any known database management system such as, but not limited to, DB2, in order to provide multi-user access to one or more databases, such as the above noted relational database. In some implementations, the data store may also be a custom database, such as, for example, a flat file database or an XML database. In some implementations, any other form(s) of a data storage structure and/or organization may also be used. In some implementations, system 100 may be a component of the data store, a standalone application that interfaces with the above noted data store and/or an applet / application that is accessed via client applications 22, 24, 26, 28. In some implementations, the above noted data store may be, in whole or in part, distributed in a cloud computing topology. In this way, computer 12 and storage device 16 may refer to multiple devices, which may also be distributed throughout the network.
[0040] In some implementations, computer 12 may execute application 20 for speech recognition. In some implementations, system 100 and/or application 20 may be accessed via one or more of client applications 22, 24, 26, 28. In some implementations, system 100 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within application 20, a component of application 20, and/or one or more of client applications 22, 24, 26, 28. In some implementations, application 20 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within system 100, a component of system 100, and/or one or more of client applications 22, 24, 26, 28. In some implementations, one or more of client applications 22, 24, 26, 28 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within and/or be a component of system 100 and/or application 20. Examples of client applications 22, 24, 26, 28 may include, but are not limited to, a standard and/or mobile web browser, an email application (e.g., an email client application), a textual and/or a graphical user interface, a customized web browser, a plugin, an Application Programming Interface (API), or a custom application. The instruction sets and subroutines of client applications 22, 24, 26, 28, which may be stored on storage devices 30, 32, 34, 36, coupled to user devices 38, 40, 42, 44, may be executed by one or more processors and one or more memory architectures incorporated into user devices 38, 40, 42, 44.
[0041] In some implementations, one or more of storage devices 30, 32, 34, 36, may include but are not limited to: hard disk drives; flash drives, tape drives; optical drives; RAID arrays; random access memories (RAM); and read-only memories (ROM). Examples of user devices 38, 40, 42, 44 (and/or computer 12) may include, but are not limited to, a personal computer (e.g., user device 38), a laptop computer (e.g., user device 40), a smart/data-enabled, cellular phone (e.g., user device 42), a notebook computer (e.g., user device 44), a tablet (not shown), a server (not shown), a television (not shown), a smart television (not shown), a media (e.g., video, photo, etc.) capturing device (not shown), and a dedicated network device (not shown). User devices 38, 40, 42, 44 may each execute an operating system, examples of which may include but are not limited to, Android®, Apple® iOS®, Mac® OS X®; Red Hat® Linux®, or a custom operating system.
[0042] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of system 100 (and vice versa). Accordingly, in some implementations, system 100 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or system 100.
[0043] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of application 20 (and vice versa). Accordingly, in some implementations, application 20 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or application 20. As one or more of client applications 22, 24, 26, 28, system 100, and application 20, taken singly or in any combination, may effectuate some or all of the same functionality, any description of effectuating such functionality via one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof, and any described interaction(s) between one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof to effectuate such functionality, should be taken as an example only and not to limit the scope of the disclosure.
[0044] In some implementations, one or more of users 46, 48, 50, 52 may access computer 12 and system 100 (e.g., using one or more of user devices 38, 40, 42, 44) directly through network 14 or through secondary network 18. Further, computer 12 may be connected to network 14 through secondary network 18, as illustrated with phantom link line 54. System 100 may include one or more user interfaces, such as browsers and textual or graphical user interfaces, through which users 46, 48, 50, 52 may access system 100.
[0045] In some implementations, the various user devices may be directly or indirectly coupled to network 14 (or network 18). For example, user device 38 is shown directly coupled to network 14 via a hardwired network connection. Further, user device 44 is shown directly coupled to network 18 via a hardwired network connection. User device 40 is shown wirelessly coupled to network 14 via wireless communication channel 56 established between user device 40 and wireless access point (i.e., WAP) 58, which is shown directly coupled to network 14. WAP 58 may be, for example, an IEEE 802.11a, 802.11b, 802.11g, Wi-Fi®, RFID, and/or BluetoothTM (including BluetoothTM Low Energy) device that is capable of establishing wireless communication channel 56 between user device 40 and WAP 58. User device 42 is shown wirelessly coupled to network 14 via wireless communication channel 60 established between user device 42 and cellular network / bridge 62, which is shown directly coupled to network 14.
[0046] In some implementations, some or all of the IEEE 802.11x specifications may use Ethernet protocol and carrier sense multiple access with collision avoidance (i.e., CSMA/CA) for path sharing. The various 802.11x specifications may use phase-shift keying (i.e., PSK) modulation or complementary code keying (i.e., CCK) modulation, for example. BluetoothTM (including BluetoothTM Low Energy) is a telecommunications industry specification that allows, e.g., mobile phones, computers, smart phones, and other electronic devices to be interconnected using a short-range wireless connection. Other forms of interconnection (e.g., Near Field Communication (NFC)) may also be used.
[0047] Referring also to the example implementation of FIG. 2, there is shown a diagrammatic view of user device 200. While user device 200 is shown in this figure, this is for example purposes only and is not intended to be a limitation of this disclosure, as other configurations are possible. Additionally, any computing device capable of executing, in whole or in part, system 100 may be substituted for user device 38 (in whole or in part) within FIG. 2, examples of which may include but are not limited to computer 12 and/or one or more of user devices 38, 40, 42, 44.
[0048] User device 200 according to an exemplary embodiment may be embodied in various electronic devices capable of voice recognition, such as a digital assistant device (e.g., Alexa™, Siri™, Google Assistant™, Cortana™), a remote controller device, a smart TV, a smart phone, a tablet PC, an audio device, a navigation device, and the like.
[0049] In some implementations, user device 200 may include a processor and/or microprocessor (e.g., microprocessor 202) configured to, e.g., process data and execute the above-noted code / instruction sets and subroutines. Microprocessor 202 may be coupled via a storage adaptor (not shown) to the above-noted storage device(s) (e.g., storage device 30). An I/O controller (e.g., I/O controller 204) may be configured to couple microprocessor 202 with various devices, such as an input device, a pointing/selecting device (e.g., touchpad, touchscreen, mouse, etc.), and additional devices such as USB ports and printer ports (not shown). A display adaptor (e.g., display adaptor 206) may be configured to couple display 208 (e.g., touchscreen monitor(s), plasma, CRT, or LCD monitor(s), etc.) with microprocessor 202, while network controller/adaptor 210 (e.g., an Ethernet adaptor) may be configured to couple microprocessor 202 to the above-noted network 14 (e.g., the Internet or a local area network). Herein, display 208 may be embodied in a Liquid Crystal Display (LCD), an Organic Light Emitting Display (OLED), a Plasma Display Panel (PDP), or the like, and may provide the various display screens that can be provided through user device 200. In particular, display 208 may display a response message corresponding to the user's voice as text or as an image.
[0050] User device 200 also includes a memory 212. Memory 212 may be a storage medium that stores the user pronunciation lexicon used to perform voice recognition, and may be embodied in a storage device such as a Hard Disk Drive (HDD) and the like. For example, memory 212 may include a ROM to store programs for performing the operations of microprocessor 202 and a RAM to temporarily store data generated during the operation of microprocessor 202, and the like. In addition, memory 212 may further include an Electrically Erasable and Programmable ROM (EEPROM) to store various reference data.
[0051] In present implementations, user device 200 further includes an audio-recording unit 214. Audio-recording unit 214 may receive a user's uttered voice. For example, audio-recording unit 214 may be embodied in a microphone to receive the user's uttered voice. Audio-recording unit 214 may be integrated with user device 200 or may be provided as a separate unit apart from it. Audio-recording unit 214 may further process the received user's uttered voice. For example, audio-recording unit 214 may remove noise from the user's voice. When a user's voice is input in analog form, audio-recording unit 214 may sample it and convert it to a digital signal. Further, audio-recording unit 214 may calculate the energy of the converted digital signal and determine whether the energy of the digital signal is more than a predetermined value. In a case in which the energy of the digital signal is more than the predetermined value, audio-recording unit 214 may further remove a noise element from the digital signal and deliver it to microprocessor 202 and the like. For example, the noise element is an incidental noise that can occur in a home environment, which may include the sounds of an air conditioner, a vacuum cleaner, music, and the like. Meanwhile, in a case in which the energy of the digital signal is less than the predetermined value, audio-recording unit 214 does not perform any particular processing on the digital signal but waits for further input, so that the whole audio processing pipeline is not activated by sounds other than a user's uttered voice, and thereby unnecessary power consumption may be prevented.
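As a minimal, illustrative sketch only (not the claimed implementation), the energy check described above could be expressed in Python as follows; the frame size, the 16-bit PCM sample format and the threshold value are assumptions.

    import numpy as np

    def frame_energy(samples):
        # Mean squared amplitude of one audio frame (assumed 16-bit PCM samples).
        return float(np.mean(samples.astype(np.float64) ** 2))

    def should_process(samples, threshold=1.0e6):
        # Frames below the (illustrative) threshold are treated as background
        # sound and are not forwarded, avoiding unnecessary processing and power use.
        return frame_energy(samples) > threshold

    # Example usage with a synthetic silent frame and a louder frame.
    silence = np.zeros(160, dtype=np.int16)
    speech = (np.random.randn(160) * 5000).astype(np.int16)
    print(should_process(silence), should_process(speech))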
[0052] In present implementations, user device 200 further includes an audio-output unit 216. Audio-output unit 216 may be embodied in an output port, such as a jack, or in a speaker, and may output a response message corresponding to the user's voice as speech.
[0053] Referring now to FIG. 3, there is shown a block diagram of a system 300 for speech recognition, in accordance with one or more exemplary embodiments of the present disclosure. As illustrated, the system 300 includes a server 302. The server 302 is configured to train a model to generate at least one dictionary 304 containing dialectically variated pronunciations of a predefined set of words. In one implementation, the server 302 may be configured to train a model which, in turn, can be implemented in a user device (as discussed in the preceding paragraphs) to generate the said at least one dictionary 304, in consideration of dynamic content received by the user device. In this way, the present system 100 is able to handle dynamic vocabulary in the user device (i.e. dynamic words which may be specific to the user's vocabulary, as the user device generally receives sound signals from the user on a regular basis) using in-device user dictionary creation by the G2P model (trained by the server 302) and linguistic rules. The dictionary 304 provides a custom lexicon which, in turn, provides an association between words and their phonetic sequences. At software build time, a developer can identify a list of words that he/she wishes to add to the dictionary 304. At software runtime, if the speech recognition engine determines that a phonetic description of a speech signal obtained from a user matches a phonetic description stored in the dictionary 304, then it will recognize the word associated with the matching phonetic description from the dictionary 304. Thus, by providing the dictionary 304 beforehand, the developer can ensure that the speech recognition engine will recognize certain uniquely-pronounced words. In one or more embodiments, the predefined set of words includes out-of-vocabulary (OOV) words. That is, the dictionary 304 is particularly trained on contact names, songs and other dynamic content, which are typically OOV words. In the present implementations, the predefined set of words further includes commonly accepted abbreviations of the OOV words.
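Purely for illustration, the custom lexicon can be pictured as a mapping from each word to one or more phoneme sequences, with recognition succeeding when a decoded phoneme sequence matches a stored pronunciation. The words, phoneme symbols and function name below are hypothetical examples, not the actual contents of dictionary 304.

    # Hypothetical custom lexicon: word -> list of phonetic sequences.
    custom_lexicon = {
        "frnd":   [("f", "r", "e", "n", "d")],
        "sandip": [("s", "aa", "n", "d", "ii", "p"), ("s", "a", "n", "d", "i", "p")],
    }

    def recognize(phoneme_sequence, lexicon):
        # Return the word whose stored pronunciation matches the decoded phonemes.
        for word, pronunciations in lexicon.items():
            if tuple(phoneme_sequence) in pronunciations:
                return word
        return None  # out of vocabulary for this lexicon

    print(recognize(["s", "a", "n", "d", "i", "p"], custom_lexicon))  # -> "sandip"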
[0054] In the present embodiments, the server 302 is implemented as a deep neural network for training the said model for generation of the dictionary 304 containing dialectically variated pronunciations of the predefined set of words. The server 302, implemented as the deep neural network, is caused to generate the dictionary 304 by utilizing a grapheme-to-phoneme (G2P) conversion technique. FIG. 4 illustrates a schematic of a deep neural network 400 as implemented by the server 302. The server 302 utilizes a novel hybrid approach for improving the performance of grapheme-to-phoneme (G2P) conversion, using rules over an attention-based sequence-to-sequence deep neural network (Seq2Seq DNN) model. That is, an attention-based Seq2Seq DNN is utilized to build the G2P model. Herein, the automated G2P generation module is utilized to produce an accurate pronunciation dictionary. In particular, the proposed G2P model helps to produce more accurate and faster speech recognition output for offline scenarios, for example, for detecting a contact name, a song name and other dynamic content in a smartphone or the like. It is to be appreciated that most of such dynamic content is not included in the pronunciation dictionary and training of traditional speech recognition models.
[0055] In particular, as illustrated, the server 302, or specifically the deep neural network 400, comprises a multi-layered recurrent neural network based encoder (encoder 402) and a Long Short-term Memory (LSTM) Cell based decoder (decoder 404). As illustrated in FIG. 4, the decoder 404 has an attention mechanism over the outputs of the encoder 402 at different time steps, which helps with accuracy, speed and alignment to the input. Herein, the input graphemes are tokenized, padded and bucketed before being passed to the network, to support variable-length inputs and outputs. The bucketing also reduces the padding that needs to be done for short inputs, reducing the computational complexity. This model may be developed using the TensorFlow™ toolkit or the like, and the model's graph with saved weights may be deployed on different devices for online and offline inference. That is, as discussed, in one implementation, the server 302 may be configured to train a model which in turn can be implemented in a user device (as discussed in the preceding paragraphs) to generate the said at least one dictionary 304, in consideration of dynamic content received by the user device.
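As an illustrative sketch only, an attention-based Seq2Seq G2P model of the general kind described above might be assembled in Python with the TensorFlow/Keras API as follows. The vocabulary sizes, embedding and hidden dimensions are placeholders, and a single recurrent encoder layer is shown for brevity (the described encoder 402 is multi-layered); this is not the actual network 400.

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_GRAPHEMES, NUM_PHONEMES, EMB, UNITS = 60, 80, 128, 256  # placeholder sizes

    # Encoder: embeds the grapheme sequence and runs a recurrent layer over it.
    enc_in = layers.Input(shape=(None,), name="graphemes")
    enc_emb = layers.Embedding(NUM_GRAPHEMES, EMB, mask_zero=True)(enc_in)
    enc_seq, enc_h, enc_c = layers.LSTM(UNITS, return_sequences=True,
                                        return_state=True)(enc_emb)

    # Decoder: LSTM initialised with the encoder state (teacher forcing at training time).
    dec_in = layers.Input(shape=(None,), name="phonemes_shifted")
    dec_emb = layers.Embedding(NUM_PHONEMES, EMB, mask_zero=True)(dec_in)
    dec_seq = layers.LSTM(UNITS, return_sequences=True)(dec_emb,
                                                        initial_state=[enc_h, enc_c])

    # Attention of each decoder step over all encoder outputs at different time steps.
    context = layers.Attention()([dec_seq, enc_seq])
    merged = layers.Concatenate()([dec_seq, context])
    phoneme_probs = layers.Dense(NUM_PHONEMES, activation="softmax")(merged)

    g2p_model = tf.keras.Model([enc_in, dec_in], phoneme_probs)
    g2p_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")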
[0056] In one or more embodiments, the server 302 is configured to generate multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words. For example, in some implementations, the server 302 can be configured to generate dictionaries in consideration of location-based dialectical variations of the region where the present system is to be deployed. For instance, in the Indian context, the model can be used to generate three base dictionaries for the speech recognition decoding process. The three types of G2P-generated dictionaries are a Hinglish dictionary (Hinglish being a blend of Hindi and English, in particular a variety of English used by speakers of Hindi, characterized by frequent use of Hindi vocabulary or constructions), a Hindi dictionary and an English dictionary. Herein, for example, Hindi G2P may be used for all words in the contact list. The Hindi dictionary may be trained on a transliterated form (an English rendering of Hindi) and may have two parts, one in transliterated form (like SMS language) and another in Hindi script. The Hindi script is converted using an offline ITRANS converter, as known in the art, and an ITRANS-based Seq2Seq model may be used for the Hindi dictionary conversion. Further, in the present example, the English and Hinglish dictionaries would cover all words in the contact list except those in Hindi script.
[0057] These three generated dictionaries are considered as base dictionaries for the speech recognition decoding process. Initially, the three dictionaries are treated as a single dictionary for a user. As discussed, the three types of G2P-generated dictionaries are a Hinglish dictionary, a Hindi dictionary and an English dictionary (trained on some English names and songs). It may be appreciated that these dictionaries are particularly trained on contact names, songs and other dynamic content, which are typically OOV words. For the purpose of the present disclosure, a training dictionary containing 20,000 words was utilized.
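For illustration only, the per-dialect base dictionaries and their initial treatment as a single lexicon could be sketched as below; the g2p_models argument stands in for three hypothetical, already-trained G2P converters and is not part of the described system.

    def build_base_dictionaries(words, g2p_models):
        # g2p_models: e.g. {"english": fn, "hindi": fn, "hinglish": fn}, where each
        # fn maps a word to a list of phoneme sequences for that dialectal variation.
        dictionaries = {name: {w: g2p(w) for w in words}
                        for name, g2p in g2p_models.items()}

        # Initially, all base dictionaries are treated as one merged lexicon.
        merged = {}
        for per_dialect in dictionaries.values():
            for word, prons in per_dialect.items():
                merged.setdefault(word, []).extend(prons)
        return dictionaries, merged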
[0058] Dialectical variation and correction are performed on the Seq2Seq G2P output using rules. For example, the present Seq2Seq G2P server 302 is capable of handling unconventional variations of words. For instance, if a word is received with four consonants in sequence, it may be assumed that the word is an acronym, because in English generally at most three consonants occur in sequence within a word. For example, in the case of the word "apppp", the server 302 would fall back to character-wise phonemes. In another example, suppose the server gets the word "SNDP" for Sandip in the contact list; it would then build a file of all contact names which do not contain a vowel, such as "sndp" being equivalent to Sandip, "sndp" being equivalent to Sandipa, and "sadpn" being equivalent to Sandipan. It will then choose "sndp" for Sandip and Sandipa. This rule is also applicable to three-letter words (for example, all-consonant strings), with some exceptions. The server 302 may further be trained to cover the most frequent terms which are universally accepted, such as "frnd" being equivalent to friend, "clg" being equivalent to college, "scl" being equivalent to school, "Dr." being equivalent to doctor, "Mr." being equivalent to Mister, "Col." being equivalent to Colonel, and the like. Furthermore, two-letter words may be treated either as acronyms or as words: if a vowel is followed by a consonant (i.e. V+C) or a consonant is followed by a vowel (i.e. C+V), the string is assumed to be a word; however, if a consonant is followed by another consonant (i.e. C+C), it is treated as an acronym.
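A minimal, hypothetical sketch of the heuristics just described is given below; the abbreviation list, the consonant-run threshold and the function name are illustrative assumptions rather than the actual rule set.

    import re

    COMMON_ABBREVIATIONS = {"frnd": "friend", "clg": "college", "scl": "school",
                            "dr.": "doctor", "mr.": "mister", "col.": "colonel"}
    VOWELS = set("aeiou")

    def classify(token):
        word = token.lower()
        if word in COMMON_ABBREVIATIONS:
            return "abbreviation", COMMON_ABBREVIATIONS[word]
        if re.search(r"[^aeiou\W\d]{4}", word):
            # Four consecutive consonants: treat as an acronym and fall back to
            # character-wise phonemes (spelling the word out letter by letter).
            return "acronym", list(word)
        if len(word) == 2:
            a, b = word[0] in VOWELS, word[1] in VOWELS
            # C+C -> acronym; V+C or C+V -> ordinary word.
            return ("acronym", list(word)) if not (a or b) else ("word", word)
        return "word", word

    print(classify("sndp"))   # vowel-less contact-list entry handled like an acronym
    print(classify("frnd"))   # expands to "friend"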
[0059] In some implementations, more accuracy can be achieved by applying morphological segmentation of complex words. That is, the server 302 is further caused to utilize a morphological segmentation technique for complex words. The hybrid architecture of the G2P model, incorporating an attention-based sequence-to-sequence neural network and rules, allows for morphological segmentation of complex words and generation of their corresponding pronunciations. Firstly, compound words are segmented into simple words and fed to the attention-based Seq2Seq DNN to produce the initial G2P conversion. In the next step, the outputs are corrected using the phonetic rules.
[0060] The present server 302 implements new techniques of extracting the morphological identity of words to improve the rule-based approach by a sub-word method. However, selecting only the most common suffixes for morphological segmentation of the words has proved to be insufficient, the reason being a lack of knowledge about the other sub-words. Moreover, the sub-word method can segment a word into only two parts, which makes handling compound words having more than two word parts difficult. The performance of the above-mentioned rule-based approaches is nevertheless quite motivating. The knowledge acquisition techniques and the handling of the orthographic rules for the words remain challenging for further improvement of the rule-based approaches for grapheme-to-phoneme conversion. Thus, the proposed rule-based approach of G2P conversion consists of two basic steps, i.e. morphological segmentation and pronunciation generation. In other implementations, morphological segmentation can be performed offline, for example in a user device (as discussed below), and in such a case, the rules, prefix, suffix or morpheme list can be updated via the server 302.
[0061] Morphological segmentation is a process of segmenting a compound word into its valid word parts. For example, the compound word (/srirAmkriSno/) will be segmented into the word-parts (/sri/), (/rAm/) and (/kriSno/). The three word-parts generated by the segmentation process are individually valid words. The morphological segmentation algorithm finds the validity of the word parts and segments accordingly. The steps of the segmentation procedure involve building the dictionary 304 (i.e. a word list containing all the 20,000 collected words) with the most common prefixes and suffixes added to the list. It then involves taking a grapheme string from the test word and searching in the word list to check whether the selected string is a valid word or not. If the string is a valid word, the string is appended to an array. Next, the remaining portion of the word is dealt with: it is selected as the full word and the segmentation process of the proposed sub-word method is repeated on it. When a new word comes, its last part will be checked against the suffixes one by one, from the top to the bottom of the suffix list. On success, the suffix part will be segmented from the word. Further, the pronunciations of both word segments are found. The suffix part's pronunciation will be fetched from the suffix dictionary. For obtaining the pronunciation of the remaining part, known as the main part of the word, the main pronunciation dictionary will be used. Subsequently, both pronunciations are concatenated to find the pronunciation of the target word.
[0062] To summarize, the morphological segmentation technique involves segmenting a compound word, from the predefined set of words, into two or more strings, with each of the two or more strings being one of the commonly known prefixes, the commonly known suffixes and a valid word. Further, the morphological segmentation technique involves determining the pronunciations of each of the two or more strings. Furthermore, the morphological segmentation technique involves concatenating the determined pronunciations of each of the two or more strings of the segmented compound word, to generate the pronunciation of the target compound word.
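By way of a non-limiting illustration, the following Python sketch shows one possible form of the suffix-based segmentation and pronunciation concatenation summarized above, under the assumption that only two-part splits are handled; the names segment_and_pronounce, suffix_pron and main_pron and the sample data are hypothetical.

```python
def segment_and_pronounce(word, suffixes, suffix_pron, main_pron):
    """Minimal sketch of the suffix-based sub-word method described above.

    `suffixes` is an ordered list of known suffixes (checked top to bottom),
    `suffix_pron` maps a suffix to its phone sequence, and `main_pron` is
    the main pronunciation dictionary for the remaining (base) part. All
    names are illustrative; the real system also handles prefixes and
    multi-part compounds.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            base = word[: -len(suffix)]
            if base in main_pron and suffix in suffix_pron:
                # Concatenate base and suffix pronunciations.
                return main_pron[base] + suffix_pron[suffix]
    # No valid segmentation found: fall back to a direct lookup.
    return main_pron.get(word)

# Hypothetical data, loosely following the /srirAmkriSno/ example:
suffixes = ["kriSno", "rAm"]
suffix_pron = {"kriSno": ["k", "r", "i", "S", "n", "o"]}
main_pron = {"srirAm": ["s", "r", "i", "r", "A", "m"]}
print(segment_and_pronounce("srirAmkriSno", suffixes, suffix_pron, main_pron))
```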
[0063] In some implementations, the present server 302 further employs rule-based pronunciation correction. Herein, pronunciation generation is the step where phone sequences are passed through a series of orthographic rules which are based on the factors that influence regional pronunciation. For example, even though Bengali originated from Sanskrit, a study of the Bengali language will indicate that it has more phonetic variation compared to other languages originating from Sanskrit. Therefore, these phonetic variations need to be taken care of in order to build the grapheme-to-phoneme system. It may be appreciated that the Bengali language is only being used herein as a case study and shall not be construed as limiting the present disclosure in any manner.
[0064] In Bengali, each consonant is associated with an inherent vowel (/O/) which is not explicitly represented. The problem that becomes evident while building the Bengali grapheme-to-phoneme system is that this inherent vowel is sometimes pronounced as (/O/), sometimes as (/o/) and sometimes dropped. For example, in (/kOlom/) (which stands for ‘pen’ in the English language), the inherent vowel following /k/ is pronounced as (/O/), the one following /l/ is pronounced as (/o/), and in the word-final position the inherent vowel following /m/ is deleted. In addition, inherent vowel deletion/retention, along with the word morphology, determines the syllable structure as well. Hence, it is important to categorize the irregularity pattern of the inherent vowel.
[0065] In an example, if a consonant with (/O/) is followed by a consonant with japhala or japhala with ligature /a/, then (/O/) will be pronounced as (/o/), such as /kollEn/ (which stands for ‘prosperity’ in the English language). The inherent (/O/) vowel of (/Ya/) will be pronounced as (/o/) if it is preceded by front high vowels, such as /jAtiYo/ (which stands for ‘national’ in the English language). The inherent vowel (/O/) will be pronounced as (/o/) in the word-final position if it is preceded by /h/, such as /niriho/ (which stands for ‘innocent’ in the English language). Also, (/O/) will be pronounced as (/o/) in any position of a word if it is preceded by a consonant cluster or conjugate syllable, such as /rukkhotA/ (which stands for ‘roughness’ in the English language), but this rule has an exception, such as /SikkhOk/ (which stands for ‘teacher’ in the English language). If a consonant with (/O/) is preceded by a consonant with /ri/, then (/O/) will be pronounced as (/o/), such as /krishno/ (which transliterates to ‘krishna’). It is to be understood that the inherent vowel in the first syllable is never deleted. In the case of verbs, if the word ends with /l/, /b/, /ch/ or /t/, then the following inherent vowel will be pronounced as (/o/), such as /korlo/ or /lorto/.
[0066] Similar to the inherent dependent vowel (/O/), the Bengali independent vowel (/O/) also has a tendency to be pronounced sometimes as (/O/) and sometimes as (/o/). For example, if the independent vowel (/O/) is a negative prefix then (/O/) will be pronounced as (/O/), else it will be pronounced as /o/; for instance, in (/Onil/), (/O/) is a negative prefix, but in (/ogni/) it is pronounced as (/o/) as it is not a negative prefix. In prefixes like (/Ona/), (/Opo/), (/Obo/), (/Onto/), (/O/) is pronounced as (/O/), but in prefixes like (/oti/), (/odhi/), (/opi/), (/obhi/), (/O/) is pronounced as (/o/). Apart from this, phonetic variations can be observed in the case of consonants as well. For example, in spontaneous speech, word-final aspiration will be dropped if the final consonant does not have any ligature or inherent dependent (/O/) vowel, such as (/mukh/>/muk/), (/cokh/>/cok/), and the like. If a consonant comes with japhala in the word-medial or word-final position, then that consonant will be reduplicated, such as (/jOghonno/), (/onnotro/), etc., but in the initial position it will be pronounced as (/E/), for example (/bEbSa/). It is important to mention here that if japhala appears after /h/, then it will be pronounced as /jjho/, as in (/Sojjho/). It may also be considered that even though nasalization is purely phonemic in Bengali, in spontaneous speech the nasalized vowel is often denasalized, e.g., (/cA.Md/>/cAd/).
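By way of a non-limiting illustration, the following Python sketch applies two of the phonetic corrections mentioned above (word-final aspiration dropping and denasalization) as ordered regular-expression substitutions over transliterated strings; the real rules are context-sensitive and far more numerous, and the names used here are assumptions.

```python
import re

# An ordered list of (pattern, replacement) rules, applied left to right.
# Only two of the rules mentioned above are sketched as illustrations:
# word-final aspiration dropping (/mukh/ -> /muk/) and denasalization of
# nasalized vowels in spontaneous speech (/cA.Md/ -> /cAd/).
PHONETIC_RULES = [
    (re.compile(r"h$"), ""),   # drop word-final aspiration: mukh -> muk
    (re.compile(r"\.M"), ""),  # denasalize: cA.Md -> cAd
]

def apply_phonetic_rules(transliterated: str) -> str:
    out = transliterated
    for pattern, replacement in PHONETIC_RULES:
        out = pattern.sub(replacement, out)
    return out

assert apply_phonetic_rules("mukh") == "muk"
assert apply_phonetic_rules("cA.Md") == "cAd"
```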
[0067] The present server 302 further employs morphophonemics (also known as morphophonology), which is an interface that studies both morphological and phonological processes simultaneously. The principal focus of this process is to observe the sound changes that take place in morphemes when they combine to form words in spontaneous speech. It considers the accusative marker, in which /ke/ has a tendency to change the word-final voiced consonant to a voiceless one, e.g., (/biplOb/+/ke/) becomes (/biplOpke/), and singular markers, such as /TA/ and /Ti/, which have a tendency to change the word-final voiced consonant to a voiceless one, e.g., (/bAgh/+/TA/) becomes (/bAkTA/).
[0068] Herein, only base words in the present dictionary 304 are being considered according to the pronunciation of a Hindi native speaker, so there is a need to apply some rules for covering dialectical pronunciation, such as standard Bengali language pronunciation. For this purpose, first-level phone-breaks are prepared by the G2P mixed graph. Then, if the schwa /a/ is received in the 1st level dictionary, then /o/ and /O/ have to be inserted; e.g., "manohar" being "m a n o h a r" (1st level dictionary), "manohar" being "m o n o h o r" (as one variation), "manohar" being "m O n o h O r" (as another variation), "manohar" being "m o n o h O r" (as yet another variation) and "manohar" being "m O n o h o r" (as still another variation). Similarly, if /a/ is received in the last syllable, then /A/ is inserted in order to accommodate Bengali dialectical pronunciation; e.g., "teacher" being "T i ch a r" (1st level dictionary), "teacher" being "T i ch A r" (as one variation), "trainer" being "T r ee n a r" (1st level dictionary) and "trainer" being "T r ee n A r" (as one variation). On the other hand, if the Hindi /v/ is received in a non-initial position, then it is replaced with /b/ in the case of Bengali dialectical pronunciation; e.g., "pallavi" being "p a l l a v i" (1st level dictionary) and "pallavi" being "p a l l a b I" (as a variation). Unlike Hindi, if the Bengali /R/ or /Rh/ is received in any position in a word, then it is replaced with /r/; e.g., "tamilnadu" being "tt A m i l n A R u" (1st level dictionary) and "tamilnadu" being "tt A m i l n A r u" (as one variation). Further, if /s/ is received before any vowel or in the word-final position, then it is replaced with /sh/; e.g., "saheli" being "s a h ee l I" (1st level dictionary) and "saheli" being "sh aa h ee l i" (as a variation), and "somenath" being "s o m n A tth" (1st level dictionary) and "somenath" being "sh o m n A tth" (as one variation). Furthermore, if /a/ is received in the phoneme and /u/ in the grapheme, then the /a/ would be replaced with /A/; e.g., "punjab" being "p a n j A b" (1st level dictionary) and "punjab" being "p A n j A b" (as a variation), or "umbrella" being "a m b r ee l A" (1st level dictionary) and "umbrella" being "A m b r ee l A" (as a variation).
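By way of a non-limiting illustration, the following Python sketch generates Bengali dialectal variants from a 1st level pronunciation using a subset of the substitutions above (schwa /a/ to /o/ or /O/, non-initial /v/ to /b/, /R/ or /Rh/ to /r/); the function name and the per-phone option scheme are illustrative assumptions.

```python
import itertools

def bengali_variants(phones):
    """Sketch of generating Bengali dialectal variants from a 1st level
    (Hindi-oriented) phone sequence, following the substitutions above.
    Only a subset of the rules is shown and the function name is
    illustrative.
    """
    # For each phone, list the alternatives it may take in a variant;
    # the original phone is retained so the 1st level entry is also kept.
    options = []
    for i, p in enumerate(phones):
        if p == "a":
            options.append(["a", "o", "O"])
        elif p == "v" and i != 0:
            options.append(["b"])
        elif p in ("R", "Rh"):
            options.append(["r"])
        else:
            options.append([p])
    # Cartesian product over per-phone alternatives gives all variants.
    return [list(v) for v in itertools.product(*options)]

# e.g. bengali_variants(["m","a","n","o","h","a","r"]) yields combinations
# such as ["m","o","n","o","h","o","r"] and ["m","O","n","o","h","O","r"],
# matching the "manohar" examples above.
```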
[0069] Referring back to FIG. 3, as illustrated, the system 300 also includes a user device 306. The user device 306 includes an audio-recording unit 308, as discussed in reference to FIG. 2. The user device 306 also includes a memory 310 having the generated at least one dictionary stored therein. The user device 306 further includes a processor 312. As may be seen, the audio-recording unit 308 and the memory 310 are disposed in communication with the processor 312. Herein, the memory 310 further has stored therein program instructions which, when accessed by the processor 312, cause the processor 312 to perform various functions in accordance with embodiments of the present disclosure. As discussed, in one implementation, the server 302 may be configured to train a model which in turn can be implemented in the user device 306 to generate the said at least one dictionary 304, in consideration of dynamic content received by the user device. Herein, the processor 312 of the user device 306 is caused to generate the at least one dictionary 304 utilizing the model (G2P model) trained by the server 302. This way, the present system 100 would be able to handle dynamic vocabulary in the user device 306 (i.e. dynamic words which may be specific to the user vocabulary, as the user device would be receiving sound signals from the user generally in a regular manner) using in-device user dictionary creation by the G2P model (trained by the server 302) and linguistic rules. In a similar manner, the user device 306 is configured to generate multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words.
[0070] In particular, the processor 312 is caused to perform the steps of a method for speech recognition. FIG. 5 illustrates a flowchart 500 depicting the steps involved in a method for speech recognition, in accordance with one or more exemplary embodiments of the present disclosure. It may be appreciated that the steps as depicted are non-exclusive and shall not be construed as limiting the present disclosure.
[0071] At step 502, the server 302 is caused to train a model to generate at least one dictionary 304 containing dialectically variated pronunciations of a predefined set of words, as discussed in detail in the preceding paragraphs. Also, as discussed, in one implementation, the server 302 may be configured to train a model which in turn can be implemented in the user device 306 to generate the said at least one dictionary 304, in consideration of dynamic content received by the user device.
[0072] At step 504, the processor 312 is caused to record user speech over a period of time. The processor 312 records the user speech via the audio-recording unit 308. The recorded speech may be stored in the memory 310 of the user device 306, either temporarily or permanently.
[0073] At step 506, the processor 312 is caused to analyse the user speech to identify dialectically variated phonemes characteristically pronounced by a user of the user device 306. Herein, the user may have a unique profile in the user device 306. The user profile may be in the form of a user account or the like, with or without the user's login credentials. For analysis of the user speech, the processor 312 utilises the generated at least one dictionary 304 containing dialectically variated pronunciations of a predefined set of words, as discussed in the preceding paragraphs. It may be appreciated that the various users of the user devices (like the user device 306) may each have a unique pronunciation, with dialectically variated phonemes, for at least some of the words in the dictionary 304. The processor 312 is caused to identify the dialectically variated phonemes characteristically pronounced by each of the users.
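By way of a non-limiting illustration, the following Python sketch accumulates a per-user profile of dialectically variated phonemes by counting the phones of the dictionary variants that best matched the user's speech; the class UserPhonemeProfile and its methods are hypothetical and not the disclosed implementation.

```python
from collections import Counter

class UserPhonemeProfile:
    """Illustrative sketch: accumulate counts of dialectally variated
    phonemes observed in the pronunciation variants that best matched the
    user's speech. Names are assumptions, not the disclosed implementation.
    """
    def __init__(self):
        self.counts = Counter()

    def observe(self, matched_phones):
        # `matched_phones` is the phone sequence of the dictionary variant
        # selected by the recognizer for an utterance of a known word.
        self.counts.update(matched_phones)

    def prefers(self, phone, over):
        # True if the user characteristically uses `phone` rather than `over`.
        return self.counts[phone] > self.counts[over]

profile = UserPhonemeProfile()
profile.observe(["sh", "o", "m", "n", "A", "tth"])  # user said "shomenath"
profile.observe(["sh", "aa", "h", "ee", "l", "i"])  # user said "shaheli"
print(profile.prefers("sh", "s"))  # True: the user realizes /s/ as /sh/
```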
[0074] At step 508, the processor 312 is caused to determine dialectically variated pronunciations, from the at least one dictionary, corresponding to the identified dialectically variated phonemes for one or more of the predefined set of words. For this purpose, the processor 312 compares the identified dialectically variated phonemes for the one or more of the predefined set of words with phonemes making up each of the dialectically variated pronunciations of the corresponding one or more of the predefined set of words contained in the at least one dictionary 304. The procedure for and the steps involved in such comparison may be appreciated by a person skilled in the art and thus have not been explained herein for the brevity of the present disclosure.
[0075] At step 510, the processor 312 is caused to delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the identified dialectically variated phonemes for the one or more of the predefined set of words from the at least one dictionary 304 to provide at least one optimized dictionary, stored in the memory 310 of the user device 306. That is, all other dialectical variations of the identified words, other than the identified pronunciation(s) corresponding to the user profile, may be deleted. The resultant optimized dictionary, after deletion of the dialectical variations of the identified words, is stored in the memory 310 of the user device 306.
[0076] To summarize, for deleting non-functional word pronunciations, if any dialectically variated phoneme is determined to be used by the user, such a phoneme may be referred to in the dictionary 304, the other variations of that phoneme not used by the user are identified, and all words containing such other forms of that particular phoneme are removed from the dictionary 304. Further, if the user uses some particular patterns (rules), e.g., "s" instead of "sh" at the beginning of a word or in another position (detected from which variation is output by the speech recognition), then other words with the same pattern are retained and the rest of the variations are removed from the dictionary 304.
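By way of a non-limiting illustration, the following Python sketch prunes pronunciation variants that contain phonemes the user is determined not to use, while retaining at least one variant per word; the function name, data layout and retention policy are illustrative assumptions.

```python
def optimize_dictionary(dictionary, unused_phones):
    """Minimal sketch of the pruning step described above.

    `dictionary` maps each word to a list of pronunciation variants (phone
    lists). Variants containing any phoneme the user is determined not to
    use are removed; at least one variant per word is always retained.
    Names and the exact retention policy are illustrative assumptions.
    """
    optimized = {}
    for word, variants in dictionary.items():
        kept = [v for v in variants if not (set(v) & set(unused_phones))]
        # Never leave a word without a pronunciation.
        optimized[word] = kept or variants[:1]
    return optimized

dictionary = {
    "saheli": [["s", "a", "h", "ee", "l", "I"],
               ["sh", "aa", "h", "ee", "l", "i"]],
}
# Suppose the user realizes /s/ as /sh/, so /s/ variants are unused.
print(optimize_dictionary(dictionary, unused_phones={"s"}))
# -> {'saheli': [['sh', 'aa', 'h', 'ee', 'l', 'i']]}
```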
[0077] In some implementations, where the server 302 generates multiple dictionaries, with each dictionary containing one type of the dialectical variations of the predefined set of words, the processor 312 is caused to delete other of the multiple dictionaries than the one or more of the multiple dictionaries associated with the determined dialectically variated pronunciations for the one or more of the predefined set of words. For example, two dictionaries may be generated catering to two different dialects, one for the Bengali language and the other for the Punjabi language, and stored in the memory 310 of the user device 306. In such a case, if the user is identified to predominantly use dialectically variated phonemes corresponding to the Bengali language over a period of time, then the dictionary with dialectically variated pronunciations for the Punjabi language may be deleted from the memory 310 of the user device 306.
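By way of a non-limiting illustration, the following Python sketch keeps only the dictionary corresponding to the dialect whose variated phonemes the user is observed to use predominantly and drops the others; the names and the simple majority criterion are assumptions.

```python
def select_dialect_dictionaries(dialect_usage, dictionaries):
    """Sketch of whole-dictionary pruning: keep only the dictionary whose
    dialect the user predominantly uses and drop the rest. `dialect_usage`
    maps a dialect name to how often its variated phonemes were observed;
    names are illustrative.
    """
    dominant = max(dialect_usage, key=dialect_usage.get)
    return {name: d for name, d in dictionaries.items() if name == dominant}

dictionaries = {
    "bengali": {"saheli": [["sh", "aa", "h", "ee", "l", "i"]]},
    "punjabi": {"saheli": [["s", "a", "h", "ee", "l", "I"]]},
}
print(select_dialect_dictionaries({"bengali": 42, "punjabi": 3}, dictionaries))
# Only the Bengali dictionary is retained; the Punjabi one would be
# deleted from the device memory.
```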
[0078] At step 512, the processor 312 is caused to implement the at least one optimized dictionary for speech recognition on the user device 306. That is, for a particular user profile, the corresponding optimized dictionary as generated in step 510 may further be implemented for processing of all speech signals related to that user profile for speech recognition purposes or the like.
[0079] In particular implementations, the processor 312 is caused to recognize user speech pattern based on the analysis of the user speech, as received via the audio-recording unit 308. The processor 312 is further caused to determine dialectically variated pronunciations, from the at least one dictionary 304, corresponding to the recognized user speech pattern for the one or more of the predefined set of words. The processor 312 is further caused to delete other of the dialectically variated pronunciations than the determined dialectically variated pronunciations corresponding to the recognized user speech pattern for the one or more of the predefined set of words from the at least one dictionary 304 to provide at least one further optimized dictionary, stored in the memory 310 of the user device 306. The processor 312 is further caused to implement the at least one further optimized dictionary, as stored in the memory 310, for speech recognition on the user device 306.
[0080] Detection of OOV words (e.g., named entities, song names) is a very well-known problem in the domain of speech recognition. One of the solutions to this problem is to use an automated G2P generation module to produce an accurate pronunciation dictionary, which is later utilized for speech recognition, especially for the offline scenario. However, producing such a pronunciation dictionary with the required accuracy is a problem in itself. The presently proposed hybrid approach for improving the performance of grapheme-to-phoneme (G2P) conversion, using rules over an attention based sequence to sequence deep neural network (Seq2Seq DNN) model, helps to produce more accurate and faster speech recognition output for the offline scenario. With this approach, results show that the attention based Seq2Seq model achieves more accuracy than existing methods of G2P conversion, and applying rules with the attention based model further improves the G2P conversion accuracy. It has been observed that by using the attention based Seq2Seq model, the Word Error Rate (WER) reduces by at least 6% as compared with the previously reported G2P conversion method, and with further rectification by applying rules, the WER reduces to 2.88%.
[0081] Using this method, the present server 302 may generate multiple base dictionaries for the speech recognition decoding process. For example, the different types of G2P generated dictionaries are a Hinglish dictionary, a Hindi dictionary and an English dictionary (trained on some English names and songs). As discussed, dialectical variation and correction are performed on the Seq2Seq G2P using rules. Initially, those dictionaries are treated as a single dictionary for a user. The initial dictionary is large and contains all variations, leading to a somewhat less accurate and more time-consuming speech recognition system. However, after the speech recognition system is used, based on the speech input and the recognised text, the processor 312 deletes non-functional word pronunciations from the dictionary to provide the optimized dictionary, based on the user profile in the user device 306. Such an optimized dictionary can advantageously be implemented for speech recognition. The present system 100 keeps on learning, and the size of the pronunciation dictionary keeps shrinking based on the user profile; thus the system becomes more accurate and faster for speech recognition as the search space becomes smaller. It is to be noted that the optimization of the dictionary according to the user profile takes place offline in the user device. Therefore, the system and method of the present disclosure provide means for less memory utilization and more accurate ASR using the optimized dictionary, thereby resulting in fast and accurate offline speech recognition.
[0082] The present system and method can beneficially be implemented as an alternate pronunciation generation method based on linguistic findings. The present system and method can also be implemented to provide an offline learnability feature for OOV words.
[0083] The foregoing descriptions of specific embodiments of the present disclosure have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the present disclosure to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiment was chosen and described in order to best explain the principles of the present disclosure and its practical application, to thereby enable others skilled in the art to best utilize the present disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Documents

Application Documents

# Name Date
1 201931054843-FORM FOR STARTUP [31-12-2019(online)].pdf 2019-12-31
2 201931054843-FORM FOR SMALL ENTITY(FORM-28) [31-12-2019(online)].pdf 2019-12-31
3 201931054843-FORM 1 [31-12-2019(online)].pdf 2019-12-31
4 201931054843-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [31-12-2019(online)].pdf 2019-12-31
5 201931054843-DRAWINGS [31-12-2019(online)].pdf 2019-12-31
6 201931054843-DECLARATION OF INVENTORSHIP (FORM 5) [31-12-2019(online)].pdf 2019-12-31
7 201931054843-COMPLETE SPECIFICATION [31-12-2019(online)].pdf 2019-12-31
8 201931054843-Proof of Right [16-03-2020(online)].pdf 2020-03-16
9 201931054843-FORM-26 [16-03-2020(online)].pdf 2020-03-16
10 201931054843-FORM 18 [02-05-2022(online)].pdf 2022-05-02
11 201931054843-FER.pdf 2022-10-31
12 201931054843-Response to office action [11-04-2023(online)].pdf 2023-04-11
13 201931054843-FER_SER_REPLY [11-04-2023(online)].pdf 2023-04-11
14 201931054843-DRAWING [11-04-2023(online)].pdf 2023-04-11
15 201931054843-CORRESPONDENCE [11-04-2023(online)].pdf 2023-04-11
16 201931054843-COMPLETE SPECIFICATION [11-04-2023(online)].pdf 2023-04-11
17 201931054843-CLAIMS [11-04-2023(online)].pdf 2023-04-11
18 201931054843-ABSTRACT [11-04-2023(online)].pdf 2023-04-11

Search Strategy

1 Search_843AE_29-06-2024.pdf
2 SearchE_31-10-2022.pdf