Abstract: Systems and methods for non-contact voice-based information filtering and generation. Traditional systems and methods provide for automated or voice-based data entry based upon dictation, but as a vocabulary grows, an utterance of a word from the vocabulary may be misrecognized as corresponding to another similar sounding word from the vocabulary. Embodiments of the present disclosure provide for non-contact voice-based information filtering and generation by extracting a closest matching subset of words by obtaining multiple subsets of closely matching words from an application engine, a conversion engine and historical decoded words, and performing a comparison of the multiple subsets of words by assigning weightages based upon a set of decoded values.
Claims:1. A method for non-contact voice-based information filtering and generation, the method comprising processor implemented steps of:
communicating, by one or more hardware processors, a set of images to a speech recognition device, wherein the set of images comprise text;
converting, by the one or more hardware processors, the set of images to one or more decoded files based upon a first subset and a second subset of words for non-contact voice-based information filtering and generation, wherein the first subset and the second subset of words comprise text matching with the set of images; and
comparing, by the one or more hardware processors, the first subset and the second subset of words and a set of source words to generate a set of decoded values for non-contact voice-based information filtering and generation.
2. The processor implemented method of claim 1, wherein the step of converting the set of images to the one or more decoded files comprises:
(i) generating, by the one or more hardware processors, the set of decoded values based upon the one or more decoded files for identifying one or more matching words with the set of images as text to be input; and
(ii) assigning, by the one or more hardware processors, a set of weightages based upon the set of decoded values for non-contact voice-based information filtering and generation.
3. The processor implemented method of claim 1, wherein the step of comparing the first subset and the second subset of words and the set of source words is preceded by:
(i) extracting, by the one or more hardware processors, the first subset and the second subset of words based upon an input data, wherein the input data comprises text to be converted to one or more words matching with the set of images;
(ii) determining, by the one or more hardware processors, the set of source words based upon historical decoded data for identifying the one or more matching words, and wherein the set of source words comprises one or more sets of words matching with the set of images.
4. A system comprising:
a memory storing instructions;
one or more communication interfaces; and
one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to:
communicate, by one or more hardware processors, a set of images to a speech recognition device, wherein the set of images comprise text;
convert, by the one or more hardware processors, the set of images to one or more decoded files based upon a first subset and a second subset of words for non-contact voice-based information filtering and generation, wherein the first subset and the second subset of words comprise text matching with the set of images; and
compare, by the one or more hardware processors, the first subset and the second subset of words and a set of source words to generate a set of decoded values for non-contact voice-based information filtering and generation.
5. The system of claim 4, wherein the one or more hardware processors are further configured to:
generate, by the one or more hardware processors, the set of decoded values based upon the one or more decoded files for identifying one or more matching words with the set of images as text to be input; and
assign, by the one or more hardware processors, a set of weightages based upon the set of decoded values for non-contact voice-based information filtering and generation.
6. The system of claim 4, wherein the one or more hardware processors are further configured to compare the first subset and the second subset of words and the set of source words by:
(i) extracting, by the one or more hardware processors, the first subset and the second subset of words based upon an input data, wherein the input data comprises text to be converted to one or more words matching with the set of images; and
(ii) determining, by the one or more hardware processors, the set of source words based upon historical decoded data to identify the one or more matching words, and wherein the set of source words comprises one or more sets of words matching with the set of images.
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
SYSTEMS AND METHODS FOR NON-CONTACT VOICE-BASED INFORMATION FILTERING AND GENERATION
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[0001] The present disclosure generally relates to non-contact voice-based information filtering and generation. More particularly, the present disclosure relates to systems and methods for non-contact voice-based information filtering and generation.
BACKGROUND
[0002] In the digital and big data era, it is extremely important to maintain the highest level of accuracy in data entry, as this data is used by organizations to make key business decisions. However, the job of data entry specialists is certainly not easy, as they handle enormous amounts of data on a day-to-day basis. Besides, they are under constant pressure to complete their job quickly, which increases the chances of inaccurate data entries. Even with automation, ensuring accurate data entry is still a big challenge. Inaccurate data entry is one of the most expensive mistakes that businesses frequently make today. With the advent of big data, the need for maintaining data entry accuracy has increased. Big data may involve terabytes, petabytes or even exabytes of data captured over time. Such voluminous data can come from myriad sources, such as business sales records, the collected results of scientific experiments, or real-time sensors used in the Internet of Things. Data may be raw or preprocessed using separate software tools before analytics are applied.
[0003] Data may also exist in a wide variety of file types, including structured data, such as database stores, unstructured data, such as document files, or streaming data from sensors. Further, big data may involve multiple, simultaneous data sources, which may not otherwise be integrated. For example, a big data analytics project may attempt to gauge a product's success and future sales by correlating past sales data, return data and online buyer review data for that product. Thus, data entry has to be automated with the highest level of accuracy maintained, as manual data entry suffers from multiple drawbacks such as the need to hire specialized staff, double-check manually entered data, leverage multiple software tools, and so on. The traditional systems and methods provide for a dictation based non-manual data entry where a complete sentence or a few lines comprising multiple words are read or pronounced by a user, and the software compares the multiple words dictated or pronounced at a time with the set of words stored in a pre-built dictionary and then comes up with one or more closest matching words. However, this leads to inaccuracy as the multiple dictated or pronounced words are compared with the multiple stored words. Further, the dictionary and probability model correspond to the hypertext elements that are in the hypertext document which is being accessed by the user. Accordingly, the dictionary and probability models are discarded and the next dictionary and probability models are obtained as the next hypertext document is accessed. Storage of recent or important dictionaries and probability models is also provided. However, voice recognition in such a method is mainly restricted to the hyperlinks used in the hypertext document being accessed by the user. Other hyperlinks which are not visible on the hypertext document being accessed cannot be recognized.
[0004] Some traditional systems and methods employ speech-to-text conversion techniques. However, when existing speech recognition techniques are applied, the recording device captures a lot of background noise in the speech. This results in loss of words and/or misinterpretation of words, thereby causing an overall decline in the accuracy and reliability of speech recognition. This lack of accuracy and reliability of existing speech recognition techniques in turn hampers the reliability of the applications employing the speech recognition techniques.
SUMMARY
[0005] The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below.
[0006] Systems and methods of the present disclosure enable non-contact voice-based information filtering and generation. In an embodiment of the present disclosure, there is provided a method for non-contact voice-based information filtering and generation, the method comprising: communicating, by one or more hardware processors, a set of images to a speech recognition device, wherein the set of images comprise text; converting, by the one or more hardware processors, the set of images to one or more decoded files based upon a first subset and a second subset of words for non-contact voice-based information filtering and generation, wherein the first subset and the second subset of words comprise text matching with the set of images; comparing, by the one or more hardware processors, the first subset and the second subset of words and a set of source words to generate a set of decoded values for non-contact voice-based information filtering and generation; generating, by the one or more hardware processors, the set of decoded values based upon the one or more decoded files for identifying one or more matching words with the set of images as text to be input; assigning, by the one or more hardware processors, a set of weightages based upon the set of decoded values for non-contact voice-based information filtering and generation; extracting, by the one or more hardware processors, the first subset and the second subset of words based upon an input data, wherein the input data comprises text to be converted to one or more words matching with the set of images; and determining, by the one or more hardware processors, the set of source words based upon historical decoded data for identifying the one or more matching words, and wherein the set of source words comprises one or more sets of words matching with the set of images.
[0007] In an embodiment of the present disclosure, there is provided a system for non-contact voice-based information filtering and generation, the system comprising one or more processors; one or more data storage devices operatively coupled to the one or more processors and configured to store instructions configured for execution by the one or more processors to: communicate, by one or more hardware processors, a set of images to a speech recognition device, wherein the set of images comprise text; convert, by the one or more hardware processors, the set of images to one or more decoded files based upon a first subset and a second subset of words for non-contact voice-based information filtering and generation, wherein the first subset and the second subset of words comprise text matching with the set of images; compare, by the one or more hardware processors, the first subset and the second subset of words and a set of source words to generate a set of decoded values for non-contact voice-based information filtering and generation; generate, by the one or more hardware processors, the set of decoded values based upon the one or more decoded files for identifying one or more matching words with the set of images as text to be input; assign, by the one or more hardware processors, a set of weightages based upon the set of decoded values for non-contact voice-based information filtering and generation; and compare the first subset and the second subset of words and the set of source words by: (i) extracting, by the one or more hardware processors, the first subset and the second subset of words based upon an input data, wherein the input data comprises text to be converted to one or more words matching with the set of images; and (ii) determining, by the one or more hardware processors, the set of source words based upon historical decoded data to identify the one or more matching words, and wherein the set of source words comprises one or more sets of words matching with the set of images.
[0008] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
[0010] Fig. 1 illustrates a block diagram of a system for non-contact voice-based information filtering and generation according to an embodiment of the present disclosure;
[0011] Fig. 2 is an architecture illustrating the components of a system for non-contact voice-based information filtering and generation according to an embodiment of the present disclosure; and
[0012] Fig. 3 is a flowchart illustrating the steps involved in non-contact voice-based information filtering and generation according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0013] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0014] The embodiments of the present disclosure provide systems and methods for non-contact voice-based information filtering and generation. Automated or voice based data entry is becoming increasingly popular in the digital, robotics and big data era due to the very large volume of complex data to be entered and processed accurately, as manual data entry suffers from multiple drawbacks such as the need to hire specialized staff, double-check manually entered data, leverage multiple software tools, and so on. Some traditional systems and methods employ handwriting recognition software or merely an optical code reader; however, such software suffers from inaccurate word recognition, and the user is required to write on the screen using a specialized set of characters or in a manner that is easily recognizable by the program. Automated data entry may comprise, inter alia, a computer-implemented method of effecting communication between a user and an artificial intelligence software agent (executed on a computer) by receiving, at a speech recognition system, a voice input from a user; identifying, by the speech recognition system, individual words in the voice input during speech intervals thereof; storing the identified words in memory; and generating, by the software agent, based on the words stored in the memory, an audio response for outputting to the user. The traditional systems and methods provide for voice based dictation entry; however, as a vocabulary grows, the number of words that are similar in sound also tends to grow. As a result, there is an increased likelihood that an utterance of a given word from the vocabulary will be misrecognized as corresponding to another similar sounding word from the vocabulary.
[0015] Referring now to the drawings, and more particularly to FIGS. 1 through FIGS. 3, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0016] FIG. 1 illustrates an exemplary block diagram of a system 100 for non-contact voice-based information filtering and generation. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[0017] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[0018] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0019] According to an embodiment of the present disclosure, referring to FIG. 2, the architecture and components of the system for non-contact voice-based information filtering and generation may now be understood in detail. A client application 201 may comprise one or more software programs that may integrate with the processing capabilities of another program and may help in creating one or more applications (for example, Windows based applications) and one or more application services (for example, a profile application service). The client application 201 communicates with one or more servers 202 (referred to as servers in the figure) through a hypertext transfer protocol (HTTP). Communication between the client application 201 and the one or more servers 202 is a bidirectional communication between two computing entities, where a client (not to be used interchangeably with the client application 201) initiates a transaction by opening a network channel to the one or more servers 202. Typically, the client sends a request or set of requests via a set of networking protocols over that network channel, and the request or requests are processed by the one or more servers 202, returning responses. The one or more servers 202 may comprise an application engine 204 and a conversion engine 206. The one or more servers 202 may, apart from communicating with the client application 201, communicate or interact with other devices like an interface or a printer. For example, a user may authorize the server 202 to access data stored on a social networking site. Similarly, as an example, the one or more servers 202 may be configured to present an audio conference and a conference interface to client device(s) via the client (e.g., a browser, one or more browser plug-ins, and/or a special-purpose client). The application engine 204 employs various devices (for example, an optical code reader) and helps in extracting, from one or more images, a set of filtered words close to an output to be generated for voice-based information filtering and generation. The conversion engine 206 helps in extracting another subset of words closely matching the output to be generated for voice-based information filtering and generation. The conversion engine 206 may also comprise logic configured to extract words from the speech signal.
[0020] FIG. 3, with reference to FIGS. 1 and 2, illustrates an exemplary flow diagram of a method for non-contact voice-based information filtering and generation. In an embodiment, the system 100 comprises one or more data storage devices of the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1 and the flow diagram. In the embodiments of the present disclosure, the hardware processors 104, when configured with the instructions, perform one or more methodologies described herein.
[0021] In an embodiment of the present disclosure, at step 301, the one or more hardware processors 104 communicate a set of images to a speech recognition device, wherein the set of images comprise text. The text may comprise a user's hand-written text or any text typed by the user on a piece of paper (for example, words in English, Hindi or any other language, or anything in the form of characters, symbols, graphs, drawings, calligraphy and similar hand-written information defined by typing or a hand movement of the user). It may be noted that the hand-written information is to be input to the system 100 for filtering and generation without any information simultaneously being “written” with ink or some colorant from a pen tip on a writing surface.
[0022] The user reads the set of images, and the speech recognition device (not shown in the figure), for example a microphone connected with the system 100, records the audio (that is, the spoken text from the set of images). The speech recognition device may comprise one or more devices that combine speech recognition technology with wireless (or power line) networking technology to provide hands-free voice control and recording in appliances, electronic equipment or machines. The speech recognition device may comprise embedded speech recognition software and circuitry capable of quickly recognizing spoken speech commands from an expected vocabulary or flexible grammar. After recording, the speech recognition device stores the audio in a local machine (not shown in the figure), for example a local host or a local domain, of the system 100 as data.
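By way of a non-limiting illustration only, the recording step may be sketched in Python as below, assuming the third-party sounddevice and soundfile packages are available; the sample rate, recording duration and file name are purely illustrative:
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000        # 16 kHz mono is a common input rate for speech recognisers
DURATION_SECONDS = 5       # length of the utterance captured in this sketch

def record_utterance(path="spoken_text.wav"):
    # Capture a fixed-length mono recording from the default input device.
    audio = sd.rec(int(DURATION_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()                           # block until the recording is complete
    sf.write(path, audio, SAMPLE_RATE)  # store the audio on the local machine
    return path

if __name__ == "__main__":
    print("Recorded audio stored at", record_utterance())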
[0023] According to an embodiment of the present disclosure, the client application 201 (for example, a Representational State Transfer (REST) or a node.js client application) transfers the data (comprising the recorded audio) obtained above from the local machine to the one or more servers 202 (for example, a web server or a hyper-text transfer protocol server) for processing. According to an embodiment, the client application 201 may comprise one or more elements (for example, an application-client). The client application 201 may use the framework provided by an underlying client to access resources provided by an application server (for example, a WebSphere™). It may be noted that, according to an embodiment, the client and the one or more servers 202 may communicate over the system 100 through separate hardware, but both the client and the one or more servers 202 may reside in the same system 100. The one or more servers 202 run one or more server programs which share their resources with the clients. The client does not share any of its resources, but requests content or a service function from the one or more servers 202. The client therefore may initiate communication sessions with the one or more servers 202, which await incoming requests. Further, the client and the one or more servers 202 may exchange messages in a request–response messaging pattern: the client sends a request, and the one or more servers 202 return a response.
[0024] The language and rules of communication may be defined in a communications protocol. All client-server protocols may operate in an application layer. The application layer protocol defines the basic patterns of the dialogue. To formalize the data exchange even further, the server may implement an application programming interface (API). The API may be an abstraction layer for accessing a service. By restricting communication to a specific content format, it facilitates parsing. By abstracting access, it facilitates cross-platform data exchange.
[0025] According to an embodiment of the present disclosure, the data may be transferred to the one or more servers 202, inter alia, in blocks of data, using the encoding specified while opening the session between the client and the one or more servers 202. The data may be encoded, inter alia, using any codec recognized by a multimedia framework (for example, GStreamer™ or FFmpeg™). For example, the recorded audio may be encoded using JavaScript (JS) libraries. After transferring the last block of the data, the client may send a message to the one or more servers 202 through any method (for example, by a 3 or 4-byte encoding string) indicating that no more data is getting transferred.
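By way of a non-limiting illustration only, the block-wise transfer of the data to the one or more servers 202 may be sketched in Python as below; the endpoint URL and block size are hypothetical, and the third-party requests package is assumed to be installed:
import requests

BLOCK_SIZE = 8192                              # bytes sent per block (illustrative)
SERVER_URL = "http://localhost:8080/decode"    # hypothetical endpoint of the one or more servers 202

def stream_audio(path):
    # Yield the recorded audio block by block; exhausting the generator
    # signals to the server that no more data is being transferred.
    def blocks():
        with open(path, "rb") as audio_file:
            while True:
                chunk = audio_file.read(BLOCK_SIZE)
                if not chunk:
                    break
                yield chunk
    # A generator body is sent by requests using chunked transfer encoding.
    response = requests.post(SERVER_URL, data=blocks(),
                             headers={"Content-Type": "audio/wav"})
    return response.text

if __name__ == "__main__":
    print(stream_audio("spoken_text.wav"))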
[0026] According to an embodiment of the present disclosure, the one or more hardware processors 104, using the one or more servers 202, process the data transferred by the client application 201 for converting the set of images to one or more decoded files. The one or more decoded files are obtained for extracting subsets of words for identifying one or more closest matching words with the set of images. The one or more servers 202 may convert the set of images to the one or more decoded files using any decoder (for example, Kaldi's™ kaldinnet2onlinedecoder, a Hypertext Transfer Protocol (HTTP) based Application Programming Interface (API), or a Java audio decoder). For example, using the Java audio decoder, the one or more servers 202 may perform decoding as:
java AudioDecoder XYZencodedfile ABCpcmfile, where XYZencodedfile is the name of the encoded input file and ABCpcmfile is the name of the output file in Pulse-code Modulation (PCM) format.
[0027] According to an embodiment of the present disclosure, at step 302, the one or more hardware processors 104 convert the set of images to the one or more decoded files based upon a first subset and a second subset of words for non-contact voice-based information filtering and generation, wherein the first subset and the second subset of words comprise text matching with the set of images. It may be noted that the word “matching” may be understood to mean “closely matching” throughout the disclosure. The method of extracting the first subset and the second subset of words may now be considered in detail. The one or more hardware processors 104 obtain, using an application engine 204, the first subset of words, that is, a filtered subset of words extracted from the set of images for converting the set of images to the one or more decoded files for non-contact voice-based information filtering and generation. The first subset of words is extracted based upon an input data, wherein the input data comprises text to be converted to word/s matching with the set of images. The application engine 204 may use an optical character recognition (OCR) software (for example, Adobe Acrobat®) for analyzing the text in the one or more images and then converting the text analyzed from the one or more images into a machine-readable / editable text. The OCR may first use one or more embedded software modules to remove artifacts and blurs to make scanning as accurate as possible. The OCR may then scan the set of images and attempt to determine whether black and white dots represent any letter or number. The OCR scans commonly-occurring symbols in the set of images and breaks them down into their basic shapes, picking out the letters it has high confidence about. The OCR may further use one or more algorithms like K-nearest neighbors or a scripting language (for example, a shell script or python script) for performing feature extraction while obtaining the first subset of words for extracting the filtered subset of words.
[0028] According to an embodiment, for extracting the first subset of words using the application engine 204, a file or a document containing the set of images (comprising mostly the hand-written text) saved in the system 100 may first be uploaded by selecting an appropriate option from the OCR (like upload file), and then an appropriate option may be selected for extracting the first subset of words using the OCR. For example, the first subset of words extracted from the set of images may be:
“FIRST RANKED PERSON TO BE SELECTED”
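By way of a non-limiting illustration only, the extraction of the first subset of words by the application engine 204 may be sketched in Python as below, assuming the pytesseract wrapper and the Tesseract OCR engine are installed; the image file name is illustrative:
from PIL import Image
import pytesseract

def extract_first_subset(image_path):
    # Run OCR on the scanned hand-written/typed image and split the
    # recognised text into the first subset of words.
    text = pytesseract.image_to_string(Image.open(image_path))
    return text.split()

if __name__ == "__main__":
    print(extract_first_subset("handwritten_form.png"))
    # e.g. ['FIRST', 'RANKED', 'PERSON', 'TO', 'BE', 'SELECTED']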
[0029] According to an embodiment of the present disclosure, at step 302, the one or more hardware processors 104 obtain, using a conversion engine 206, the second subset of words for extracting a matching set of words with the set of images, wherein the second subset of words comprises text for identifying one or more closest matching words with the set of images as text to be input. According to an embodiment, the second subset of words is obtained from the data (comprising the recorded audio) obtained in step 301 for extracting the matching set of words with the set of images. The one or more hardware processors 104 use the conversion engine 206 (for example, Kaldi™ or IBM Watson®) for extracting the set of words matching with the set of images.
[0030] The extraction of the second subset of words using the conversion engine 206 may comprise a series of steps, and the embodiments of the present disclosure support a plurality of conversion engines, methods, tools, cloud platforms, scripting languages, file formats, etc. (or any combinations thereof) for extracting the second subset of words. For example, according to an embodiment, the extraction may involve: firstly, unzipping one or more .mp3 or .wav files using any programming language (for example, Python); transcoding or transcribing the unzipped files to an application programming interface (API) compatible format; uploading the files to a cloud platform for access by the API, either manually or programmatically; making the API call using the API's client library by sending an asynchronous request and polling the API until it completes the transcription; and finally generating and analyzing the text generated. For example, if the data (comprising the recorded audio) which was given as an input to the conversion engine 206 was:
“FIRST RANKED PERSON TO BE SELECTED”
The second subset of words may be generated (along with the other information) as below using the conversion engine 206:
You said: “FIRST RANKED PERSON TO BE SELECTED”
Total number of words – 6
Total number of characters – 29
Space count – 5
It may be noted that the scope of the present disclosure is not restricted to generating the second subset of words using the conversion engine 206 only, and may support a plurality of other methods, such as writing code in any programming language (like C# or Java) for generating the second subset of words.
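By way of a non-limiting illustration only, the extraction of the second subset of words and the summary fields shown above may be sketched in Python as below, using the third-party SpeechRecognition package as a stand-in for conversion engines such as Kaldi™ or IBM Watson®; the audio file name is illustrative:
import speech_recognition as sr

def extract_second_subset(audio_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)        # read the complete recording
    return recognizer.recognize_google(audio)    # transcription call to a cloud service

if __name__ == "__main__":
    text = extract_second_subset("spoken_text.wav")
    print('You said: "%s"' % text)
    print("Total number of words -", len(text.split()))
    print("Total number of characters -", len(text.replace(" ", "")))
    print("Space count -", text.count(" "))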
[0031] According to an embodiment of the present disclosure, at step 303, the one or more hardware processors 104 compare the first subset and the second subset of words and a set of source words to generate a set of decoded values for non-contact voice-based information filtering and generation. According to an embodiment, comparing the first subset and the second subset of words and the set of source words is preceded by determining the set of source words (including their alternatives or synonyms, if any) based upon historical decoded data for identifying the one or more closest matching words, and wherein the set of source words comprises one or more words synonymous with the set of images. The historical decoded data may comprise any kind of information obtained, extracted or produced from any previously decoded elements by any means whatsoever (for example, using the conversion engine 206 or through any programming language code). The historical (previously) decoded elements may comprise (but are not limited to) previously decoded words, symbols, glyphs, characters and numerals which were hand-written or typed on any machine and which are capable of being encoded or decoded by any means whatsoever.
[0032] Further, the historical decoded data may also comprise any kind of information obtained, extracted or produced from one or more images using the OCR or any other means. It may be noted that the previous decoding means may also comprise using, at least to a limited degree, one or more shape analysis techniques or methods (for example, the Fourier transform) to provide some additional information which may be useful under certain circumstances, thus augmenting the decoding process.
[0033] According to an embodiment of the present disclosure, the one or more hardware processors 104 may, through the one or more servers 202, extract the set of source words or sentences and their alternatives for performing a comparison (discussed later) between the first and the second subset of words and the set of source words for identifying the one or more closest matching words with the set of images. For example, if ‘building name’ is an entity (the one or more images to be decoded) and if ‘street name’ is the previously decoded element (decoded using the conversion engine 206 or by any other method) to obtain ‘street 1’, then all possible buildings on the street form the set of source words and alternatives. For example, the output, i.e. the set of source words that the one or more servers 202 may consider for performing the comparison, may be:
Mr. XYZ, WALTER HOUSE, STREET NO-1, BUILDING – NATIONAL RESIDENCY;
Mr. ABC, AMBROSE HUT, STREET-1, INTERNATIONAL APARTMENTS
Obtaining the set of source words helps in improving accuracy while identifying the one or more closest matching words with the set of images, as it considers the list of words or sentences, including their alternatives, if any; because all the previously decoded elements are considered by the one or more servers 202, no scope is left for unidentified or unconsidered words.
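By way of a non-limiting illustration only, the determination of the set of source words from the historical decoded data may be sketched in Python as below; the lookup table is purely illustrative and would, in practice, be built from the previously decoded elements stored by the system 100:
HISTORICAL_DECODED = {
    # previously decoded street  ->  buildings historically seen on that street
    "STREET-1": ["WALTER HOUSE, NATIONAL RESIDENCY",
                 "AMBROSE HUT, INTERNATIONAL APARTMENTS"],
}

def source_words_for(decoded_element):
    # Every building on the previously decoded street becomes a source word,
    # so no historically seen alternative is left unconsidered.
    return HISTORICAL_DECODED.get(decoded_element, [])

print(source_words_for("STREET-1"))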
[0034] The comparison of the first subset of words, the second subset of words and the set of source words may now be considered in detail with the help of an example. Suppose, the first and the second subset of words and the set of source words are generated as:
THE FIRST SUBSET OF WORDS (using the application engine 204): PETER, RICKPATERSON & NETRICK
THE SECOND SUBSET OF WORDS (using the conversion engine 206): PAN, ROME, & NEWTOWN PATRICK; and
THE SET OF SOURCE WORDS (obtained from the historical decoded elements): SUZAN, PATRICK & PANKAJ.
[0035] According to an embodiment, the one or more hardware processors 104 perform, using a fuzzy algorithm, a fuzzy matching between each of the subsets of words (or entries) obtained using the application engine 204, the conversion engine 206 and the historical decoded elements. This may comprise generating a set of decoded values to assign a set of weightages to the first and the second subset of words and the set of source words for identifying the one or more closest matching words as text to be input. Therefore, performing the fuzzy matching, the one or more hardware processors 104 first attempt to match the word “PETER” (the first one amongst the first subset of words) with each of the words from the second subset and the set of source words. The word “PETER” may be considered matching to some extent with “PAN”, “PANKAJ” and “PATRICK”. Hence the word “PETER” may be assigned a decoded value of 0.5 on a scale of 0 to 1, where 1 means the highest value assigned on a perfect match between the subsets of words. The remaining of the first subset of words, that is, “RICKPATERSON” and “NETRICK”, may be assigned the decoded values 0.6 and 0.7 respectively, based upon the fuzzy matching with the second subset and the set of source words. From the second subset of words, the words “ROME” and “PAN” may be assigned the decoded values 0 and 0.2 respectively, the word “ROME” not matching with any of the words (when compared with the first and the second subset of words and the set of source words) and the word “PAN” matching with the word “PANKAJ” to some extent. From amongst the set of source words, the words “SUZAN” and “PANKAJ” may be assigned the decoded values 0 and 0.2 respectively based on the same logic (fuzzy matching). However, the word “PATRICK” may be assigned a decoded value of 0.9 or 1, as it matches completely with the word “PATRICK” (that is, within the word “NEWTOWN PATRICK”) from amongst the second subset of words and with the word “NETRICK” from amongst the first subset of words. The word “PATRICK” may thus be given the highest weightage (for example, A on a scale of A to D, where A implies the highest weightage and D the least weightage) based upon the highest decoded value assigned, and thus the word “PATRICK” may be considered the closest matching word from amongst the first and the second subset of words and the set of source words. The one or more hardware processors 104 may then display the word “PATRICK” to the user as the data entered automatically (that is, without manual data entry, on the basis of the one or more spoken words) or as the non-contact voice-based information filtered and generated.
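By way of a non-limiting illustration only, the comparison, the assignment of the set of decoded values and the weightages, and the selection of the closest matching word may be sketched in Python as below, using the standard-library difflib as a stand-in for the fuzzy matching algorithm; the weightage thresholds are illustrative and the similarity scores will differ somewhat from the illustrative decoded values discussed above:
from difflib import SequenceMatcher

first_subset  = ["PETER", "RICKPATERSON", "NETRICK"]   # from the application engine 204
second_subset = ["PAN", "ROME", "NEWTOWN PATRICK"]     # from the conversion engine 206
source_words  = ["SUZAN", "PATRICK", "PANKAJ"]         # from the historical decoded data

def decoded_value(word, others):
    # Best fuzzy similarity (0 to 1) of `word` against every word, or individual
    # token of a multi-word entry, appearing in the other two subsets.
    tokens = [t for other in others for t in other.split()]
    return max(SequenceMatcher(None, word, t).ratio() for t in tokens)

def weightage(value):
    # Illustrative mapping of decoded values to weightages A (highest) to D (least).
    return "A" if value >= 0.8 else "B" if value >= 0.6 else "C" if value >= 0.4 else "D"

subsets = {"first": first_subset, "second": second_subset, "source": source_words}
scores = {}
for name, subset in subsets.items():
    others = [w for n, s in subsets.items() if n != name for w in s]
    for word in subset:
        value = decoded_value(word, others)
        scores[word] = (round(value, 2), weightage(value))

closest = max(scores, key=lambda w: scores[w][0])
print(scores)
print("Closest matching word:", closest)   # "PATRICK" (decoded value 1.0, weightage A)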
[0036] According to an embodiment of the present disclosure, a complete working example of the present disclosure may now be considered in detail. Suppose the user needs to perform an automated data entry or non-contact voice-based information filtering and generation for the set of images which read as “STATE CODE - MAHARASTRA, PIN CODE - 110022, STREET – NTY, BUILDING NAME – ABC TOWERS, COMPANY NAME - ABC LTD AND CONTACT PERSON NAME - JOHN”. The following one or more images or elements are to be decoded (that is, the set of images are to be converted to the one or more decoded files) by the one or more servers 202:
STATE CODE
PIN CODE
STREET
BUILDING NAME
COMPANY NAME; and
CONTACT PERSON
[0037] Out of the above one or more images, only “CONTACT PERSON” resembles a person or a human being and may be considered exhaustive, while the remaining one or more images may resemble anything (like a place, an address, etc.). For the remaining one or more images, there may be a very big list (obtained from the application engine 204 (using the OCR), the conversion engine 206 and the historical or previously decoded data) comprising the first subset of words, the second subset of words and the set of source words.
[0038] According to an embodiment of the present disclosure, using the application engine 204 or the conversion engine 206, from the above set of images, firstly, one or more state codes pertaining to the states (in India) are to be decoded or identified. The user may first read the set of images, which may be recorded by the speech recognition device and then converted into the audio file. Using the application engine 204, the first subset of words may be obtained as “MAHA”, “MAHARASHTRA”, “MAHARAS” or “MARASTRA”, etc. Further, using the conversion engine 206, the second subset of words may be obtained as “MAH”, “MAHARASHTR”, “MATRA” or “MAHARASTRA”. It may be noted that the historical decoded data may not be considered at this stage for obtaining the first subset of words (that is, “MAHARASTRA”), as there is practically no possibility of matching between yet-unidentified state codes and the historical decoded data. The one or more hardware processors 104 may then perform the comparison based upon the fuzzy matching by assigning the set of decoded values and further assigning the weightages to each of the first and the second subset of words as explained above. So, the word “MAHARASHTRA” or the word “MAHARASTRA” may be assigned the highest decoded value of 0.9 or 1.0, as these subsets of words closely resemble the word “MAHARASHTRA”, and may accordingly be given the highest weightage (for example, A) by the one or more hardware processors 104 and be considered for the automated data entry or for the non-contact voice-based information filtering and generation. The one or more hardware processors 104 may then display the word “MAHARASHTRA” to the user as the data entered automatically (that is, without manual data entry, on the basis of the one or more spoken words) or as the non-contact voice-based information filtered and generated.
[0039] Now suppose one or more “PIN CODES” are to be decoded based upon one or more “PIN CODES” read by the user (that is, “110022”) from the set of images, recorded by the speech recognition device and then finally extracted by the application engine 204 or the conversion engine 206 from the set of images and the recorded audio respectively. There may be a huge number of “PIN CODES” in India which could be decoded. However, the state code has been decoded above and “MAHARASHTRA” has been identified. Thus, a large number of “PIN CODES” gets eliminated, as only the “PIN CODES” in “MAHARASHTRA” need to be considered. Therefore, according to the embodiments of the present disclosure, the one or more hardware processors 104, using the application engine 204 and the conversion engine 206, may generate the first and the second subset of words (or the “PIN CODES”) respectively. Further, the set of source words (with respect to the “PIN CODES”) may also be generated based upon the historical decoded data. The one or more hardware processors 104 may then perform a comparison of the first and the second subset of words and the set of source words based upon the set of decoded values, thereby assigning the weightages for identifying the one or more closest matching words as text to be input. So, the word (or the “PIN CODE”) “110022” or the word “11002” may be assigned the highest decoded value of 1.0 or 0.9 and the weightage A, as these subsets of words closely or exactly resemble the word “110022”, which is to be considered for the automated data entry or for the non-contact voice-based information filtering and generation. The one or more hardware processors 104 may then display the word “110022” to the user as the data entered automatically (that is, without manual data entry, on the basis of the one or more spoken words) or as the non-contact voice-based information filtered and generated.
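By way of a non-limiting illustration only, the narrowing of the candidate “PIN CODES” by the previously decoded state code before the fuzzy matching may be sketched in Python as below; the candidate table and the engine outputs are purely illustrative:
from difflib import SequenceMatcher

PIN_CODES_BY_STATE = {
    # illustrative candidate table keyed by the already decoded state code
    "MAHARASHTRA": ["110022", "110023", "400001"],
    "KARNATAKA":   ["560001", "560002"],
}

def decode_pin(decoded_state, first_subset, second_subset):
    # Only PIN codes belonging to the decoded state remain candidates,
    # so far fewer words need to be fuzzy-matched.
    candidates = PIN_CODES_BY_STATE[decoded_state]
    spoken = first_subset + second_subset
    return max(candidates,
               key=lambda pin: max(SequenceMatcher(None, pin, w).ratio() for w in spoken))

print(decode_pin("MAHARASHTRA", ["110022", "11002"], ["110O22"]))   # -> 110022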
[0040] According to an embodiment, one or more “STREETS” are to be decoded for identifying the exact or closest matching “STREET” that has been spoken by the user (that is, “NTY”). However, the decoded elements have already been identified as “MAHARASHTRA” and “110022”. Therefore, the “STREET” now needs to be identified and decoded from the decoded elements “MAHARASHTRA” and “110022”, which further makes the decoding and selection of the closest or exact matching subset of words faster and more accurate. The first and the second subset of words and the set of source words may then again be generated by the application engine 204, the conversion engine 206 and the historical decoded data respectively, and the comparison may be performed for assigning the weightages based upon the set of decoded values for identifying the one or more closest matching words with the set of images as text to be input. The one or more closest matching words may be obtained for the “STREET” as “NITY”, “NETY” or “NTY”, etc. The one or more hardware processors 104 may then display the word “NTY” (or, if “NETY” is selected, “NETY”) to the user (after assigning the decoded value/s of 0.8 or 0.9 and the weightages A or B) as the data entered automatically (that is, without manual data entry, on the basis of the one or more spoken words) or as the non-contact voice-based information filtered and generated.
[0041] According to an embodiment of the present disclosure, the “BUILDING NAME” may further be decoded from the state code “MAHARASHTRA”, the “PIN CODE” “110022” and the “STREET” “NTY” or “NETY”, based upon the first and the second subsets of words and the set of source words obtained using the application engine 204, the conversion engine 206 and the historical decoded data, by performing the comparison and assigning the set of decoded values and the weightages. Further, from the decoded state code “MAHARASHTRA”, the “PIN CODE” “110022”, the “STREET” “NTY” or “NETY” and the decoded “BUILDING NAME”, the “COMPANY NAME” may be decoded in a similar manner, and finally, from the state code “MAHARASHTRA”, the “PIN CODE” “110022”, the “STREET” “NTY” or “NETY”, the decoded “BUILDING NAME” and the decoded “COMPANY NAME”, the “CONTACT PERSON” may be decoded based upon the first and the second subsets of words and the set of source words obtained. The one or more hardware processors 104 may then display all the words (or the text) to the user as the data entered automatically (that is, without manual data entry, on the basis of the one or more spoken words and based upon the set of decoded values and weightages assigned after the comparison) or as the non-contact voice-based information filtered and generated as:
“STATE CODE - MAHARASTRA, PIN CODE - 110022, STREET – NTY (or NETY), BUILDING NAME – ABC TOWERS, COMPANY NAME - ABC LTD AND CONTACT PERSON NAME - JOHN”.
Thus, the present invention facilitates completely automating the data entry and/or performing non-contact voice-based information filtering and generation, as there is no need to manually enter anything into the computer / laptop, etc. On the basis of the spoken words, the system 100 may perform non-contact voice-based information filtering and generation, thus completely automating the data entry.
[0042] According to an embodiment, the advantages of the present disclosure may now be considered in detail. The present disclosure saves a lot of time, as the data entry is completely automated and performed by the user speaking one or more words from the set of images. As speaking is much faster than manual data entry, a lot of time may be saved, and the information may be stored and generated much faster to perform a variety of functions by the user. The present disclosure also increases accuracy to a large extent, as it offers a solution different from the traditional systems and methods. The traditional systems and methods provide for converting and generating a text using the conversion engines (or the OCR techniques) based upon the audio file, where the comparison to find and generate the closest matching word/s is performed by comparing the audio file words with hundreds or thousands of dictionary words, which leads to low accuracy.
[0043] However, according to an embodiment of the present disclosure, instead of reading the entire text in one attempt or as a dictation, the user first selects the entity to be spoken (for example, a building name) from the set of images by selecting the relevant portion from the one or more files in the system 100 (which contain the set of images) and then speaks the corresponding word/s (for example, ABC TOWERS). The user only speaks the corresponding word/s, which enables the one or more hardware processors to use the application engine 204, the conversion engine 206 and the historical decoded data to generate the first subset, the second subset and the set of source words respectively, and to perform the comparison for assigning the set of decoded values and the weightages for identifying the one or more closest matching words. Therefore, instead of comparing a large number of words with the words pronounced in one attempt as in dictation, the present disclosure, unlike the traditional systems and methods, decodes and identifies the closest or exact matching words by comparing only the one or more words spoken by the user (corresponding to the entity in question), therefore leading to a very high accuracy. Before the next word/s is/are spoken by the user (corresponding to the next entity in question), the one or more closest or exact matching words are decoded. Further, the present disclosure generates the first and the second subset of words and the set of source words using the application engine 204 (through the OCR), the conversion engine 206 and the historical decoded data respectively, and then performs the comparison and assigns the set of decoded values and the weightages, which leads to scanning more options of words for decoding and thus to a higher accuracy (unlike the traditional systems and methods which employ OCR based scanning or use the conversion engine only, with no comparison performed).
[0044] The proposed disclosure further leads to higher efficiency and productivity by eliminating risks involved in manual data entry (for example, a bank agent entering a wrong value in the computer) by automating the data entry and facilitating non-contact voice-based information filtering and generation. It also eliminates the need for manual intervention in a variety of functions. For example, in bank account maintenance related data entry, the customer calls in for making necessary changes to personal information like address and email id. The bank agent makes the required changes by typing in the modified details as the customer speaks. However, with the proposed disclosure, this may be automated by automatically populating the required fields by converting the customer's speech to text. Also, in compliance checks for bank account related customer calls, for every customer call, the bank agents are supposed to speak a mandatory script depending on what the customer is requesting.
[0045] The proposed disclosure may be leveraged to check whether the agent has spoken the entire mandatory script. In call monitoring for various customer calls, the calls are recorded and monitored to check various compliance and quality attributes. At present, this is done on a sampling basis. The proposed disclosure can be used to convert recorded calls to text and subsequently automate the monitoring process. Automation would enable an increase in the sampling percentage. Further, manual efforts are involved in drafting clinical trial documents; the proposed disclosure may be implemented to create these documents, with the user dictating the text to be populated in the document. Thus, the proposed disclosure may enhance overall efficiency and increase productivity and accuracy in performing various operations in the banking, insurance, retail and various other domains. By automating and increasing accuracy in various tasks, the proposed disclosure may also help people with physical disabilities or repetitive stress injuries handle activities like data entry operations.
[0046] It may be noted that the output of all the steps performed above (that is, steps 301 to 303), for example, the first and the second subset of words, the set of source words, the set of decoded values and the set of weightages, is stored in the memory 102 of the system 100.
[0047] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0048] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0049] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0050] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0051] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, BLU-RAYs, flash drives, disks, and any other known physical storage media.
[0052] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.