FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention: TRANSCRIBING AUDIO FILE TO TEXT FILE
Applicant
Tata Consultancy Services Limited A company Incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed
TRANSCRIBING AUDIO FILE TO TEXT FILE
TECHNICAL FIELD
[001] The present subject matter described herein, in general, relates to transcription
of audio files, and more particularly to a system and a method for transcribing audio files to text files.
BACKGROUND
[002] In an era of digital processing systems, a large volume of audio files is received and stored by the digital processing systems installed in an organization such as a call center. The audio files may be put to use by the organization for several purposes, such as training employees and improving work-related processes. However, to make the audio files useful, the audio files need to be converted into text files. This conversion of the audio files into text files is called transcription.
[003] However, technology for transcribing audio files into text files is not yet fully mature and gives erroneous results. Further, the problem is compounded when there are audio files in multiple languages.
[004] An alternative to automated transcription is human-based transcription. Humans can transcribe the audio files into text files more accurately. However, human-based transcription would require a large number of people for transcribing multiple languages. Further, confidentiality is an issue when a large number of people are used for transcribing. Furthermore, economic feasibility is another issue in the full-time availability of a large number of people for transcribing audio files in multiple languages. Therefore, human-based transcription may not be suitable for either employers or employees.
SUMMARY
[005] This summary is provided to introduce aspects related to a system and a method for transcribing audio files to text files, and the aspects are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[006] In one implementation, a system for transcribing an audio file to a text file is disclosed. The system includes a processor and a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules embodied on the memory. The modules comprise an audio clipping module, a distribution module, and a consolidation module. The audio clipping module is configured to receive a plurality of audio files from a plurality of sources and to segment each audio file into a plurality of segments. Then, a plurality of audio clips is generated using the plurality of segments, wherein each audio clip of the plurality of audio clips is generated by combining segments associated with one or more audio files. The distribution module is configured to distribute the plurality of audio clips to a plurality of stakeholders for transcribing the plurality of audio clips into text clips, wherein the plurality of audio clips is distributed based upon a set of parameters. Further, the consolidation module is configured to receive the text clips from electronic devices associated with the plurality of stakeholders, wherein each of the plurality of stakeholders transcribes a subset of the plurality of audio clips into the text clips using the electronic devices. Further, the text clips are consolidated and arranged to generate the text file corresponding to the audio file.
[007] In another implementation, a method for transcribing the audio file into a text file is disclosed. The method includes receiving a plurality of audio files from a plurality of sources and segmenting each audio file into a plurality of segments. Then, a plurality of audio clips is generated using the plurality of segments, wherein each audio clip of the plurality of audio clips is generated by combining segments associated with one or more audio files. The plurality of audio clips is distributed to a plurality of stakeholders for transcribing the plurality of audio clips into text clips. The plurality of audio clips is distributed based upon a set of parameters. The text clips are received from electronic devices associated with the plurality of stakeholders, wherein each of the plurality of stakeholders transcribes a subset of the plurality of audio clips into the text clips using the electronic devices. Further, the text clips received from each of the plurality of electronic devices are consolidated and arranged to generate the text file corresponding to the audio file. The steps of receiving the audio files, the segmenting, the generating, the distributing, the receiving of the text clips, and the consolidating are performed by a processor.
[007] In yet another implementation, a computer program product having embodied thereon a computer program for transcribing an audio file into a text file is disclosed. The computer program product includes a program code for receiving a plurality of audio files from a plurality of sources and a program code for segmenting each audio file into a plurality of segments. The computer program product includes a program code for generating a plurality of audio clips using the plurality of segments, wherein each audio clip of the plurality of audio clips is generated by combining segments associated with one or more audio files. Further, the computer program product includes a program code for distributing the plurality of audio clips to a plurality of stakeholders for transcribing the plurality of audio clips into text clips, wherein the plurality of audio clips is distributed based upon a set of parameters. Also, the computer program product includes a program code for receiving the text clips from electronic devices associated with the plurality of stakeholders, wherein each of the plurality of stakeholders transcribes a subset of the plurality of audio clips into the text clips using the electronic devices. Further, the computer program product includes a program code for consolidating and arranging the text clips to generate the text file corresponding to the audio file.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
[009] Figure 1 illustrates a network implementation of a system for transcribing
audio files into text file using a plurality of stakeholders, in accordance with an embodiment of the present subject matter.
[0010] Figure 2 illustrates the system, in accordance with an embodiment of the
present subject matter.
[0011] Figure 3 shows an operational environment for transcribing the audio files into the text file, in accordance with an embodiment of the present subject matter.
[0012] Figure 4 is a flowchart illustrating a method for transcribing the audio files into
the text file, in accordance with an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0013] Systems and methods for transcribing audio files into text files using a
plurality of stakeholders are described. At first, audio files are received from a source. The source may be a call center which receives several calls from various customers. The audio files may need to be transcribed into text files for several reasons, such as employee training, quality monitoring, and the like. Subsequently, each audio file may be segmented into a plurality of segments. The plurality of segments belonging to one or more audio files may be combined to generate a plurality of audio clips. In other words, each audio clip may be made up of segments of different or same audio files.
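The segment-and-combine step described above can be sketched, purely for illustration, as follows. All function names and data structures here are assumptions, not part of the specification, and audio content is represented as plain text for readability; a real system would operate on audio samples.

```python
# Illustrative sketch of segmenting audio files and combining segments
# into audio clips, as described in the specification. Names are assumed.

def segment_file(file_id, content, parts):
    """Split one audio file into labelled segments."""
    size = max(1, len(content) // parts)
    segments = []
    for i in range(parts):
        start = i * size
        end = len(content) if i == parts - 1 else start + size
        if start < len(content):
            segments.append((f"{file_id}p{i + 1}", content[start:end]))
    return segments

def combine_into_clips(segments, per_clip):
    """Group segments (possibly from different files) into audio clips."""
    clips = []
    for i in range(0, len(segments), per_clip):
        clips.append(segments[i:i + per_clip])
    return clips
```

In this sketch, each clip is simply a list of labelled segments, so segments from different files can be mixed freely before distribution.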
[0014] The plurality of audio clips is distributed to the plurality of stakeholders for transcribing each of the plurality of audio clips into its corresponding text clip. The distribution is based on a plurality of parameters that decide how the audio clips should be distributed and to whom. The parameters comprise at least one of a complexity of the audio clips, a location of a stakeholder, a level of literacy of a stakeholder, a number of reward points earned by a stakeholder, fairness of opportunity, frequency of participation, and the like.
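The parameter-based distribution above may be sketched as a scoring function; the specification lists the parameters but not how they are combined, so the weights and field names below are illustrative assumptions only.

```python
# Hedged sketch of parameter-based distribution of audio clips to
# stakeholders. The weighting scheme is an assumption for illustration.

def suitability(stakeholder, clip):
    """Higher score means a better match of stakeholder to audio clip."""
    score = 0.0
    # Complex clips should go to more literate stakeholders.
    if stakeholder["literacy"] >= clip["complexity"]:
        score += 2.0
    # Nearby stakeholders are favoured (distance in arbitrary units).
    score += 1.0 / (1.0 + stakeholder["distance"])
    # Past reward points and frequent participation raise the score.
    score += 0.1 * stakeholder["reward_points"]
    score += 0.5 * stakeholder["participation"]
    return score

def distribute(clips, stakeholders):
    """Assign each clip to the currently most suitable stakeholder."""
    return {clip["id"]: max(stakeholders, key=lambda s: suitability(s, clip))["name"]
            for clip in clips}
```

A production system would also fold in fairness of opportunity, for example by damping the score of stakeholders who already hold many clips.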
[0015] Based upon these parameters, the audio clips may be distributed to the plurality of stakeholders. The plurality of stakeholders may transcribe the audio clips into text clips using their respective electronic devices, such as mobile phones. After transcribing the audio clips, the stakeholders may send the text clips to the system using the electronic devices. Based upon the text clips, two types of scores may be assigned to the stakeholders: usability scores and reward points.
[0016] The usability scores are provided by the electronic devices. The usability scores are based on the number of rewinds/playbacks required, the time taken, the typing speed, and the backspaces used while transcribing an audio clip into a text clip on the electronic device. According to the proficiency of a stakeholder, the user interface of the electronic device adapts itself to enhance that proficiency. The reward points, on the other hand, are assigned based upon the quality of the text clip transcribed by a stakeholder. In other words, the stakeholders are assigned reward points for performing the transcription of the audio clips and the verification of previously transcribed audio clips. The stakeholders are also assigned reward points based on the history of the quality of their transcriptions, the amount of time spent in transcription, the amount of time spent in verification of the text clips, and the like. The reward points and the usability scores help in making future choices as to who should transcribe which audio clip. Finally, the text clips received from each of the electronic devices associated with the stakeholders are consolidated and arranged to generate the text file. The text file corresponds to an audio file and may be used for training employees or for several other purposes.
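A usability score combining the factors named above might be computed as in the sketch below. The specification does not define a formula, so the penalties, bonus cap, and 0-100 range are all assumptions made for illustration.

```python
# Illustrative usability-score calculation based on rewinds, time taken,
# typing speed, and backspaces. The exact formula is an assumption.

def usability_score(rewinds, seconds_taken, chars_per_min, backspaces):
    """Return a score in [0, 100]; fewer corrections and faster typing score higher."""
    penalty = 2.0 * rewinds + 0.1 * seconds_taken + 1.0 * backspaces
    bonus = min(chars_per_min, 200) / 4.0   # cap the typing-speed bonus
    return max(0.0, min(100.0, 50.0 + bonus - penalty))
```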
[0017] While aspects of the described system and method for transcribing the audio files into the text file may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system.
[0018] Referring now to Figure 1, a network implementation 100 for transcribing the audio files into the text files using a plurality of stakeholders is illustrated, in accordance with an embodiment of the present subject matter. In one embodiment, the system 102 performs the transcription of the audio files into the text files using a plurality of stakeholders. The system 102 receives the audio files from a source and segments the audio files into a plurality of segments. The plurality of segments belonging to one or more audio files may be combined and jumbled to generate a plurality of audio clips. In other words, each audio clip may be made up of several segments of different or same audio files.
[0019] Subsequently, the system 102 distributes the plurality of audio clips to the plurality of stakeholders for transcribing each audio clip of the plurality of audio clips into text clips. Such distribution is based on a plurality of parameters. After the distribution, the plurality of stakeholders transcribes the audio clips into text clips using electronic devices such as mobile phones. Subsequently, the plurality of stakeholders sends the text clips to the system 102 using the electronic devices connected to the system 102. Further, the system 102 consolidates and arranges the text clips received from each of the plurality of electronic devices 104 to generate the text file corresponding to an audio file.
[0020] Although the present subject matter is explained considering that the system
102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a network server, and the like. It will be understood that the system 102 may be accessed by the plurality of stakeholders through one or more electronic devices 104-1, 104-2... 104-N, collectively referred to as electronic devices 104 hereinafter, or applications residing on the electronic devices 104. Examples of the electronic devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a mobile phone, and a workstation. The electronic devices 104 are communicatively coupled to the system 102 through a network 106.
[0021] In one implementation, the network 106 may be a wireless network, a wired network, or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as an intranet, a local area network (LAN), a wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0022] Referring now to Figure 2, the system 102 is illustrated in accordance with an
embodiment of the present subject matter. In one embodiment, the system 102 may include at least one processor 202, an input/output (I/O) interface 204, and a memory 206. The at least one processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic
circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 206.
[0023] The I/O interface 204 may include a variety of software and hardware
interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 204 may allow the system 102 to interact with a user directly or through the electronic devices 104. Further, the I/O interface 204 may enable the system 102 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 204 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc.. and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or more ports for connecting a number of devices to one another or to another server.
[0024] The memory 206 may include any computer-readable medium known in the art
including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 may include modules 208 and data 210.
[0025] The modules 208 include routines, programs, objects, components, data
structures, etc., which perform particular tasks or implement particular abstract data types. In one implementation, the modules 208 may include an audio clipping module 212, a distribution module 214, a correction verification module 216, a consolidation module 218, a scoring module 220, and other modules 222. The other modules 222 may include programs or coded instructions that supplement applications and functions of the system 102.
[0026] The data 210, amongst other things, serves as a repository for storing data
processed, received, and generated by one or more of the modules 208. The data 210 may also include a system database 224 and other data 226. The other data 226 may include data generated as a result of the execution of one or more modules in the other modules 222.
[0027] The working of Figure 2 may be explained in conjunction with Figure 3.
Figure 3 shows an operational environment in which the system 102 works to transcribe audio
files into the text files, in accordance with an embodiment of the present subject matter. In one implementation, the system 102 receives audio files from a plurality of sources 302. The sources 302 may include, but are not limited to, a call centre or any other organization wishing to transcribe each audio file into a text file.
[0028] In the present implementation, the source 302 may have a huge amount of audio files. It may be understood that the audio files may be in different languages and of varying complexity. In one embodiment, the source 302 may categorize the audio files based on complexity and language. In another embodiment, the audio clipping module 212 may categorize the audio files based on complexity and language. The audio files may be stored in the system database 224.
[0029] The audio clipping module 212, after receiving the audio files, may segment
each audio file into a plurality of segments. The plurality of segments belonging to several audio files may be combined/jumbled to generate a plurality of audio clips. In other words, each audio clip may be made up of segments of different audio files. The audio clip 304 may have a duration that is short and comfortable for a stakeholder 306 to transcribe. The stakeholder 306 may include, but is not limited to, a human.
[0030] The audio clips 304 may be of varying lengths. In one embodiment, the audio
clips 304 may be clipped at the end of each paragraph/sentence so that the stakeholder may understand the context of the audio clip 304 and transcribe the audio clip more efficiently. In another embodiment, in order to maintain the confidentiality of the audio files, the audio clips are generated in such a manner that the stakeholder does not understand the context of the audio clip. For example, the segments from several audio files may be jumbled and combined into one audio clip 304 before being sent to a stakeholder 306.
[0031] For example, in the present embodiment, the system 102 may store the
following audio files:
[0032] Audio file 1 (conversation): Oh! grandmother, what big ears you have! The
better to hear you with, my child. But, grandmother, what big eyes you have! The better to see you with, my dear. But, grandmother, what large hands you have! The better to hug you with. Oh! but, grandmother, what a terrible big mouth you have! The better to eat you with!
[0033] Audio file 2: Have some wine. I don't see any wine. There isn't any. Then it
wasn't very civil of you to offer it. It wasn't very civil of you to sit down without being invited. I didn't know it was your table, it's laid for a great many more than three.
[0034] Audio files 1 and 2 are sent to the system 102 by the source 302. The system
102 identifies the language of the audio files 1 & 2, and other possible parameters that can be used for assessing the complexity of the content, such as the context/domain of the content. The audio clipping module 212 clips the audio files 1 & 2 into a plurality of segments.
[0035] Audio file 1 may be split into a plurality of segments, namely, M1p1: Oh! Grandmother; M1p2: what big ears you have!; M1p3: The better to hear you with, my child; M1p4: But, grandmother, what big eyes you have!; M1p5: The better to see you with, my dear; M1p6 and so on.
[0036] Similarly, Audio file 2 may be split into a plurality of segments, namely, M2p1: Have some wine. I don't see any wine; M2p2: There isn't any; M2p3: Then it wasn't very civil of you to offer it; M2p4: It wasn't very civil of you to sit down without being invited; M2p5: I didn't know it was your table, it's laid for a great many more than three.
[0037] In the present example, segments from one or more audio files may be combined together to form an audio clip. This may be done to maintain the confidentiality of the audio files. Then, the audio clips 1 & 2 are made by combining various segments. For example, Audio clip 1 = M2p3 + M1p1 + M1p5, i.e., Audio clip 1 = Then it wasn't very civil of you to offer it. «pause» Oh! Grandmother. «pause» The better to see you with, my dear.
[0038] Similarly, Audio clip 2 = M1p3 + M2p1 + M2p5, i.e., Audio clip 2 = The better to hear you with, my child. «pause» I don't see any wine. «pause» I didn't know it was your table, it's laid for a great many more than three. Audio clip 3 = M2p2 + M1p2, i.e., Audio clip 3 = There isn't any. «pause» what big ears you have! Audio clip 4 = The better to hug you with; and so on.
[0039] In this way, it may be understood that the audio clips are composed of many segments, and a track of this is kept in the system database 224. In another embodiment, different segments of a larger audio clip 304 may be combined together into one audio clip before being sent to the stakeholder 306.
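The jumbling step of the worked example above, together with the clip-to-segment mapping that the system database 224 is said to hold, can be sketched as follows. The data structures and the `<pause>` separator are assumptions for illustration, with the segment text standing in for audio.

```python
# Sketch of combining segments from different audio files into one clip
# while recording which segments each clip uses (the mapping that the
# system database 224 keeps so that merging can later be done correctly).

segments = {
    "M1p1": "Oh! Grandmother.",
    "M1p5": "The better to see you with, my dear.",
    "M2p3": "Then it wasn't very civil of you to offer it.",
}

def make_clip(clip_id, segment_ids, segments, mapping):
    """Combine segments into one clip and record which segments it uses."""
    mapping[clip_id] = list(segment_ids)
    return " <pause> ".join(segments[sid] for sid in segment_ids)

mapping = {}   # stands in for the clip-to-segment record in the database
clip1 = make_clip("clip1", ["M2p3", "M1p1", "M1p5"], segments, mapping)
```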
[0040] After categorizing and jumbling the segments to form the audio clips 304, the distribution module 214 distributes the audio clips 304 to the plurality of stakeholders 306 for transcribing the audio clips 304 into text clips. In one implementation, the distribution module 214 determines which audio clip 304 needs to be transcribed by how many stakeholders 306, and by which stakeholders 306. In one embodiment according to the present subject matter, the distribution module 214 assigns the audio clips 304 to the stakeholders 306 based on a set of parameters such as the complexity of the audio clips 304, the location of the stakeholders 306, the literacy level of the stakeholders 306, the reward points and usability scores assigned to the stakeholders in previous transcriptions, and the frequency of participation of the stakeholders 306 in transcribing audio clips into text clips. The reward points and the usability score may be used by the distribution module 214 to determine a proficiency level of a stakeholder in transcribing an audio clip into a text clip.
[0041] In one example, the usability score is provided by a usability assessment module present in the electronic device of the stakeholders 306. The usability assessment module may record the proficiency score on the basis of certain factors such as the number of rewinds/playbacks required, the time taken, the typing speed, and the backspaces used by the stakeholder to transcribe the audio clip. According to the behavior of the stakeholders, the usability assessment module 308 may adapt the user interface 310 to enhance the proficiency of the stakeholders 306. The proficiency score of the stakeholders 306 is sent to the distribution module 214 so that the distribution module 214 can make more judicious choices in future as to who should transcribe which audio clip 304.
[0042] Further, the distribution module 214 grades the stakeholders from low proficiency to high proficiency by providing reward points. More complex audio clips 304 are distributed to the stakeholders 306 with a high proficiency level. The transcribed output of a stakeholder 306 with a low proficiency level is given as input to another stakeholder with a higher proficiency level, who makes corrections. The chain continues till no change is made, indicating that the text clip is correct.
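The correction chain described above may be sketched as below: a transcript passes through increasingly proficient reviewers until a pass makes no change. Modelling each reviewer as a function is an assumption made for illustration.

```python
# Sketch of the correction chain: apply reviewers in order of rising
# proficiency and stop as soon as a pass leaves the transcript unchanged.

def correction_chain(transcript, reviewers):
    """Run reviewers in order; stop when one makes no change."""
    for review in reviewers:
        corrected = review(transcript)
        if corrected == transcript:
            break          # no change: the text clip is taken to be correct
        transcript = corrected
    return transcript

# Hypothetical reviewers: each fixes one kind of error if present.
fix_spelling = lambda t: t.replace("grandmohter", "grandmother")
fix_case = lambda t: t[0].upper() + t[1:] if t else t
```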
[0043] Further, the stakeholders 306 residing near the system 102 are likely to receive more audio clips 304 than the stakeholders 306 residing remotely from the system 102. Further, the distribution module 214 assigns more complex audio clips 304 to more literate stakeholders. Also, the stakeholders 306 who participate more in the transcribing are likely to receive more audio clips 304 than a stakeholder 306 who participates in the transcribing rarely.
[0044] First, the stakeholders need to be registered with the system 102 and use a mobile application for transcription. At any time, the stakeholders 306 can start the mobile application and seek requests to transcribe, i.e., fetch the audio clips 304. After fetching the audio clips 304, the stakeholders 306 can perform one or more of a plurality of tasks:
a. Transcribe: Transcribe the audio clips 304 to corresponding text clips.
b. Verify: Here, the stakeholders 306 verify the correctness of the previously
transcribed audio clips 304.
[0045] On the basis of the transcription and verification, the stakeholders 306 are assigned reward points, which may be positive or negative. For example, every transcription earns some reward points. A verified transcription without error earns more reward points. However, if the verification involves major corrections, then it may result in negative reward points for the original stakeholder 306. This ensures quality participation from the stakeholders 306 so that there is consistently good quality. The stakeholders 306 may redeem the reward points earned in various ways. The scoring module 220 takes care of providing reward points to the stakeholders 306 on the basis of various metrics such as the quality of transcription and the assessed proficiency. The correction verification module 216 takes care of the quality of transcriptions through verification and the mechanism of points. As explained earlier, the stakeholders 306 also receive usability scores on the basis of certain parameters. There could be various user interfaces 310 that change dynamically based on the assessed proficiency of the stakeholders 306.
[0046] Further, to make text entry easier, especially for regional languages, the user interface (UI) 310 of the electronic devices may have a customizable predictive text-entry module (not shown) that comprises a list of commonly occurring phrases/words. The stakeholders 306 can add items to this list. The list provides a quicker means of writing text. The customizable predictive text-entry module would also detect word sequences/phrases that the stakeholder 306 has been frequently writing, and automatically add such phrases to the list.
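A minimal sketch of the customizable predictive-text list described above follows. The class shape and the frequency threshold for auto-adding a phrase are assumptions, not part of the specification.

```python
# Sketch of a customizable predictive text-entry list: stakeholders add
# phrases explicitly, and frequently typed phrases are added automatically.

from collections import Counter

class PredictiveList:
    def __init__(self, auto_add_after=3):
        self.phrases = set()             # user-added and auto-added phrases
        self.counts = Counter()
        self.auto_add_after = auto_add_after

    def add(self, phrase):
        """Stakeholder explicitly adds a phrase to the list."""
        self.phrases.add(phrase)

    def observe(self, phrase):
        """Record a typed phrase; auto-add it once it becomes frequent."""
        self.counts[phrase] += 1
        if self.counts[phrase] >= self.auto_add_after:
            self.phrases.add(phrase)

    def suggest(self, prefix):
        """Return listed phrases matching the typed prefix."""
        return sorted(p for p in self.phrases if p.startswith(prefix))
```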
[0047] Some of the stakeholders 306 may not be comfortable with typing using a text keypad; therefore, an alternative touch-screen option may be used in the following manner: first, an automated speech engine (either at the system end or the electronic device end, which may not have a high degree of accuracy) converts the audio files to a text file. For every recognized word, the user interface 310 displays other close matches. The stakeholder 306 scrolls between the recognized words and checks whether they are correct, or considers one of the displayed alternatives. In this way, the stakeholder 306 collects each correct word and makes a sentence, i.e., the stakeholder 306 drags the correct words and constructs a sentence at a specified region in the user interface 310.
[0048] In another embodiment, if the speech engine has not displayed the correct alternative to a word, the stakeholder 306 may himself speak out till the engine correctly recognizes the word, and the stakeholder 306 then takes that word to the sentence region.
[0049] After the transcriptions of all the audio clips are completed (along with verification), the consolidation module 218 receives the text clips from the electronic devices 104 associated with the plurality of stakeholders 306. Subsequently, the consolidation module 218 consolidates the text clips received from each of the plurality of stakeholders 306 to generate the text file corresponding to the audio files. This text file is then sent to the sources 302.
[0050] In accordance with an embodiment of the present subject matter, the transcriptions from each of the plurality of stakeholders 306 are merged and arranged by the consolidation module 218 to match the audio file. The system database 224 maintains a mapping of which audio clip 304 corresponds to which transcription, which stakeholders 306 have transcribed, and other details so that the merging is correctly done.
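The consolidation step above may be sketched as below: using the clip-to-segment mapping kept in the system database, transcribed clips are split back into segment-level text and reordered to rebuild the text file of each original audio file. The data structures and the `<pause>` separator are illustrative assumptions.

```python
# Sketch of consolidating text clips back into per-file text files using
# the clip-to-segment mapping kept during the clipping step.

def consolidate(clip_texts, clip_mapping, file_segment_order):
    """Rebuild per-file text from transcribed clips.

    clip_texts: clip id -> transcribed text, one piece per segment,
                separated by ' <pause> ' as produced during clipping.
    clip_mapping: clip id -> ordered segment ids in that clip.
    file_segment_order: file id -> segment ids in original order.
    """
    segment_text = {}
    for clip_id, text in clip_texts.items():
        pieces = text.split(" <pause> ")
        for seg_id, piece in zip(clip_mapping[clip_id], pieces):
            segment_text[seg_id] = piece
    return {fid: " ".join(segment_text[sid] for sid in order)
            for fid, order in file_segment_order.items()}
```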
[0051] The primary advantage of the invention is higher accuracy, due to an appropriate choice of stakeholders 306. An advantage of the invention is that it easily handles spontaneous/conversational audio files. Another advantage of the invention is that it does not require automated speech recognition modules or sophisticated models/computations. Yet another advantage of the invention is that it can support any language. A further advantage of the invention is that it protects the context of the content, i.e., it maintains the confidentiality of the audio files. A still further advantage of the invention is that it can be set up easily and quickly.
[0052] Referring now to Figure 4, a method 400 for transcribing the audio files into
the text file is shown, in accordance with an embodiment of the present subject matter. The method 400 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects,
components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 400 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0053] At block 402, the audio clipping module 212 receives the audio files from the
plurality of sources 302. The plurality of sources 302 has a lot of recorded audio files that it needs to transcribe into text files. In one embodiment, the sources 302 may include, but are not limited to, an organization.
[0054] The audio clipping module 212 segments the audio files into a plurality of
segments. The plurality of segments belonging to one or more audio files may be combined to generate a plurality of audio clips. In other words, each audio clip may be made up of segments of different or same audio files.
[0055] At block 404, the audio clipping module 212 segments the audio files into a
plurality of audio clips 304. The audio clip 304 has a duration that is short and comfortable for the stakeholders 306 to transcribe. The stakeholder 306 may include, but is not limited to, a human.
[0056] The audio clips 304 may be of varying lengths. As far as possible, the audio clip 304 may be of a short duration, achieved in various ways such as clipping at the end of a sentence. This would help the stakeholder 306 to understand the context while transcribing.
[0057] In another embodiment, the audio clips 304 are distributed in such a manner that the confidentiality of the audio files is maintained. In order to maintain the confidentiality of the audio files, in one example, smaller audio clips 304 may be jumbled and combined into one large audio clip 304 before being sent to a stakeholder 306. In another example, different sentences or parts of sentences from various parts of a larger audio clip 304 can be made into one audio clip 304 and sent to the stakeholder 306.
[0058] At block 406, the distribution module 214 distributes the plurality of audio
clips 304 to the plurality of stakeholders 306 for transcribing the plurality of audio clips into text clips. The plurality of audio clips are distributed based upon a set of parameters such as complexity of the audio clips 304, location of the stakeholders 306, literacy level of the
stakeholders 306, points assigned to the stakeholders 306, and frequency of participation of the
stakeholders 306.
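One way to realize a parameter-based distribution is to rank stakeholders for each clip with a weighted score over the parameters listed above. The field names, weights, and scoring formula below are assumptions chosen for illustration; the specification does not prescribe any particular weighting.

```python
# Illustrative sketch: choosing a stakeholder for a clip by scoring the
# parameters mentioned in the specification. Weights are assumptions.
from dataclasses import dataclass

@dataclass
class Stakeholder:
    name: str
    literacy_level: int   # e.g. 1 (low) .. 5 (high)
    points: int           # points assigned so far
    participation: int    # number of clips transcribed recently

def score(s, clip_complexity):
    # Prefer literate stakeholders for complex clips, reward earned
    # points, and spread work away from already-busy stakeholders.
    return (s.literacy_level * clip_complexity
            + 0.1 * s.points
            - 0.5 * s.participation)

def assign(stakeholders, clip_complexity):
    """Pick the highest-scoring stakeholder for a clip."""
    return max(stakeholders, key=lambda s: score(s, clip_complexity))
```

Location could be folded into the same score (for example, as a distance penalty); it is omitted here to keep the sketch short.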
[0059] At block 408, the text clips are received from electronic devices associated with
the plurality of stakeholders 306, wherein the plurality of stakeholders 306 transcribe the plurality of
audio clips into the text clips using the electronic devices.
[0060] At block 410, the text clips received from the plurality of stakeholders 306 are
consolidated to generate the text file corresponding to the audio files. In one embodiment,
steps of receiving the audio files, the segmenting, the distributing, the receiving the text clips,
and the consolidating are performed by a processor.
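The consolidation step can be sketched as re-ordering the returned text clips per source file. The assumption that each text clip arrives tagged with its source file and the start time of the corresponding audio clip is made here only so the example is self-contained; the specification leaves the tracking mechanism open.

```python
# Illustrative sketch: consolidating text clips into one transcript per
# audio file. Each text clip is assumed (not stated in the specification)
# to be tagged with (source_file, start_time, text).
from collections import defaultdict

def consolidate(text_clips):
    """text_clips: list of (source_file, start_time, text) tuples.
    Returns {source_file: transcript} with clips in time order."""
    by_file = defaultdict(list)
    for source, start, text in text_clips:
        by_file[source].append((start, text))
    return {source: " ".join(text for _, text in sorted(parts))
            for source, parts in by_file.items()}
```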
[0061] The order in which the method 400 is described is not intended to be construed
as a limitation, and any number of the described method blocks can be combined in any order to implement the method 400 or alternate methods. Additionally, individual blocks may be deleted from the method 400 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method 400 may be considered to be implemented in the above described system 102.
[0062] Although implementations for methods and systems for transcribing the audio
files into the text file have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations for transcribing the audio files into the text file.
WE CLAIM:
1. A system (102) for transcribing an audio file to a text file, the system comprising: a processor (202); and
a memory (206) coupled to the processor (202), wherein the processor (202) is capable of executing a plurality of modules embodied on the memory (206), the plurality of modules comprising:
an audio clipping module (212) configured to
receive a plurality of audio files from a plurality of sources (302);
segment each audio file into a plurality of segments; and
generate a plurality of audio clips using the plurality of segments, wherein each audio clip of the plurality of audio clips is generated by combining segments associated with one or more audio files; a distribution module (214) configured to
distribute the plurality of audio clips (304) to a plurality of stakeholders (306) for transcribing the plurality of audio clips (304) into text clips, wherein the plurality of audio clips (304) are distributed based upon a set of parameters; and a consolidation module (218) configured to
receive the text clips from electronic devices (104) associated with the plurality of stakeholders (306), wherein each of the plurality of stakeholders (306) transcribes a subset of the plurality of audio clips (304) into the text clips using the electronic devices (104); and
consolidate and arrange the text clips to generate the text file corresponding to the audio file.
2. The system of claim 1, wherein the set of parameters comprises at least one of a location of the plurality of stakeholders (306), a literacy level of the plurality of
stakeholders (306), a frequency of participation of the plurality of stakeholders (306), a proficiency level of transcribing, and a hierarchy among the plurality of stakeholders (306).
3. The system of claim 1, wherein the plurality of stakeholders (306) are remotely located from one another.
4. The system of claim 1, wherein generating the plurality of audio clips using the plurality of segments comprises jumbling two or more segments associated with one or more audio files, thereby maintaining confidentiality of the audio files.
5. The system of claim 1, further comprising a scoring module configured to assign reward points to the plurality of stakeholders (306) based upon a quality of transcription of the audio clips into the text clips.
6. The system of claim 5, wherein the quality of transcription is verified by one or more stakeholders proficient in transcribing the audio clips into the text clips.
7. A method for transcribing an audio file into a text file, the method comprising:
receiving a plurality of audio files from a plurality of sources (302);
segmenting each audio file into a plurality of segments; and
generating a plurality of audio clips using the plurality of segments, wherein
each audio clip of the plurality of audio clips is generated by combining segments
associated with one or more audio files;
distributing the plurality of audio clips (304) to a plurality of stakeholders
(306) for transcribing the plurality of audio clips (304) into text clips, wherein the
plurality of audio clips are distributed based upon a set of parameters;
receiving the text clips from electronic devices (104) associated with the
plurality of stakeholders, wherein each of the plurality of stakeholders (306)
transcribes a subset of the plurality of audio clips (304) into the text clips using the
electronic devices; and
consolidating and arranging the text clips to generate the text file corresponding to the audio file,
wherein the receiving the audio files, the segmenting, the generating, the distributing, the receiving the text clips, and the consolidating are performed by a processor (202).
8. The method of claim 7, wherein generating the plurality of audio clips using the
plurality of segments comprises jumbling two or more segments associated with one or
more audio files, thereby maintaining confidentiality of the audio files.
9. The method of claim 8, further comprising assigning reward points to the plurality of
stakeholders (306) based upon a quality of transcription of the audio clips into the text
clips.
10. The method of claim 9, further comprising verifying the quality of transcription by one or more stakeholders proficient in transcribing the audio clips into the text clips.
11. A computer program product having embodied thereon a computer program for transcribing audio data into a text file, the computer program product comprising:
a program code for receiving a plurality of audio files from a plurality of sources (302);
a program code for segmenting each audio file into a plurality of segments;
a program code for generating a plurality of audio clips using the plurality of segments, wherein each audio clip of the plurality of audio clips is generated by combining segments associated with one or more audio files;
a program code for distributing the plurality of audio clips (304) to a plurality of stakeholders (306) for transcribing the plurality of audio clips (304) into text clips, wherein the plurality of audio clips (304) are distributed based upon a set of parameters;
a program code for receiving the text clips from electronic devices (104) associated with the plurality of stakeholders (306), wherein each of the plurality of
stakeholders (306) transcribes a subset of the plurality of audio clips (304) into the text clips using the electronic devices (104); and
a program code for consolidating and arranging the text clips to generate the text file corresponding to the audio file.