
System And Method For Transcription

Abstract: The present disclosure provides a system and method for transcription of speech of an entity 112 into a text document in the entity's handwriting. First, one or more entities 112 are registered in the system 100, and correspondingly an identity (ID) is assigned to each of the entities 112. The system 100 can obtain temporal and auricular attributes, such as image, voice sample, and handwriting sample, of said entities 112 to facilitate registration of the entities 112. When a registered entity 112 accesses the system 100, the system 100 can sense real-time acoustic signals associated with speech of said entity 112, and can further convert them into a text document in the handwriting of said entity 112.


Patent Information

Filing Date: 21 July 2020
Publication Number: 04/2022
Publication Type: INA
Invention Field: ELECTRONICS
Email: info@khuranaandkhurana.com
Grant Date: 2024-06-14

Applicants

Chitkara Innovation Incubator Foundation
SCO: 160-161, Sector - 9c, Madhya Marg, Chandigarh- 160009, India.

Inventors

1. SINGH, Harjeet
Chitkara University Institute of Engineering and Technology, Chitkara University, Chandigarh-Patiala National Highway (NH-64), Village Jansla, Rajpura, Punjab - 140401, India.
2. MALARVEL, Muthukumaran
Chitkara University Institute of Engineering and Technology, Chitkara University, Chandigarh-Patiala National Highway (NH-64), Village Jansla, Rajpura, Punjab - 140401, India.
3. KHANRA, Partha
Chitkara University Institute of Engineering and Technology, Chitkara University, Chandigarh-Patiala National Highway (NH-64), Village Jansla, Rajpura, Punjab - 140401, India.

Specification

[0001] The present disclosure relates generally to the field of speech recognition. In
particular, the present disclosure relates to recognition of speech and conversion into a text
format. More particularly, the present disclosure relates to a system and method for
transcription.
BACKGROUND
[0002] The background description includes information that may be useful in
understanding the present invention. It is not an admission that any of the information
provided herein is prior art or relevant to the presently claimed invention, or that any
publication specifically or implicitly referenced is prior art.
[0003] It consumes a lot of time to think and type word-by-word on a computing
device, especially in case of beginners. Individuals who communicate on daily basis via
handwritten notes and do a lot of paper drafting work happen to waste a large part of their
time on typing such notes and hence, it results in reduction in their efficiency. Moreover, it
may have adverse impact on their eye sight. Various researches have been done on speech to
text conversion techniques in order to speed up the process. However, such techniques
cannot prevent forgery in important documents, such as, affidavits, registry papers, etc. as a
person might use stamp of the person-in-charge in case of such documents and may
mishandle them.
[0004] In a critical time such as the COVID-19 pandemic, many exams can be
conducted online at short notice using existing speech-to-text conversion techniques. In
such cases, however, some students may resort to cheating by having someone else with
knowledge of the subject speak during the exam so that it is completed easily and on time.
Some means is required to overcome such issues. Moreover, physically challenged persons,
particularly visually impaired persons, face considerable difficulty in performing such
tasks, which may cost them employment opportunities.
[0005] Speech and handwriting recognition are indispensable research areas for
real-time utility applications needed in society. In the recent past, a good amount of
research work has been carried out on speech and handwriting recognition, both offline and
online. However, many applications developed in these two areas, i.e., speech recognition,
speaker recognition, speech-to-text conversion, offline handwriting recognition, online
handwriting recognition, writer identification, postal address recognition, etc., are not
supported by all systems and are very costly.
[0006] There is, therefore, a need in the art for a system and method that mitigates
the above-mentioned problems and reduces the time required to complete a task, thereby
increasing efficiency.
OBJECTS OF THE PRESENT DISCLOSURE
[0007] Some of the objects of the present disclosure, which at least one embodiment
herein satisfies are as listed herein below.
[0008] It is an object of the present disclosure to provide a system and method for
speech to text conversion.
[0009] It is an object of the present disclosure to provide a system and method for
conversion of speech of an entity to a text comprising his/ her handwriting.
[0010] It is an object of the present disclosure to provide a system and method for
recognition of voice of an entity in real time and correspondingly convert it into text in
minimal time.
[0011] It is an object of the present disclosure to provide a system and method that is
efficient, cost effective, and easy to implement.
SUMMARY
[0012] Aspects of the present disclosure relate generally to the field of speech
recognition. In particular, the present disclosure relates to recognition of speech and
conversion into a text format. More particularly, the present disclosure relates to a system and
method for transcription.
[0013] Various objects, features, aspects and advantages of the inventive subject
matter will become more apparent from the following detailed description of preferred
embodiments, along with the accompanying drawing figures in which like numerals represent
like components.
[0014] An aspect of the present disclosure pertains to a transcription system
comprising: an input unit configured to detect temporal and auricular attributes of an entity,
and correspondingly generate a set of input signals; a processing unit operatively coupled to
the input unit, the processing unit comprising one or more processors, and coupled with a
memory, the memory storing instructions executable by the one or more processors and
configured to: extract the temporal and auricular attributes from the set of input signals;
compare the extracted temporal and auricular attributes with a first dataset comprising
pre-stored temporal and auricular attributes of one or more entities, and correspondingly
identify the entity; receive real time acoustic signals associated with said entity from the
input unit, and perform sampling; and transcribe the sampled real time acoustic signals,
responsive to matching of the sampled real time acoustic signals with a second dataset
comprising speech and handwriting in one or more languages of the identified entity, and
correspondingly generate a first set of signals.
[0015] In an aspect, the input unit comprises any or a combination of mic,
microphone, touchpad, tactile sensor, biometric device, and camera.
[0016] In an aspect, the temporal and auricular attributes comprise any or a
combination of images, biometric, voice sample, and handwriting sample of each of the one
or more registered entities.
[0017] In an aspect, one or more entities are registered into the system by obtaining
the temporal and auricular attributes of each of the one or more entities, and correspondingly
an identity (ID) is assigned to each of the one or more entities.
[0018] In an aspect, the processing unit is configured to update the first dataset and
the second dataset based on the extracted temporal and auricular attributes.
[0019] In an aspect, the system comprises a display device operatively coupled to the
processing unit, and configured to receive the first set of signals, and correspondingly display
any or a combination of the transcribed real-time acoustic signals, temporal and auricular
attributes, user ID of the entity, and image and gestures of the entity.
[0020] In an aspect, the display device comprises any or a combination of smart
phone, computer screen, television (TV) screen, laptop, and tablet.
[0021] Another aspect of the present disclosure pertains to a transcription method
comprising steps of: detecting, by an input unit, temporal and auricular attributes of an entity;
extracting, at one or more processors of a processing unit, the temporal and auricular
attributes from the set of input signals; comparing, at the one or more processors, the
extracted temporal and auricular attributes with a first dataset comprising pre-stored temporal
and auricular attributes of one or more entities, and correspondingly identifying the entity;
receiving, at the one or more processors, real time acoustic signals associated with said
entity from the input unit, and performing sampling; and transcribing, at the one or more
processors, the sampled real time acoustic signals, responsive to matching of the sampled
real time acoustic signals with a second dataset comprising speech and handwriting in one
or more languages of the identified entity, and correspondingly generating a first set of
signals.
[0022] In an aspect, the method comprises a step of registering the one or more
entities by obtaining the temporal and auricular attributes of each of the one or more
entities, and correspondingly assigning an identity (ID) to each of the one or more entities.
[0023] In an aspect, the method comprises a step of updating the first dataset and the
second dataset based on the extracted temporal and auricular attributes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] In the figures, similar components and/or features may have the same
reference label. Further, various components of the same type may be distinguished by
following the reference label with a second label that distinguishes among the similar
components. If only the first reference label is used in the specification, the description is
applicable to any one of the similar components having the same first reference label
irrespective of the second reference label.
[0025] FIG. 1 illustrates an exemplary network architecture in which or with which
proposed system can be implemented in accordance with an embodiment of the present
disclosure.
[0026] FIG. 2 illustrates exemplary functional units of a processing unit, in
accordance with an embodiment of the present disclosure.
[0027] FIG. 3 is a flow diagram illustrating a process for performing transcription, in
accordance with an embodiment of the present disclosure.
[0028] FIGs. 4A and 4B illustrate exemplary representations of working of the
system, in accordance with an embodiment of the present disclosure.
[0029] FIG. 5 illustrates an exemplary computer system in which or with which
embodiments of the present invention can be utilized in accordance with embodiments of the
present disclosure.
DETAILED DESCRIPTION
[0030] The following is a detailed description of embodiments of the disclosure
depicted in the accompanying drawings. The embodiments are in such detail as to clearly
communicate the disclosure. However, the amount of detail offered is not intended to limit
the anticipated variations of embodiments; on the contrary, the intention is to cover all
modifications, equivalents, and alternatives falling within the spirit and scope of the present
disclosure as defined by the appended claims.
[0031] In the following description, numerous specific details are set forth in order to
provide a thorough understanding of embodiments of the present invention. It will be
apparent to one skilled in the art that embodiments of the present invention may be practiced
without some of these specific details.
[0032] Embodiments of the present invention include various steps, which will be
described below. The steps may be performed by hardware components or may be embodied
in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps
may be performed by a combination of hardware, software, and firmware and/or by human
operators.
[0033] Various methods described herein may be practiced by combining one or more
machine-readable storage media containing the code according to the present invention with
appropriate standard computer hardware to execute the code contained therein. An apparatus
for practicing various embodiments of the present invention may involve one or more
computers (or one or more processors within a single computer) and storage systems
containing or having network access to computer program(s) coded in accordance with
various methods described herein, and the method steps of the invention could be
accomplished by modules, routines, subroutines, or subparts of a computer program product.
[0034] If the specification states a component or feature “may”, “can”, “could”, or
“might” be included or have a characteristic, that particular component or feature is not
required to be included or have the characteristic.
[0035] As used in the description herein and throughout the claims that follow, the
meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates
otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on”
unless the context clearly dictates otherwise.
[0036] Exemplary embodiments will now be described more fully hereinafter with
reference to the accompanying drawings, in which exemplary embodiments are shown. These
exemplary embodiments are provided only for illustrative purposes and so that this disclosure
will be thorough and complete and will fully convey the scope of the invention to those of
ordinary skill in the art. The invention disclosed may, however, be embodied in many
different forms and should not be construed as limited to the embodiments set forth herein.
Various modifications will be readily apparent to persons skilled in the art. The general
principles defined herein may be applied to other embodiments and applications without
departing from the spirit and scope of the invention. Moreover, all statements herein reciting
embodiments of the invention, as well as specific examples thereof, are intended to
encompass both structural and functional equivalents thereof. Additionally, it is intended that
such equivalents include both currently known equivalents as well as equivalents developed
in the future (i.e., any elements developed that perform the same function, regardless of
structure). Also, the terminology and phraseology used is for the purpose of describing
exemplary embodiments and should not be considered limiting. Thus, the present invention is
to be accorded the widest scope encompassing numerous alternatives, modifications, and
equivalents consistent with the principles and features disclosed. For the purpose of clarity,
details relating to technical material that is known in the technical fields related to the
invention have not been described in detail so as not to unnecessarily obscure the present
invention.
[0037] Thus, for example, it will be appreciated by those of ordinary skill in the art
that the diagrams, schematics, illustrations, and the like represent conceptual views or
processes illustrating systems and methods embodying this invention. The functions of the
various elements shown in the figures may be provided through the use of dedicated
hardware as well as hardware capable of executing associated software. Similarly, any
switches shown in the figures are conceptual only. Their function may be carried out through
the operation of program logic, through dedicated logic, through the interaction of program
control and dedicated logic, or even manually, the particular technique being selectable by
the entity implementing this invention. Those of ordinary skill in the art further understand
that the exemplary hardware, software, processes, methods, and/or operating systems
described herein are for illustrative purposes and, thus, are not intended to be limited to any
particular named element.
[0038] Embodiments of the present invention may be provided as a computer program
product, which may include a machine-readable storage medium tangibly embodying thereon
instructions, which may be used to program the computer (or other electronic devices) to
perform a process. The term “machine-readable storage medium” or “computer-readable
storage medium” includes, but is not limited to, fixed (hard) drives, magnetic tape, floppy
diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical
disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs),
programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically
erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of
media/machine-readable medium suitable for storing electronic instructions (e.g., computer
programming code, such as software or firmware). A machine-readable medium may include
a non-transitory medium in which data may be stored and that does not include carrier waves
and/or transitory electronic signals propagating wirelessly or over wired connections.
Examples of a non-transitory medium may include but are not limited to, a magnetic disk or
tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash
memory, memory or memory devices. A computer program product may include code and/or
machine-executable instructions that may represent a procedure, a function, a subprogram, a
program, a routine, a subroutine, a module, a software package, a class, or any combination
of instructions, data structures, or program statements. A code segment may be coupled to
another code segment or a hardware circuit by passing and/or receiving information, data,
arguments, parameters, or memory contents. Information, arguments, parameters, data, etc.
may be passed, forwarded, or transmitted via any suitable means including memory sharing,
message passing, token passing, network transmission, etc.
[0039] Furthermore, embodiments may be implemented by hardware, software,
firmware, middleware, microcode, hardware description languages, or any combination
thereof. When implemented in software, firmware, middleware or microcode, the program
code or code segments to perform the necessary tasks (e.g., a computer-program product)
may be stored in a machine-readable medium. A processor(s) may perform the necessary
tasks.
[0040] Systems depicted in some of the figures may be provided in various
configurations. In some embodiments, the systems may be configured as a distributed system
where one or more components of the system are distributed across one or more networks in
a cloud computing system.
[0041] Each of the appended claims defines a separate invention, which for
infringement purposes is recognized as including equivalents to the various elements or
limitations specified in the claims. Depending on the context, all references below to the
"invention" may in some cases refer to certain specific embodiments only. In other cases, it
will be recognized that references to the "invention" will refer to subject matter recited in one
or more, but not necessarily all, of the claims.
[0042] All methods described herein may be performed in any suitable order unless
otherwise indicated herein or otherwise clearly contradicted by context. The use of any and
all examples, or exemplary language (e.g., “such as”) provided with respect to certain
embodiments herein is intended merely to better illuminate the invention and does not pose a
limitation on the scope of the invention otherwise claimed. No language in the specification
should be construed as indicating any non-claimed element essential to the practice of the
invention.
[0043] Various terms as used herein are shown below. To the extent a term used in a
claim is not defined below, it should be given the broadest definition persons in the pertinent
art have given that term as reflected in printed publications and issued patents at the time of
filing.
[0044] The present disclosure relates generally to the field of speech recognition. In
particular, the present disclosure relates to recognition of speech and conversion into a text
format. More particularly, the present disclosure relates to a system and method for
transcription.
[0045] In an aspect, the present disclosure pertains to a transcription system
including: an input unit configured to detect temporal and auricular attributes of an entity,
and correspondingly generate a set of input signals; a processing unit operatively coupled to
the input unit, the processing unit comprising one or more processors, and coupled with a
memory, the memory storing instructions executable by the one or more processors and
configured to: extract the temporal and auricular attributes from the set of input signals;
compare the extracted temporal and auricular attributes with a first dataset including
pre-stored temporal and auricular attributes of one or more entities, and correspondingly
identify the entity; receive real time acoustic signals associated with said entity from the
input unit, and perform sampling; and transcribe the sampled real time acoustic signals, responsive to
matching of the sampled real time acoustic signals with a second dataset including speech
and handwriting in one or more languages of the identified entity, and correspondingly
generate a first set of signals.
[0046] In an embodiment, the input unit includes any or a combination of mic,
microphone, touchpad, tactile sensor, biometric device, and camera.
[0047] In an embodiment, the temporal and auricular attributes include any or a
combination of images, biometric, voice sample, and handwriting sample of each of the one
or more registered entities.
[0048] In an embodiment, one or more entities can be registered into the system by
obtaining the temporal and auricular attributes of each of the one or more entities, and
correspondingly an identity (ID) can be assigned to each of the one or more entities.
[0049] In an embodiment, the processing unit can be configured to update the first
dataset and the second dataset based on the extracted temporal and auricular attributes.
[0050] In an embodiment, the system includes a display device operatively coupled to
the processing unit, and configured to receive the first set of signals, and correspondingly
display any or a combination of the transcribed real-time acoustic signals, temporal and
auricular attributes, user ID of the entity, and image and gestures of the entity.
[0051] In an embodiment, the display device includes any or a combination of smart
phone, computer screen, television (TV) screen, laptop, and tablet.
[0052] In another aspect, the present disclosure pertains to a transcription method
including steps of: detecting, by an input unit, temporal and auricular attributes of an entity;
extracting, at one or more processors of a processing unit, the temporal and auricular
attributes from the set of input signals; comparing, at the one or more processors, the
extracted temporal and auricular attributes with a first dataset including pre-stored temporal
and auricular attributes of one or more entities, and correspondingly identifying the entity;
receiving, at the one or more processors, real time acoustic signals associated with said
entity from the input unit, and performing sampling; and transcribing, at the one or more
processors, the sampled real time
acoustic signals, responsive to matching of the sampled real time acoustic signals with a
second dataset including speech and handwriting in one or more languages of the identified
entity, and correspondingly generating a first set of signals.
[0053] In an embodiment, the method includes a step of registering the one or more
entities by obtaining the temporal and auricular attributes of each of the one or more
entities, and correspondingly assigning an identity (ID) to each of the one or more entities.
[0054] In an embodiment, the method includes a step of updating the first dataset and
the second dataset based on the extracted temporal and auricular attributes.
[0055] FIG. 1 illustrates an exemplary network architecture in which or with which
proposed system can be implemented in accordance with an embodiment of the present
disclosure.
[0056] In an embodiment, as illustrated in FIG. 1, the proposed transcription system
100 (interchangeably referred to as transcription system 100 and system 100, herein) can
include a processing unit 102 that can be communicatively coupled with an input unit 108
and one or more computing devices 110-1, 110-2… 110-N (also, collectively referred to as
computing devices 110 or devices 110, and individually referred to as computing device 110
or device 110 herein), through a network 104. In an exemplary embodiment, the input unit
108 can be, but not limited to, any or a combination of mic, microphone, touchpad, tactile
sensor, and camera. In another exemplary embodiment, the computing devices 110 can
include devices, such as, but not limited to, laptop, personal computer, tablet, smart phone,
portable computer, personal digital assistant, handheld device, and workstation. In a preferred
embodiment, the computing devices 110 can be mobile phones associated with respective
input devices 108.
[0057] In an embodiment, the processing unit 102 can be implemented using any or a
combination of hardware components and software components such as a cloud, a server, a
computing system, a computing device, a network device and the like. Further, the processing
unit 102 can interact with the devices 110 through a website or an application that can reside
in the system 100. In an implementation, the processing unit 102 can be accessed via a
website or application that can be configured with any operating system, including, but not
limited to, Android™, iOS™, and the like.
[0058] Further, the network 104 can be a wireless network, a wired network or a
combination thereof that can be implemented as one of the different types of networks, such
as Intranet, Local Area Network (LAN), Wide Area Network (WAN), Internet, and the like.
Further, the network 104 can either be a dedicated network or a shared network. The shared
network can represent an association of the different types of networks that can use a variety of
protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control
Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like.
[0059] In an embodiment, the system 100 can be configured to obtain acoustic signals
associated with voice of an entity, through the input unit 108. Further, the obtained acoustic
signals can be sampled and analysed to identify words and sentences spoken by the entity,
which can correspondingly be transcribed to text in a first language in a first handwriting, where the
first handwriting resembles the handwriting of said entity. Hence, the obtained acoustic
signals associated with an entity can be transcribed into handwriting of said entity by the
transcription system 100.
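The sampling step mentioned above (reducing the obtained acoustic signal to a form that can be analysed) can be sketched as follows; the function, the tone, and the rates are hypothetical values chosen purely for illustration.

```python
import math

def sample_signal(x, duration, fs):
    """Reduce a continuous-time signal x(t) to a discrete-time sequence
    by taking fs samples per second over `duration` seconds."""
    n = int(duration * fs)
    return [x(i / fs) for i in range(n)]

# Sample a 5 Hz tone for one second at 40 samples per second.
tone = lambda t: math.sin(2 * math.pi * 5 * t)
samples = sample_signal(tone, duration=1.0, fs=40)
print(len(samples))  # 40
```

A real deployment would sample speech at a much higher rate (commonly 8-48 kHz) before analysing the discrete-time sequence for words and sentences.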
[0060] In an embodiment, firstly, one or more entities 112-1, 112-2… 112-N
(collectively referred to as entities 112, and individually referred to as entity 112, hereinafter)
are required to get registered into the system 100. In an exemplary embodiment, in order to
register an entity in the system 100, temporal and auricular attributes, such as, but not limited
to, images, biometric, voice sample, and handwriting sample of the entity are obtained
through the input units 108 and are stored at a server 106. In an embodiment, once the
process of registration is completed, a corresponding identity (ID) can be generated against
registration of each entity 112. In a preferred embodiment, the temporal attributes can be
stored in a first dataset and the auricular attributes, including, voice sample and handwriting
sample, can be stored in a second dataset that can be linked to the first dataset.
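The registration flow described above (obtain attributes, store them across two linked datasets, and generate an ID) can be sketched as below; the ID format and the dictionary-backed datasets are illustrative assumptions, not part of the disclosure.

```python
import itertools

_next_id = itertools.count(1)  # illustrative sequential ID generator

def register_entity(first_dataset, second_dataset, temporal, auricular):
    """Store temporal attributes (e.g. image, biometric) in the first
    dataset and auricular attributes (voice sample, handwriting sample)
    in the second dataset, linked by the generated identity (ID)."""
    entity_id = "ID-{}".format(next(_next_id))
    first_dataset[entity_id] = temporal
    second_dataset[entity_id] = auricular
    return entity_id

first, second = {}, {}
uid = register_entity(first, second,
                      {"image": "img.png", "biometric": "fp.dat"},
                      {"voice": "v.wav", "handwriting": "hw.png"})
print(uid)  # ID-1
```

Because both datasets share the ID as a key, the auricular records in the second dataset remain linked to the temporal records in the first, as described above.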
12
[0061] In an implementation, in case a first entity 112-1, who is registered in the
system 100, wants to access the system 100, and perform transcription, he/ she is required to
enter his/ her identity. If the entered identity matches with a corresponding dataset, and is
found to be correct, then any or a combination of image and biometric of the entity 112-1 can
be obtained through camera and biometric device, respectively. In an embodiment, the
obtained image and/ or biometric is authenticated, which can be done by comparing it with
the first dataset, and when the first entity 112-1 gets positively authenticated, then he/ she can
be permitted to access the system 100.
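The authentication described above (comparing the obtained image and/or biometric with the first dataset) can be sketched as follows. Cosine similarity and the 0.9 threshold are illustrative choices only; the disclosure does not fix a comparison metric.

```python
def authenticate(entity_id, captured, first_dataset, threshold=0.9):
    """Positively authenticate when the captured feature vector is close
    enough to the pre-stored one (cosine similarity here is an
    illustrative assumption, not fixed by the disclosure)."""
    stored = first_dataset.get(entity_id)
    if stored is None:
        return False  # unknown ID: negative authentication
    dot = sum(a * b for a, b in zip(captured, stored))
    norm = (sum(a * a for a in captured) ** 0.5) * \
           (sum(b * b for b in stored) ** 0.5)
    return norm > 0 and dot / norm >= threshold

enrolled = {"ID-1": [0.9, 0.1, 0.4]}
print(authenticate("ID-1", [0.9, 0.1, 0.4], enrolled))  # True
print(authenticate("ID-2", [0.9, 0.1, 0.4], enrolled))  # False
```

A positively authenticated entity would then be permitted to access the system, while an unknown ID or a mismatched feature vector is refused.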
[0062] In an embodiment, further, when the first entity 112-1 speaks in real time,
corresponding acoustic signals can be sensed through the input unit 108, such as, a mic or a
microphone. The sensed acoustic signals can be transmitted to the processing unit 102, where
the transmitted acoustic signals can be sampled and processed and matched with the second
dataset. Based on the sampling, words spoken by the entity 112-1 can be determined, and
accordingly the determined words can be transcribed into handwriting of the first entity 112-1
in one or more languages, and can be displayed.
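The final step above, rendering the determined words in the first entity's own handwriting, can be sketched as a lookup into the entity's stored handwriting samples. The glyph table and the stand-in filenames are purely illustrative; a real system would compose stroke or image data from the handwriting sample instead.

```python
def render_in_handwriting(words, glyph_table):
    """Map each transcribed word to the entity's stored handwriting
    glyphs (stand-in filenames here; illustrative assumption only)."""
    return [glyph_table.get(word, "<no-sample:%s>" % word)
            for word in words]

glyphs = {"hello": "hello_hw.png", "world": "world_hw.png"}
print(render_in_handwriting(["hello", "world"], glyphs))
# ['hello_hw.png', 'world_hw.png']
```

Words without a stored sample are flagged rather than rendered, which would let the system fall back to a default style or request a further handwriting sample.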
[0063] In another implementation, in case a second entity 112-2, who is not registered
in the system 100, wants to access the system 100 and perform transcription, he/ she is
required to enter his/ her identity. If the entered identity does not match with a corresponding
dataset and is found to be incorrect, or if any or a combination of image and biometric of the
entity 112-2 obtained through camera and biometric device, respectively, cannot be
authenticated, then the second entity 112-2 gets negatively authenticated, and he/ she cannot
be permitted to access the system 100.
[0064] FIG. 2 illustrates exemplary functional units of a processing unit, in
accordance with an embodiment of the present disclosure.
[0065] As illustrated, the processing unit 102 can include one or more processor(s)
202. The one or more processor(s) 202 can be implemented as one or more microprocessors,
microcomputers, microcontrollers, digital signal processors, central processing units, logic
circuitries, and/or any devices that manipulate data based on operational instructions. Among
other capabilities, one or more processor(s) 202 are configured to fetch and execute
computer-readable instructions stored in a memory 204 of the processing unit 102. The
memory 204 can store one or more computer-readable instructions or routines, which may be
fetched and executed to create or share the data units over a network service. The memory
204 can include any non-transitory storage device including, for example, volatile memory
such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0066] In an embodiment, the processing unit 102 can also include an interface(s)
206. The interface(s) 206 may include a variety of interfaces, for example, interfaces for data
input and output devices, referred to as I/O devices, storage devices, and the like. The
interface(s) 206 may facilitate communication of the processing unit 102 with various devices
coupled to the processing unit 102. The interface(s) 206 may also provide a communication
pathway for one or more components of the processing unit 102. Examples of such
components include, but are not limited to, processing engines(s) 208 and database 210.
[0067] In an embodiment, the processing engine(s) 208 can be implemented as a
combination of hardware and programming (for example, programmable instructions) to
implement one or more functionalities of the processing engine(s) 208. In examples described
herein, such combinations of hardware and programming may be implemented in several
different ways. For example, the programming for the processing engine(s) 208 may be
processor executable instructions stored on a non-transitory machine-readable storage
medium and the hardware for the processing engine(s) 208 may include a processing resource
(for example, one or more processors), to execute such instructions. In the present examples,
the machine-readable storage medium may store instructions that, when executed by the
processing resource, implement the processing engine(s) 208. In such examples, the
processing unit 102 can include the machine-readable storage medium storing the instructions
and the processing resource to execute the instructions, or the machine-readable storage
medium may be separate but accessible to the processing unit 102 and the processing
resource. In other examples, the processing engine(s) 208 may be implemented by electronic
circuitry. The database 210 can include data that is either stored or generated as a result of
functionalities implemented by any of the components of the processing engine(s) 208.
[0068] In an embodiment, as illustrated in FIG. 2, processing engine(s) 208 can
include a sampling unit 212, an authentication unit 214, a transcribing unit 216, and other
unit(s) 218. The other unit(s) 218 can implement functionalities that supplement applications
or functions performed by the processing unit 102 or the processing engine(s) 208.
[0069] In an embodiment, the sampling unit 212 associated with the processing unit
102 can facilitate sampling of acoustic signals associated with the voice of an entity 112,
where the acoustic signals can be obtained through the input unit 108, such as, but not
limited to, a mic or a microphone. In an exemplary embodiment, the sampling can be
performed by reducing a continuous-time acoustic signal to a discrete-time signal. In an
implementation,
the continuous-time acoustic signal can be sampled and segmented into a sequence of
samples, where the samples can be segmented based on instantaneous value of the
continuous-time acoustic signal at the desired points. In a preferred embodiment, the
sampling unit 212 can facilitate sampling of real-time acoustic signals.
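The reduction of a continuous-time signal to discrete-time samples described above can be sketched as follows. This is a purely illustrative sketch: the 16 kHz rate, the 440 Hz test tone, and the function names are assumptions for illustration and do not appear in the disclosure.

```python
import numpy as np

# Illustrative sketch: reduce a continuous-time signal (modelled as a
# callable of time in seconds) to discrete-time samples taken at uniform
# instants. The 16 kHz rate is an assumed value typical for speech.
def sample_signal(continuous, duration_s, rate_hz=16000):
    n = int(duration_s * rate_hz)          # number of discrete samples
    t = np.arange(n) / rate_hz             # uniform sampling instants
    return continuous(t)                   # instantaneous values at those points

# A synthetic 440 Hz tone stands in for the mic input; 10 ms -> 160 samples.
tone = lambda t: np.sin(2 * np.pi * 440 * t)
samples = sample_signal(tone, duration_s=0.01)
```

Each sample is the instantaneous value of the continuous signal at the desired point, matching the segmentation described in paragraph [0069].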
[0070] In an exemplary embodiment, suppose an entity 112 speaks “You would better
see it properly” into a mic 108 associated with the transcription system 100. The
corresponding continuous-time acoustic signals can then be transmitted to the processing
unit 102, where the sampling unit 212 can analyse the received signal and sample said
signal into a set of discrete-time signal samples, such that the words pronounced by the
entity 112 can be easily distinguished from the sampled set of discrete-time signals.
[0071] In an embodiment, the authentication unit 214 associated with the processing
unit 102 can facilitate authentication of an entity 112. In an exemplary embodiment, if an
entity 112 wants to access the transcription system 100, he/ she first has to get
registered with the system 100. In an exemplary embodiment, in order to get registered in the
system 100, an entity 112 is required to provide his/ her temporal and auricular attributes,
such as, but not limited to, images of the entity, a biometric, a voice sample, and a handwriting
sample. The temporal and auricular attributes of said entity can be obtained through the input
units 108, and can further be stored at a server 106. In an embodiment, after completion of
the process of registration of said entity 112, a corresponding identity (ID) can be generated
against his/ her registration. In a preferred embodiment, the temporal attributes can be stored
in a first dataset and the auricular attributes, including, voice sample and handwriting sample,
can be stored in a second dataset that can be linked to the first dataset. In an exemplary
embodiment, the handwriting sample can be taken by capturing an image of a writing of
said entity 112 on paper. In another exemplary embodiment, the handwriting sample can
also be obtained via a soft copy, such as, but not limited to, a doc file, a pdf file,
and the like, that can contain handwriting of said entity 112.
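The registration flow above — two linked datasets keyed by an assigned identity — can be sketched as below. The class names, field names, and use of an in-memory dictionary are hypothetical; they illustrate only the linkage between the first (temporal) and second (auricular) datasets described in the disclosure.

```python
import uuid
from dataclasses import dataclass

# Hypothetical record shapes; the field names are illustrative assumptions.
@dataclass
class TemporalRecord:            # first dataset: image and biometric attributes
    entity_id: str
    face_image: bytes
    biometric: bytes

@dataclass
class AuricularRecord:           # second dataset: voice and handwriting samples
    entity_id: str               # links back to the first dataset
    voice_sample: bytes
    handwriting_sample: bytes

first_dataset: dict = {}
second_dataset: dict = {}

def register_entity(face, biometric, voice, handwriting) -> str:
    """Store the attributes in both linked datasets and return the assigned ID."""
    entity_id = uuid.uuid4().hex
    first_dataset[entity_id] = TemporalRecord(entity_id, face, biometric)
    second_dataset[entity_id] = AuricularRecord(entity_id, voice, handwriting)
    return entity_id
```

The shared `entity_id` realizes the linkage of the second dataset to the first described in paragraph [0071].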
[0072] In another exemplary embodiment, if an entity 112 tries to access the
transcription system 100 to obtain transcribed output, he/ she is required to enter his/ her
identity. If the entered identity matches with a corresponding dataset and is found to be
correct, then any or a combination of an image and a biometric of the entity 112 can be
obtained through a camera and a biometric device, respectively. In an embodiment, the
obtained image and/ or biometric is authenticated by comparing it with the first dataset,
and when the entity gets positively authenticated, then he/ she can be permitted to access the
system 100. In another embodiment, if the entity 112 cannot get positively authenticated then
he/ she cannot access the system 100.
[0073] In yet another exemplary embodiment, in case an entity 112 happens to forget
his/ her identity, the authentication unit 214 can also facilitate direct authentication of
said entity by obtaining any or a combination of an image and a biometric of the entity 112
through a camera and a biometric device, respectively, and further authenticating it.
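The two authentication paths in paragraphs [0072] and [0073] (identity plus image/biometric, and the direct forgotten-identity path) can be sketched together. The `matches` helper is a placeholder for a real image or biometric similarity check, and the per-entity dictionary shape of the first dataset is an assumption made for this sketch.

```python
# `matches` stands in for a real face/fingerprint comparison; here it is an
# exact byte comparison purely so the control flow can be demonstrated.
def matches(stored, probe):
    return probe is not None and stored == probe

def authenticate(first_dataset, entity_id=None, image=None, biometric=None):
    """Return the entity ID on positive authentication, else None."""
    if entity_id is not None:                 # normal path: ID plus image/biometric
        record = first_dataset.get(entity_id)
        if record and (matches(record["image"], image)
                       or matches(record["biometric"], biometric)):
            return entity_id
        return None
    # forgotten-ID path: directly search every record for a match
    for eid, record in first_dataset.items():
        if matches(record["image"], image) or matches(record["biometric"], biometric):
            return eid
    return None
```

A negative result on both checks corresponds to the negative authentication case, in which access to the system is denied.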
[0074] In an embodiment, the transcribing unit 216 associated with the processing
unit 102, along with the sampling unit 212, can facilitate transcription of the obtained
real-time acoustic signals associated with the speech of an entity 112. In an exemplary embodiment, the
transcribing unit 216 can facilitate transcription of words and sentences spoken by said entity
112 to his/ her handwriting in one or more languages. In an implementation, the sensed
acoustic signals can be transmitted to the sampling unit 212, where they can be sampled,
processed, and matched with the second dataset. Based on the sampling performed at the
sampling unit 212, the words spoken by the entity 112 can be determined, and the
determined words can then be transcribed into the handwriting of said entity 112 in one or
more languages and displayed.
[0075] In an embodiment, the first dataset and the second dataset can be updated
regularly, based on sampling and transcription being done. In another embodiment, the
transcription can be performed based on the updation of any or a combination of the first
dataset and the second dataset.
[0076] In an exemplary embodiment, if an entity 112 is registered and he/ she gets
positively authenticated, but is unwell, for example due to a cold or fever, his/ her voice
may sound different from that stored in the database 210 when he/ she speaks. The first
dataset and the second dataset can then be updated for said voice sample, and the
transcription can be performed accordingly.
[0077] In another exemplary embodiment, the entity 112 speaks “You would better
see it properly”, but the transcription in written form in his/ her handwriting is obtained as
“You would better sea it properly”. In this case, the entity 112 may pronounce the word
“see” again, or may spell it out, in order to obtain a correct transcription. The first dataset
and the second dataset can then be updated accordingly to distinguish between the words
“see” and “sea” and to determine which word is to be used in reference to a real-time
sentence spoken by the entity 112. The correct usage of such homophones can also be
determined via machine learning, along with the pronunciation of said word by said entity
112. In an embodiment, the updating can be done by utilizing techniques such as, but not
limited to, HMM and CNN techniques.
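The homophone correction above can be illustrated with a deliberately simple stand-in for the HMM/CNN techniques the disclosure names: a bigram count learned from the entity's corrected transcripts decides which spelling fits the preceding word. The class and method names are hypothetical.

```python
from collections import Counter

# Toy context model: counts of (previous word, word) pairs learned from
# corrected transcripts. This replaces, for illustration only, the HMM or
# CNN update described in paragraph [0077].
class HomophoneModel:
    def __init__(self):
        self.bigrams = Counter()

    def learn(self, corrected_sentence: str):
        """Update the dataset from a sentence the entity has corrected."""
        words = corrected_sentence.lower().split()
        self.bigrams.update(zip(words, words[1:]))

    def choose(self, previous_word: str, candidates):
        """Pick the homophone most often seen after the previous word."""
        return max(candidates,
                   key=lambda w: self.bigrams[(previous_word.lower(), w)])

model = HomophoneModel()
model.learn("you would better see it properly")   # the entity's correction
model.learn("we walked along the sea shore")
```

After the correction is learned, "better" selects "see" while "the" selects "sea", mirroring how the updated datasets disambiguate future real-time sentences.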
[0078] FIG. 3 is a flow diagram illustrating a process for performing transcription, in
accordance with an embodiment of the present disclosure.
[0079] In an embodiment, as illustrated in FIG. 3, the process for performing
transcription can include a step 302 of detecting, by an input unit, temporal and auricular
attributes of an entity.
[0080] In an embodiment, the process can include a step 304 of extracting, at one or
more processors of a processing unit, the temporal and auricular attributes from the set of
input signals that is generated in the step 302.
[0081] In an embodiment, the process can include a step 306 of comparing, at the one
or more processors, the temporal and auricular attributes, which are extracted in the step 304,
with a first dataset that can include pre-stored temporal and auricular attributes of one or
more entities, and correspondingly identifying the entity.
[0082] In an embodiment, the process can include a step 308 of receiving real time
acoustic signals associated with said entity from the input unit, and performing sampling of
the received real time acoustic signals.
[0083] In an embodiment, the process can include a step 310 of transcribing, at the
one or more processors, the real time acoustic signals that are sampled in the step 308,
responsive to matching of the sampled real time acoustic signals with a second dataset that
can include speech and handwriting in one or more languages of the identified entity,
and correspondingly generating a first set of signals.
[0084] In an embodiment, the process can include a step of registering the one or
more entities by obtaining the temporal and auricular attributes of each of the one or more
entities, and correspondingly assigning an identity (ID) to each of the one or more entities.
[0085] In an embodiment, the process can include a step of updating the first
dataset and the second dataset based on the extracted temporal and auricular attributes.
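The steps 302 through 310 above can be condensed into a single pipeline sketch. Every helper below is a deliberately simplified stand-in (exact-match identification, decimation in place of real sampling, a glyph lookup in place of real transcription); none of these function names come from the disclosure.

```python
def extract_attributes(input_signals):                      # step 304
    return input_signals["attributes"]

def identify(attributes, first_dataset):                    # step 306
    for entity_id, stored in first_dataset.items():
        if stored == attributes:                            # toy exact match
            return entity_id
    raise LookupError("entity not registered")

def sample(acoustic, step=2):                               # step 308
    return acoustic[::step]        # decimation as a stand-in for sampling

def transcribe(samples, handwriting_model):                 # step 310
    # stand-in: the "model" maps sampled values to handwritten glyphs
    return "".join(handwriting_model.get(s, "?") for s in samples)

def transcription_pipeline(input_signals, acoustic, first_dataset, second_dataset):
    attributes = extract_attributes(input_signals)          # steps 302/304
    entity_id = identify(attributes, first_dataset)         # step 306
    samples = sample(acoustic)                              # step 308
    return transcribe(samples, second_dataset[entity_id])   # step 310
```

An unregistered entity raises `LookupError` at step 306, which corresponds to denying access when identification fails.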
[0086] FIGs. 4A and 4B illustrate exemplary representations of working of the
system, in accordance with an embodiment of the present disclosure.
[0087] In an embodiment, as illustrated in FIG. 4A, image and acoustic signals
associated with an entity 112 can be received, and temporal and auricular attributes can be
extracted from the received image and acoustic signal. In another embodiment, the system
100 can include reference templates or models associated with temporal and auricular
attributes of various entities 112. In an embodiment, the system 100 can be configured to
match the extracted temporal and auricular attributes with the reference templates or models,
and correspondingly determine a similarity. In an exemplary embodiment, the reference
template or model having maximum similarity with the extracted temporal and auricular
attributes can be selected, and correspondingly the entity can be identified, and the identity of
said entity can be displayed.
[0088] In an embodiment, the system 100 can proceed in the following steps –
a) Entity (interchangeably referred to as speaker, hereinafter) identification and verification:
As illustrated, the process of speaker identification can be carried out by using a template
matching scheme over the n-speaker speech dataset. Verification is the process of
accepting or rejecting the identity claimed by a speaker. Most of the applications in which
voice is used to confirm identity are classified as speaker verification.
b) Speech to text conversion: Speech to text conversion is the process of converting spoken
words into machine-readable text. This process is also called speech recognition. All speech-to-text
systems rely on at least two models: an acoustic model and a language model. In addition,
large vocabulary systems use a pronunciation model. It is important to understand that there
is no such thing as a universal speech recognizer. To get the best transcription quality, all of
these models can be specialized for a given language, dialect, application domain, type of
speech, and communication channel. The speech transcript accuracy is highly dependent on
the speaker, the style of speech, and the environmental conditions. Speech recognition is a
harder process than people commonly think, even for a human being. Humans are used
to understanding speech, not to transcribing it, and only speech that is well formulated can be
transcribed without ambiguity.
[0089] In an exemplary embodiment, let’s say, said entity is identified as entity M
(also, referred to as speaker M), then auricular attributes of real-time acoustic signals can be
matched with reference template or model associated with entity M, and correspondingly a
similarity can be determined. In an embodiment, the determined similarity can be compared
with a threshold, and accordingly a verification process is carried out, and further the words
spoken by the speaker M that are associated with said real-time acoustic signal can be
transcribed in his/ her handwriting.
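The template matching of step a) and the threshold-based verification of paragraph [0089] can be sketched with cosine similarity over toy feature vectors. The feature values, the 0.9 threshold, and the function names are illustrative assumptions; the disclosure does not fix a particular similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify_speaker(features, templates):
    """Identification: pick the reference template with maximum similarity."""
    return max(templates, key=lambda eid: cosine(features, templates[eid]))

def verify_speaker(features, template, threshold=0.9):
    """Verification: accept or reject a claimed identity against a threshold."""
    return cosine(features, template) >= threshold

# Hypothetical reference templates for two registered speakers.
templates = {"M": [0.9, 0.1, 0.3], "N": [0.1, 0.8, 0.5]}
probe = [0.85, 0.15, 0.28]        # real-time features of the current speaker
```

Identification selects the template of maximum similarity (here speaker M); verification then compares that similarity against the threshold before transcription in the speaker's handwriting proceeds.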
[0090] In an exemplary embodiment, as illustrated in FIG. 4B, an entity can speak
through a mic 108, and corresponding acoustic signals can be received by the system 100 and
further speech analysis can be performed. In an exemplary embodiment, the acoustic signals
can be modified by a speaker modification unit, which can further be processed to operator
dataset. In an embodiment, it can be further compared to a class level, and handwriting styles
of the speaker M can be extracted from said class level, and further the acoustic signals can
be transcribed into his/ her handwriting. In an embodiment, the system 100 can also include
speech to text conversion unit that can be coupled with the speech recognition dataset. In an
exemplary embodiment, the speaker modification unit and the speech to text conversion unit
can be utilized simultaneously to transcribe real time acoustic signals into a text having
handwriting of said speaker.
[0091] FIG. 5 illustrates an exemplary computer system in which or with which
embodiments of the present invention can be utilized in accordance with embodiments of the
present disclosure.
[0092] As shown in FIG. 5, computer system includes an external storage device 510,
a bus 520, a main memory 530, a read only memory 540, a mass storage device 550,
communication port 560, and a processor 570. A person skilled in the art will appreciate that
computer system may include more than one processor and communication ports. Examples
of processor 570 include, but are not limited to, an Intel® Itanium® or Itanium 2
processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of
processors, FortiSOC™ system on a chip processors or other future processors. Processor
570 may include various modules associated with embodiments of the present invention.
Communication port 560 can be any of an RS-232 port for use with a modem based dialup
connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial
port, a parallel port, or other existing or future ports. Communication port 560 may be chosen
depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN),
or any network to which the computer system connects.
[0093] In an embodiment, the memory 530 can be Random Access Memory (RAM),
or any other dynamic storage device commonly known in the art. Read only memory 540 can
be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory
(PROM) chips for storing static information, e.g., start-up or BIOS instructions for processor
570. Mass storage 550 may be any current or future mass storage solution, which can be used
to store information and/or instructions. Exemplary mass storage solutions include, but are
not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced
Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external,
e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from
Seagate (e.g., the Seagate Barracuda 7102 family) or Hitachi (e.g., the Hitachi Deskstar
7K1000), one or more optical discs, Redundant Array of Independent Disks (RAID) storage,
e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill
Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
[0094] In an embodiment, the bus 520 communicatively couples processor(s) 570
with the other memory, storage and communication blocks. Bus 520 can be, e.g. a Peripheral
Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System
Interface (SCSI), USB or the like, for connecting expansion cards, drives and other
subsystems as well as other buses, such as a front side bus (FSB), which connects processor 570
to the software system.
[0095] In another embodiment, operator and administrative interfaces, e.g. a display,
keyboard, and a cursor control device, may also be coupled to bus 520 to support direct
operator interaction with computer system. Other operator and administrative interfaces can
be provided through network connections connected through communication port 560.
External storage device 510 can be any kind of external hard drive, floppy drive,
IOMEGA® Zip Drive, Compact Disc - Read Only Memory (CD-ROM), Compact Disc -
Re-Writable (CD-RW), or Digital Video Disk - Read Only Memory (DVD-ROM). Components
described above are meant only to exemplify various possibilities. In no way should the
aforementioned exemplary computer system limit the scope of the present disclosure.
[0096] Thus, it will be appreciated by those of ordinary skill in the art that the
diagrams, schematics, illustrations, and the like represent conceptual views or processes
illustrating systems and methods embodying this invention. The functions of the various
elements shown in the figures may be provided through the use of dedicated hardware as well
as hardware capable of executing associated software. Similarly, any switches shown in the
figures are conceptual only. Their function may be carried out through the operation of
program logic, through dedicated logic, through the interaction of program control and
dedicated logic, or even manually, the particular technique being selectable by the entity
implementing this invention. Those of ordinary skill in the art further understand that the
exemplary hardware, software, processes, methods, and/or operating systems described
herein are for illustrative purposes only and, thus, are not intended to be limited to any
particular named manufacturer.
[0097] While embodiments of the present invention have been illustrated and
described, it will be clear that the invention is not limited to these embodiments only.
Numerous modifications, changes, variations, substitutions, and equivalents will be apparent
to those skilled in the art, without departing from the spirit and scope of the invention, as
described in the claim.
[0098] In the foregoing description, numerous details are set forth. It will be apparent,
however, to one of ordinary skill in the art having the benefit of this disclosure, that the
present invention may be practiced without these specific details. In some instances,
well-known structures and devices are shown in block diagram form, rather than in detail, to
avoid obscuring the present invention.
[0099] As used herein, and unless the context dictates otherwise, the term "coupled
to" is intended to include both direct coupling (in which two elements that are coupled to
each other contact each other) and indirect coupling (in which at least one additional element
is located between the two elements). Therefore, the terms "coupled to" and "coupled with"
are used synonymously. Within the context of this document, the terms "coupled to" and
"coupled with" are also used to mean "communicatively coupled with" over a network,
where two or more devices are able to exchange data with each other over the network,
possibly via one or more intermediary devices.
[00100] It should be apparent to those skilled in the art that many more modifications
besides those already described are possible without departing from the inventive concepts
herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of
the appended claims. Moreover, in interpreting both the specification and the claims, all
terms should be interpreted in the broadest possible manner consistent with the context. In
particular, the terms “comprises” and “comprising” should be interpreted as referring to
elements, components, or steps in a non-exclusive manner, indicating that the referenced
elements, components, or steps may be present, or utilized, or combined with other elements,
components, or steps that are not expressly referenced. Where the specification or claims
refer to at least one of something selected from the group consisting of A, B, C .... and N, the
text should be interpreted as requiring only one element from the group, not A plus N, or B
plus N, etc.
[00101] While the foregoing describes various embodiments of the invention, other and
further embodiments of the invention may be devised without departing from the basic scope
thereof. The scope of the invention is determined by the claims that follow. The invention is
not limited to the described embodiments, versions or examples, which are included to enable
a person having ordinary skill in the art to make and use the invention when combined with
information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[00102] The present disclosure provides a system and method for speech to text
conversion.
[00103] The present disclosure provides a system and method for conversion of speech
of an entity to a text comprising his/ her handwriting.
[00104] The present disclosure provides a system and method for recognition of the
voice of an entity in real time and corresponding conversion into text in minimal time.
[00105] The present disclosure provides a system and method that is efficient, cost
effective, and easy to implement.

We Claim:

1. A transcription system comprising:
an input unit configured to detect temporal and auricular attributes of an
entity, and correspondingly generate a set of input signals;
a processing unit operatively coupled to the input unit, the processing unit
comprising one or more processors, and coupled with a memory, the memory storing
instructions executable by the one or more processors and configured to:
extract the temporal and auricular attributes from the set of input
signals;
compare the extracted temporal and auricular attributes with a first
dataset comprising pre-stored temporal and auricular attributes of one or more
entities, and correspondingly identify the entity;
receive real time acoustic signals associated with said entity from the
input unit, and perform sampling; and
transcribe the sampled real time acoustic signals, responsive to
matching of the sampled real time acoustic signals with a second dataset
comprising speech and handwriting in one or more languages of the identified
entity, and correspondingly generate a first set of signals.
2. The system as claimed in claim 1, wherein the input unit comprises any or a
combination of mic, microphone, touchpad, tactile sensor, biometric device, and
camera.
3. The system as claimed in claim 1, wherein the temporal and auricular attributes
comprise any or a combination of images, biometric, voice sample, and handwriting
sample of each of the one or more registered entities.
4. The system as claimed in claim 1, wherein one or more entities are registered into the
system by obtaining the temporal and auricular attributes of each of the one or more
entities, and correspondingly an identity (ID) is assigned to each of the one or more
entities.
5. The system as claimed in claim 4, wherein the processing unit is configured to update
the first dataset and the second dataset based on the extracted temporal and auricular
attributes.
6. The system as claimed in claim 4, wherein the system comprises a display device
operatively coupled to the processing unit, and configured to receive the first set of
signals, and correspondingly display any or a combination of the transcribed real-time
acoustic signals, temporal and auricular attributes, user ID of the entity, and
image and gestures of the entity.
7. The system as claimed in claim 6, wherein the display device comprises any or a
combination of smart phone, computer screen, television (TV) screen, laptop, and
tablet.
8. A transcription method comprising steps of:
detecting, by an input unit, temporal and auricular attributes of an entity;
extracting, at one or more processors of a processing unit, the temporal and
auricular attributes from the set of input signals;
comparing, at the one or more processors, the extracted temporal and auricular
attributes with a first dataset comprising pre-stored temporal and auricular attributes
of one or more entities, and correspondingly identifying the entity;
receiving real time acoustic signals associated with said entity from the input
unit, and performing sampling; and
transcribing, at the one or more processors, the sampled real time acoustic
signals, responsive to matching of the sampled real time acoustic signals with a
second dataset comprising speech and handwriting in one or more languages of the
identified entity, and correspondingly generating a first set of signals.
9. The method as claimed in claim 8, wherein the method comprises a step of registering
the one or more entities by obtaining the temporal and auricular attributes of each
of the one or more entities, and correspondingly assigning an identity (ID) to each of
the one or more entities.
10. The method as claimed in claim 8, wherein the method comprises a step of updating
the first dataset and the second dataset based on the extracted temporal and auricular
attributes.

Documents

Orders

Section Controller Decision Date
15,43 Shrikant Bagde 2024-06-14

Application Documents

# Name Date
1 202011031195-IntimationOfGrant14-06-2024.pdf 2024-06-14
1 202011031195-STATEMENT OF UNDERTAKING (FORM 3) [21-07-2020(online)].pdf 2020-07-21
2 202011031195-PatentCertificate14-06-2024.pdf 2024-06-14
2 202011031195-FORM FOR STARTUP [21-07-2020(online)].pdf 2020-07-21
3 202011031195-FORM FOR SMALL ENTITY(FORM-28) [21-07-2020(online)].pdf 2020-07-21
3 202011031195-Annexure [14-05-2024(online)].pdf 2024-05-14
4 202011031195-Written submissions and relevant documents [14-05-2024(online)].pdf 2024-05-14
4 202011031195-FORM 1 [21-07-2020(online)].pdf 2020-07-21
5 202011031195-FORM-26 [23-04-2024(online)].pdf 2024-04-23
5 202011031195-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [21-07-2020(online)].pdf 2020-07-21
6 202011031195-EVIDENCE FOR REGISTRATION UNDER SSI [21-07-2020(online)].pdf 2020-07-21
6 202011031195-Correspondence to notify the Controller [22-04-2024(online)].pdf 2024-04-22
7 202011031195-US(14)-HearingNotice-(HearingDate-29-04-2024).pdf 2024-03-15
7 202011031195-DRAWINGS [21-07-2020(online)].pdf 2020-07-21
8 202011031195-DECLARATION OF INVENTORSHIP (FORM 5) [21-07-2020(online)].pdf 2020-07-21
8 202011031195-CLAIMS [23-02-2023(online)].pdf 2023-02-23
9 202011031195-CORRESPONDENCE [23-02-2023(online)].pdf 2023-02-23
9 202011031195-COMPLETE SPECIFICATION [21-07-2020(online)].pdf 2020-07-21
10 202011031195-FER_SER_REPLY [23-02-2023(online)].pdf 2023-02-23
10 202011031195-Proof of Right [23-07-2020(online)].pdf 2020-07-23
11 202011031195-FER.pdf 2022-08-25
11 202011031195-FORM-26 [23-07-2020(online)].pdf 2020-07-23
12 202011031195-FORM 18 [14-03-2022(online)].pdf 2022-03-14

Search Strategy

1 24AUG2E_24-08-2022.pdf

ERegister / Renewals