Abstract: Disclosed herein is a system and method for facilitating text to audio conversation between users on a messaging platform. The present disclosure provides a technique that uses the attributes of the user’s voice to learn about the emotional state of the user. Accordingly, once the system has learned the user’s voice attributes and the corresponding emotional states, it may apply these factors to a text message written by the user during a conversation. The system is capable of converting the textual content of the message into corresponding audio content and sharing it with the recipient in the speaking style of the user.
[0001] The present disclosure relates to instant messaging, and more particularly to a system and method of facilitating text to audio conversation between a user and a recipient in the user’s speaking style.
BACKGROUND
[0002] The proliferation of social networking provides users with various means of interaction, such as text, audio, and video. Nowadays, people tend to send messages or make audio/video calls to express their emotions and connect with one another, instead of travelling to meet the other person, which also reduces the time and cost of transportation. However, many a time, a user is not in a situation where he/she can make an audio/video call to another person; in such cases, text messages are the best option to convey the intent of one user to another. Thus, instant messaging technology keeps upgrading itself to cater to the requirements of users.
[0003] Existing techniques allow a user to write a plain text message through a message editor, or to select and embed an existing emoji from a palette of emojis into the message editor, while messaging another user/recipient. Further, for expressing one’s emotions, messaging platforms provide emojis/avatars. Avatars/emojis help express the emotional reaction of a user while communicating with the recipient. However, users are still not able to emotionally connect with another user in the same way as they could while conversing over a phone call. In other words, these messages sometimes fail to express the user’s emotion in the way the user intends.
[0004] Thus, there exists a need for technology that can help the user express emotion in the same way as he/she intends. To address this, the present invention provides an efficient way of converting the text messages shared by the user into his/her style of expressing through voice for the recipient.
OBJECT OF THE INVENTION
[0005] An object of the present disclosure is to correctly understand the speaking style of the user and convert the text message into audio content in the speaking style of the user.
[0006] Another object of the present disclosure is to emotionally connect the recipient with the user through his/her voice.
SUMMARY
[0007] The present disclosure overcomes one or more shortcomings of the prior art and
provides additional advantages discussed throughout the present disclosure. Additional
features and advantages are realized through the techniques of the present disclosure. Other
embodiments and aspects of the disclosure are described in detail herein and are considered
a part of the claimed disclosure.
[0008] In an embodiment of the present disclosure, a method of facilitating text to audio conversation between users on a messaging platform is disclosed. The method comprises monitoring a current conversation between a user and a recipient to capture conversation data. The conversation data comprises user messages and recipient messages in a textual format. The method further comprises applying a pretrained voice model upon the conversation data while the user is in conversation with the recipient. The method further comprises applying a Natural language processing (NLP) technique upon the user messages to determine a set of text content indicating a set of emotional states of the user. The method further comprises converting at least one of the sets of text content into corresponding voice content expressing the emotional state of the user in the speaking style of the user.
[0009] In another embodiment of the present disclosure, a system for facilitating text to audio conversation between users on a messaging platform is disclosed. The system comprises a monitoring unit configured to monitor a current conversation between a user and a recipient to capture conversation data. The conversation data comprises user messages and recipient messages in a textual format. The system further comprises an application unit configured to apply a pretrained voice model upon the conversation data while the user is in conversation with the recipient. The application unit further applies a Natural language processing (NLP) technique upon the user messages to determine a set of text content indicating a set of emotional states of the user. The system also comprises a conversion unit configured to convert at least one of the sets of text content into corresponding voice content expressing the emotional state of the user in the speaking style of the user.
[0010] The foregoing summary is illustrative only and is not intended to be in any way
limiting. In addition to the illustrative aspects, embodiments, and features described above,
further aspects, embodiments, and features will become apparent by reference to the
drawings and the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The accompanying drawings, which are incorporated in and constitute a part of this
disclosure, illustrate exemplary embodiments and, together with the description, serve to
explain the disclosed embodiments. In the figures, the left-most digit(s) of a reference
number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of systems and/or methods in accordance with embodiments of the present
subject matter are now described, by way of example only, and with reference to the
accompanying figures, in which:
[0012] Figure 1A shows an environment 100 for converting textual content shared by the user into audio content in the user’s speaking style, in accordance with an embodiment of the present disclosure;
[0013] Figure 1B shows an exemplary embodiment which represents conversion of a text message shared by the user “SAM” into corresponding audio content in his speaking style;
[0014] Figure 2 shows a block diagram 200 illustrating a system for facilitating text to audio
conversation between users on a messaging platform, in accordance with an embodiment
of the present disclosure;
[0015] Figure 3 shows a method 300 for generating a trained voice model to express the emotional state of the user, in accordance with an embodiment of the present disclosure; and
[0016] Figure 4 shows a method 400 for facilitating text to audio conversation between
users on a messaging platform, in accordance with an embodiment of the present disclosure.
[0017] The figures depict embodiments of the disclosure for purposes of illustration only.
One skilled in the art will readily recognize from the following description that alternative
embodiments of the structures and methods illustrated herein may be employed without
departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION
[0018] The foregoing has broadly outlined the features and technical advantages of the
present disclosure in order that the detailed description of the disclosure that follows may
be better understood. It should be appreciated by those skilled in the art that the conception
and specific embodiment disclosed may be readily utilized as a basis for modifying or
designing other structures for carrying out the same purposes of the present disclosure.
[0019] The novel features which are believed to be characteristic of the disclosure, both as
to its organization and method of operation, together with further objects and advantages
will be better understood from the following description when considered in connection
with the accompanying figures. It is to be expressly understood, however, that each of the
figures is provided for the purpose of illustration and description only and is not intended
as a definition of the limits of the present disclosure.
[0020] In the present document, the word "exemplary" is used herein to mean "serving as
an example, instance, or illustration". Any embodiment or implementation of the present
subject matter described herein as "exemplary" is not necessarily to be construed as
preferred or advantageous over other embodiments.
[0021] Further, the terms “comprises”, “comprising”, or any other variations thereof, are intended to cover non-exclusive inclusion, such that a setup or device that comprises a list of components does not include only those components but may include other components not expressly listed or inherent to such setup or device. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other or additional elements in the system, apparatus, or device.
[0022] Furthermore, the terms “voice” and “audio” may be used interchangeably or in combination throughout the description.
[0023] Furthermore, the terms “recipient” and “another user” may be used interchangeably or in combination throughout the description.
[0024] Disclosed herein is a method and a system for facilitating text to audio conversion. The demand for a real-world experience is increasing in every field of technology, including instant messaging. The user may use text, audio, or video calls to feel the presence of another user and to connect with him/her socially and emotionally. However, in many situations, such as when unwanted circumstances are present, it may not always be possible to place an audio or video call to another person and convey the user’s intent to the recipient. In such situations, the present disclosure provides an effective way in which the user may write a text message on the messaging platform while his/her voice, with the correct speaking style and tone, is received by the other user with the same emotional connection as a voice call would provide.
[0025] Further, when people chat with each other, they share their feelings, emotions, and/or information. The level of sharing emotions or feelings varies from one person to another depending upon their emotional connection or relationship. So, the system allows a user to enable text to voice content conversion for the persons with whom the user wants to share voice content instead of a text message, even when the input provided by him/her is in text format. For example, a text message may not indicate the exact speaking style of the user, the expression of the user, or the mood of the user. Therefore, it becomes quite difficult for the recipient to connect with the user’s context and to feel the joy of hearing him/her. This personalized connection with another user, and the experience of the user’s emotional state in his/her speaking style even when the user is typing text, is explained in the paragraphs of the specification below.
[0026] The present disclosure provides a technique that uses the attributes of the user’s voice to learn about the emotional state of the user. Accordingly, once the system has learned the user’s voice attributes and the corresponding emotional states, it may apply these factors to the text messages written by the user during a conversation. The system is capable of converting the textual content of a message into corresponding audio content and sharing it with the recipient in the speaking style of the user. In this process, a voice model extracts the attributes of the user’s voice for each emotional state of the user. Based on these, the voice model is trained on the voice modulation and speaking style of the user so that, once trained, it can closely imitate the user’s speaking style corresponding to the context of the text message shared by him/her. During real-time operation, when a user provides a text message, the system determines the context of the text message, and the pre-trained voice model transforms the text message into an audio message and shares it with the recipient in the speaking style of the user. In this way, the present disclosure allows a more realistic representation of the user’s emotions and also provides an alternative to voice calling.
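By way of a non-limiting illustration of the flow described above, a minimal Python sketch is given below. The function names, the keyword cues, and the dummy audio payload are assumptions made only for this sketch; a real implementation would plug the trained voice model into the placeholder synthesis step.

```python
# Minimal sketch of the disclosed flow: text message -> context/emotional state ->
# audio in the user's speaking style. All names are illustrative placeholders.

EMOTION_CUES = {                      # toy context cues, assumed for illustration
    "sad": {"ohhh", "alas", "lost"},
    "happy": {"yahoooo", "whoa", "yippy", "enjoy", "found"},
}

def detect_emotional_state(message: str) -> str:
    """Rough stand-in for the NLP step that infers the user's emotional state."""
    tokens = {t.strip("!.,?").lower() for t in message.split()}
    for state, cues in EMOTION_CUES.items():
        if tokens & cues:
            return state
    return "neutral"

def synthesize_in_user_style(message: str, state: str, voice_profile: dict) -> bytes:
    """Placeholder for the pretrained voice model's text-to-speech step.

    A real system would condition a TTS engine on the per-emotion audio
    attributes stored in voice_profile[state] (pitch, tone, pauses, ...).
    """
    attrs = voice_profile.get(state, {})
    return f"<audio|{state}|{attrs}|{message}>".encode()   # dummy audio payload

def convert_text_message(message: str, voice_profile: dict) -> bytes:
    state = detect_emotional_state(message)
    return synthesize_in_user_style(message, state, voice_profile)

# Example: SAM types text; the recipient receives audio in SAM's "happy" style.
profile = {"happy": {"mean_pitch_hz": 180, "rate_wpm": 150}}
audio = convert_text_message("Yahoooo! I am still enjoying bachelor life", profile)
```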
[0027] Figure 1A presents an exemplary environment of a system for converting text content shared by a user on a messaging platform into audio content, in accordance with an embodiment of the present disclosure. Figure 1B shows an application of the exemplary environment presented in Figure 1A. Further, Figure 1A illustrates the training aspect of the system for learning about the emotional state of the user and his/her speaking style, and Figure 1B illustrates the conversion of textual content shared by the user into audio content while conversing with another user in real time. A person skilled in the art, by referring to these figures (1A and 1B), may understand that Figure 1B is a real-time application of the learning performed by the trained voice model in Figure 1A. It must also be appreciated that the system presented in Figures 1A and 1B is exemplary and the system may also be implemented in various environments other than as shown in Figs. 1A and 1B.
[0028] The detailed explanation of the exemplary environment 100 is given in conjunction with Figure 2, which shows a block diagram 200 of a system 202 for text to audio conversation between users, in accordance with an embodiment of the present disclosure. Although the present disclosure is explained considering that the system 202 is implemented on a server, it may be understood that the system 202 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, or a cloud-based computing environment. It may be understood that the system 202 may be accessed by multiple users through one or more user devices or applications residing on the user devices. In one implementation, the system 202 may comprise the cloud-based computing environment in which a user may operate individual computing systems configured to execute remotely located applications. Examples of the user devices may include, but are not limited to, an IoT device, an IoT gateway, a portable computer, a personal digital assistant, a handheld device, and a workstation. The user devices are communicatively coupled to the system 202 through a network.
[0029] In one implementation, the network may be a wireless network, a wired network or
a combination thereof. The network can be implemented as one of the different types of
networks, such as intranet, local area network (LAN), wide area network (WAN), the
internet, and the like. The network may either be a dedicated network or a shared network.
The shared network represents an association of the different types of networks that use a
variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Hypertext Transfer
Protocol Secure (HTTPS), Transmission Control Protocol/Internet Protocol (TCP/IP),
Wireless Application Protocol (WAP), and the like, to communicate with one another.
Further, the network may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0030] In one implementation, the system 202 may comprise an I/O interface 204, a
processor 206, a memory 208 and the units 210. The memory 208 may be communicatively
coupled to the processor 206 and the units 210. The processor 206 may be implemented as
one or more microprocessors, microcomputers, microcontrollers, digital signal processors,
central processing units, state machines, logic circuitries, and/or any devices that
manipulate signals based on operational instructions. Among other capabilities, the
processor 206 is configured to fetch and execute computer-readable instructions stored in
the memory 208. The I/O interface 204 may include a variety of software and hardware
interfaces, for example, a web interface, a graphical user interface, and the like. The I/O
interface 204 may allow the system 202 to interact with the user directly or through the user
devices. Further, the I/O interface 204 may enable the system 202 to communicate with
other computing devices, such as web servers and external data servers (not shown). The
I/O interface 204 can facilitate multiple communications within a wide variety of networks
and protocol types, including wired networks, for example, LAN, cable, etc., and wireless
networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or
more ports for connecting many devices to one another or to another server.
[0031] In one implementation, the units 210 may comprise a generation unit 212, a monitoring unit 214, a conversion unit 216, an application unit 218 and an updation unit 220. According to embodiments of the present disclosure, these units 212-220 may comprise hardware components like processors, microprocessors, microcontrollers, or application-specific integrated circuits for performing various operations of the system 202. It must be understood by a person skilled in the art that the processor 206 may also perform all the functions of the units 212-220 according to various embodiments of the present disclosure.
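Purely as a structural illustration, and assuming a software realization in Python (the class and method names below, other than the unit reference numerals, are not taken from the disclosure), the units 212-220 of the system 202 might be organized as follows:

```python
# Illustrative skeleton of system 202: each unit is shown as a thin class so that
# the division of responsibilities described above is explicit. The disclosure
# equally allows these units to be hardware blocks or functions of processor 206.

class GenerationUnit:        # 212: builds the pretrained voice model from audio snippets
    pass

class MonitoringUnit:        # 214: captures conversation data (user and recipient messages)
    def capture(self, user_msgs: list, recipient_msgs: list) -> dict:
        return {"user": user_msgs, "recipient": recipient_msgs}

class ConversionUnit:        # 216: converts text content into audio in the user's style
    pass

class ApplicationUnit:       # 218: applies the pretrained voice model and the NLP technique
    pass

class UpdationUnit:          # 220: keeps the confidence score up to date as the model learns
    pass

class System202:
    """Container mirroring the units 210 of Figure 2."""
    def __init__(self) -> None:
        self.generation = GenerationUnit()
        self.monitoring = MonitoringUnit()
        self.conversion = ConversionUnit()
        self.application = ApplicationUnit()
        self.updation = UpdationUnit()
```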
[0032] As explained above, Figure 1A describes the “training” of the system 102 to convert text content into audio content in the user’s speaking style. To train the system 102, firstly, a plurality of audio snippets is received by the system 102 via the receiving unit 212a of the generation unit 212. Each of the audio snippets presents a portion of audio generated by the user for an expression. For example, if the expressions are “How are you?” and “Where have you been these days?”, then the system may receive at least two audio snippets: a first audio snippet for the content “How are you?” and a second audio snippet for the content “Where have you been these days?”. The system 102 may then register these audio snippets via the registration unit 212b of the generation unit 212.
[0033] In an exemplary embodiment, the registration unit 212b may register the audio snippets based on the speaking style of the user. The registration unit 212b may register audio snippets of the same user during various instances. The audio snippets may provide an indication of the emotional state of the user and of how he/she naturally expresses emotions, by way of his/her speaking style, during a conversation at any instance.
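For illustration only, and assuming a simple in-memory data model whose field names are invented for this sketch, the receiving unit 212a and the registration unit 212b could be sketched as follows:

```python
# Sketch of receiving (212a) and registering (212b) audio snippets, each tied to
# the conversation instance in which the user naturally expressed an emotion.

from dataclasses import dataclass, field

@dataclass
class AudioSnippet:
    expression: str        # e.g. "How are you?"
    audio_path: str        # recorded portion of the user's speech (assumed file path)
    instance_id: int       # which conversation instance the snippet came from

@dataclass
class SnippetRegistry:
    snippets: list = field(default_factory=list)

    def register(self, snippet: AudioSnippet) -> None:
        """Register one snippet of the user's natural speech for later training."""
        self.snippets.append(snippet)

    def by_instance(self, instance_id: int) -> list:
        return [s for s in self.snippets if s.instance_id == instance_id]

registry = SnippetRegistry()
registry.register(AudioSnippet("How are you?", "sam_inst1_001.wav", instance_id=1))
registry.register(AudioSnippet("Where have you been these days?", "sam_inst1_002.wav", 1))
```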
[0034] In the exemplary environment 100A, during an instance in which SAM is conversing with Johana:
SAM-Hi Johana, How are you?
Where have you been these days?
JOHANA- Hey Sam, I am gud, busy these days with kids exams.
You say….
SAM-I am still enjoying bachelor life
In this instance, different audio snippets corresponding to the user’s (i.e., SAM’s) speaking style and emotional state are registered for the expressions. Then, an extraction unit 212c of the generation unit 212 extracts audio attributes from each of the plurality of audio snippets during each of the plurality of instances. In this exemplary embodiment, the extraction unit has extracted the audio attributes from SAM’s speaking style and considers SAM’s emotional state to be “Happy” when he speaks with Johana for the expression “I am still enjoying bachelor life”.
[0035] In another exemplary embodiment, SAM is conversing with Ruby:
SAM-Hi Ruby, what’s up?
Ruby-Hi Sam, I am good.
I saw you at Arnav’s place, everything all right.
SAM- No, yar He lost his brother in an accident.
In this instance, different audio snippets corresponding to the user’s (i.e., SAM’s) speaking style and representing his emotional state are also registered for the expressions. Then, the extraction unit 212c of the generation unit 212 extracts audio attributes from each of the plurality of audio snippets during each of the plurality of instances. In this exemplary embodiment, the extraction unit 212c extracts the audio attributes from SAM’s speaking style and considers SAM’s emotional state to be “Sad” when he speaks with Ruby for the expression “No, yar He lost his brother in an accident”.
[0036] In yet another exemplary embodiment, SAM is conversing with Sandy:
Sandy- Hey bro, what’s up?
SAM-Hey dude, i am good.
How’s life?
Sandy- Getting married on coming valentine’s day and you know her?
SAM- NO yar, Tell me.
Sandy-Miss fresher of your college and your best friend.
SAM-Trishuuuu and uuuuuuuu.
In this instance, different audio snippets corresponding to the user’s (i.e., SAM’s) speaking style and emotional state are also registered for the expressions. Then, the extraction unit 212c of the generation unit 212 extracts audio attributes from each of the plurality of audio snippets during each of the plurality of instances. In this exemplary embodiment, the extraction unit 212c extracts the audio attributes and considers SAM’s emotional state to be “Surprised” when he speaks with Sandy in the expression “Trishuuuu and uuuuuuuu”.
[0037] In still another embodiment, at a second instance, SAM is conversing with Johana:
SAM-Hi Johana, where are you?
Johana-Hi Sam, I am stuck in traffic jam and will be late for movie.
SAM- yarrrr, That’s why i asked you to start early.
Johana- will try SAM.
[0038] In this instance, different audio snippets corresponding to the user’s (i.e., SAM’s) speaking style and emotional state are also registered for the expressions. Then, the extraction unit 212c of the generation unit 212 extracts audio attributes from each of the plurality of audio snippets during each of the plurality of instances. In this exemplary embodiment, the extraction unit 212c extracts the audio attributes and considers SAM’s emotional state to be “Angry” when he speaks with Johana for the expression “yarrrr, That’s why i asked you to start early”.
[0039] Similarly, the registration unit 212b registers the audio snippets of the user and the extraction unit 212c extracts the audio attributes from each of the plurality of audio snippets during each of the plurality of instances. The set of audio attributes extracted by the extraction unit 212c may include, but is not limited to, a voice tone, a pitch, an accent, a pronunciation style, a stress level and a pause interval. In the exemplary embodiment, when SAM was “Sad”, his audio attributes were captured as a higher pitch, less intensity, and a longer duration with more pause intervals. In another exemplary embodiment, when SAM was “Angry”, his audio attributes indicated a lower pitch, higher intensity, and more energy.
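As one non-prescribed way of computing such attributes, and assuming the open-source librosa and numpy packages are available (the disclosure does not mandate any particular signal-processing toolkit, and the silence threshold below is an arbitrary choice), the extraction unit 212c might derive pitch, intensity, duration, and pause statistics roughly as follows:

```python
# Rough sketch of audio-attribute extraction for one snippet: pitch, intensity
# (RMS energy), speech duration, and pause intervals. Thresholds are arbitrary.

import numpy as np
import librosa

def extract_audio_attributes(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)

    # Fundamental frequency (pitch) track; NaN where a frame is unvoiced.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)

    # Intensity proxy: root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)[0]

    # Pause estimation: gaps between consecutive non-silent intervals.
    voiced = librosa.effects.split(y, top_db=30)
    pauses = [(start - prev_end) / sr
              for (_, prev_end), (start, _) in zip(voiced[:-1], voiced[1:])]

    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "mean_intensity": float(rms.mean()),
        "duration_s": len(y) / sr,
        "mean_pause_s": float(np.mean(pauses)) if pauses else 0.0,
    }
```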
[0040] The extraction unit 212c may consider at least one or a combination of these attributes while extracting the audio attributes from the audio snippets. Further, it must also be noted that, for efficient training of the system 102, the selection of voice/audio attributes may vary for different users on the basis of their gender, age, area of residence, mother tongue, pronunciation style, etc. For example, between male and female speakers there are differences in the rate of speech, the range of pitch, the duration of speech, and the pitch slope. Emotional states such as happiness, sadness, anger, and disgust can be determined solely based on the acoustic structure of a non-linguistic speech act. Speech requires highly precise and coordinated movement of the articulators (e.g., lips, tongue, and larynx) in order to transmit linguistic information. Thus, considering the audio attributes along with the emotional state of the user provides effective training to the voice model. It may also be noted that in the exemplary environment 100A, only four scenarios are covered for understanding the concept of how the registration unit 212b registers the audio snippets and the extraction unit 212c extracts the
audio attributes from the audio snippets. However, the number of audio snippets received
by the system 102 for training can be higher or lower.
[0041] The main objective of the training is to convert the text generated by the user during a conversation with another user into corresponding audio content which can imitate the user’s emotion in the same speaking style that the user has. To do so, after extracting the audio attributes, the system compares the plurality of audio attributes extracted in one instance with another plurality of audio attributes extracted during another instance via the comparator 212d. For example, when SAM is conversing with Johana during a first instance, his voice attributes for some expression may differ from those while SAM was conversing with Johana during a second instance, and may be the same as the audio attributes when SAM was conversing with Ruby or Sandy. Therefore, the comparator 212d compares the audio attributes of the user at one instance with the audio attributes at another instance to determine a similarity amongst the plurality of audio attributes and another plurality of audio attributes.
[0042] Further, based on the comparison of the audio attributes of the user at one instance with the audio attributes of the user at another instance, the generation unit 212 generates a confidence score. The confidence score indicates the ability of the pretrained voice model to convert user messages into corresponding voice content in the user’s speaking style in real time while the user is in conversation with the recipient. For example, if the comparator 212d determines that the confidence score generated by the generation unit 212 is above a threshold limit (e.g., four samples having the same audio attributes), then the trained voice model is able to convert the textual content into corresponding audio content in the user’s speaking style. Further, the voice model learns gradually, and an updation unit 220 keeps updating the confidence score for those audio attributes where a similarity is present, based on comparison of the plurality of audio attributes extracted in one instance with another plurality of audio attributes.
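By way of illustration only, the comparator 212d and the confidence score might be realized as below; the relative tolerance, the pairwise aggregation rule, and the example attribute values are assumptions made for this sketch rather than features of the disclosure:

```python
# Sketch of comparing attribute sets across instances and deriving a confidence
# score as the fraction of attribute pairs that agree within a tolerance.

def attributes_match(a: dict, b: dict, rel_tol: float = 0.15) -> bool:
    """True if every shared numeric attribute of two instances agrees within rel_tol."""
    shared = a.keys() & b.keys()
    return all(abs(a[k] - b[k]) <= rel_tol * max(abs(a[k]), abs(b[k]), 1e-9)
               for k in shared)

def confidence_score(instances: list[dict]) -> float:
    """Fraction of instance pairs whose audio attributes are similar."""
    pairs = [(i, j) for i in range(len(instances)) for j in range(i + 1, len(instances))]
    if not pairs:
        return 0.0
    matches = sum(attributes_match(instances[i], instances[j]) for i, j in pairs)
    return matches / len(pairs)

# e.g. four "happy" samples from SAM across different conversations
happy_samples = [
    {"mean_pitch_hz": 182.0, "mean_intensity": 0.080},
    {"mean_pitch_hz": 176.0, "mean_intensity": 0.085},
    {"mean_pitch_hz": 185.0, "mean_intensity": 0.090},
    {"mean_pitch_hz": 300.0, "mean_intensity": 0.200},   # outlier instance
]
print(confidence_score(happy_samples))   # -> 0.5 (3 of 6 pairs agree)
```

The fraction-of-matching-pairs rule is only one possible aggregation; any monotone measure of agreement could play the role of the confidence score described above.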
[0043] Once the system 102 learns to convert the text into corresponding audio content, the conversion unit 216 may auto-enable the conversion of the text content into corresponding audio content in the user’s speaking style. However, this feature may be enabled only when the confidence score exceeds a predetermined threshold. It may be understood that although the auto-enable feature of converting text content into audio content is activated based on the confidence score, a choice is still provided to the recipient whether he/she wishes to see the text message shared by the user or to hear the corresponding audio content in the user’s speaking style.
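A minimal sketch of this gating is given below, assuming an illustrative threshold value and a recipient-side preference flag, neither of which is fixed by the disclosure:

```python
# Sketch of the auto-enable decision: conversion is switched on only once the
# confidence score clears a threshold, and the recipient still chooses whether
# to read the text or hear the audio.

from typing import Callable

CONFIDENCE_THRESHOLD = 0.8   # assumed value; the disclosure only requires a predetermined threshold

def deliver(message: str,
            score: float,
            recipient_prefers_audio: bool,
            synthesize: Callable[[str], bytes]):
    """Return audio in the user's speaking style when allowed, else the raw text."""
    if score > CONFIDENCE_THRESHOLD and recipient_prefers_audio:
        return synthesize(message)
    return message

# Example with a dummy synthesizer standing in for the trained voice model.
print(deliver("Running late, sorry!", 0.9, True, lambda m: f"<audio:{m}>".encode()))
```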
[0044] Referring back to Figure 1B in conjunction with Figure 2, Figure 1B describes the real-time application of the trained voice model for conversion of text content shared by the user on the messaging platform into corresponding audio content in the user’s speaking style for the recipient. Once the system 102 has been trained, or trained audio attributes have been generated, the system 102 may operate in real time to convert a text message into corresponding audio content for different emotional states of the user during the conversation.
[0045] The monitoring unit 214 monitors the current conversation between a user and a recipient to capture conversation data. Considering the exemplary embodiment of Figure 1A, in this embodiment SAM and Johana are conversing with each other, and the monitoring unit 214 captures the conversation data, which is in the form of text messages. After capturing the conversation data, the application unit 218 applies the pretrained voice model (after being trained, as described with reference to Figure 1A) upon the conversation data while the user is in conversation with the recipient. The application unit 218 applies a Natural language processing (NLP) technique upon the user messages to determine a set of text content indicating a set of emotional states of the user. Based on the audio attributes extracted from the audio snippets, the system is now trained in respect of audio content generation for each emotional state of the user, as presented in the exemplary embodiment for the user SAM. The emotional state of SAM is also determined based on the application of NLP over the conversation data (e.g., SAM’s emotional state was happy in the first exemplary instance and angry in another exemplary instance). Basically, the NLP technique may determine the context of each instance during the conversation, and the application unit 218 may use this context in determining the emotional state of the user. For example, words such as “ohhh..”, “Alas!”, and “lost” in an expression may indicate the emotional state “Sad”, while words such as “Yahoooo!”, “Whoa”, “yippy”, “enjoy”, and “found” may indicate the emotional state “Happy”. As discussed above, the pre-trained voice model generated for SAM is shown using a voice training model table at the right-hand side of Figure 1A. The pre-learnt voice may be stored in the memory 208. After knowing the emotional state of the user, the conversion unit 216 may convert at least one of the sets of text content into corresponding voice content expressing the emotional state of the user in the speaking style of the user.
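Continuing the SAM example, and purely as an assumed illustration of how the pre-learnt voice stored in the memory 208 might be represented and looked up (the table values, the cue matching, and the function names are invented for this sketch), the application unit 218 could map the NLP-determined emotional state to the corresponding trained attributes before handing them to the conversion unit 216:

```python
# Sketch: keyword cues -> emotional state -> per-emotion trained attributes
# (a stand-in for the voice training model table of Figure 1A stored in memory 208).

CUES_TO_STATE = {                       # cue words named in the description
    "ohhh": "Sad", "alas": "Sad", "lost": "Sad",
    "yahoooo": "Happy", "whoa": "Happy", "yippy": "Happy",
    "enjoy": "Happy", "found": "Happy",
}

SAM_VOICE_TABLE = {                     # illustrative per-emotion attributes for SAM
    "Happy":     {"pitch": "high",   "intensity": "medium", "pause": "short"},
    "Sad":       {"pitch": "higher", "intensity": "low",    "pause": "long"},
    "Angry":     {"pitch": "lower",  "intensity": "high",   "pause": "short"},
    "Surprised": {"pitch": "high",   "intensity": "high",   "pause": "medium"},
}

def emotional_state(message: str) -> str:
    tokens = message.lower().replace("!", " ").replace(",", " ").split()
    for token in tokens:
        for cue, state in CUES_TO_STATE.items():
            if token.startswith(cue):   # so "enjoying" matches the cue "enjoy"
                return state
    return "Neutral"

def attributes_for(message: str) -> dict:
    """Attributes the conversion unit 216 would apply when synthesizing audio."""
    return SAM_VOICE_TABLE.get(emotional_state(message), {})

print(attributes_for("I am still enjoying bachelor life"))   # SAM's "Happy" attributes
```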
[0046] Now referring back to Figure 1A, it can be observed that a user “SAM” is conversing with other users “Johana”, “Ruby” and “Sandy”. At the right-hand side of Figure 1A, a table is shown which indicates the emotional state of the user “SAM” as happy, angry, surprised or sad. However, it may be understood by a skilled person that the present disclosure may be implemented for various other types of emotional states of the user not shown in Figure 1A. The application unit 218 considers the emotional state of the user based on the conversation data and applies the audio attributes corresponding to that emotional state of the user for conversion of the text content into corresponding audio content, so that the system can closely imitate the user’s speaking style to express the emotional state of the user.
[0047] The system 102 thus provides an efficient way of converting text content into corresponding audio content to express the emotional state of the user in the user’s speaking style.
[0048] Figure 3 depicts a method 300 for generating a trained voice model for expressing the emotional state of the user in the user’s speaking style, in accordance with an embodiment of the present disclosure. As illustrated in Figure 3, the method 300 includes one or more blocks illustrating a method for generating the trained voice model. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform specific functions or implement specific abstract data types.
[0049] The order in which the method 300 is described is not intended to be construed as a
limitation, and any number of the described method blocks can be combined in any order
to implement the method. Additionally, individual blocks may be deleted from the methods
without departing from the spirit and scope of the subject matter described.
[0050] At block 302, the method 300 may include registering a plurality of audio snippets
corresponding to the user during a plurality of instances, wherein each audio snippet
indicates a speaking style of the user while naturally expressing an emotion.
[0051] At block 304, the method 300 may include extracting a plurality of audio attributes
from each of the plurality of audio snippets during each of the plurality of instances.
[0052] At block 306, the method 300 may include comparing the plurality of audio
attributes extracted in one instance with another plurality of audio attributes extracted
during another instance to determine a similarity amongst the plurality of audio attributes
and another plurality of audio attributes.
[0053] At block 308, the method 300 may include generating a confidence score based on the comparison, wherein the confidence score indicates the ability of the pretrained voice model to convert user messages into corresponding voice content in the user’s speaking style in real time while the user is in conversation with the recipient.
[0054] At block 310, the method 300 may include auto-enabling the conversion of the text content into corresponding audio content in the user’s speaking style when the confidence score exceeds a predetermined threshold.
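Tying blocks 302 to 310 together, a compact end-to-end sketch of the training flow is given below; the tolerance, the threshold, and the hard-coded attribute values are stand-ins invented for this illustration rather than values required by the method:

```python
# End-to-end sketch of method 300: register snippets per emotion (302), take their
# extracted attributes (304), compare across instances (306), score (308), and
# auto-enable conversion when the score clears the threshold (310).

def train_voice_model(attributes_by_emotion: dict, threshold: float = 0.8) -> dict:
    model = {}
    for emotion, attribute_sets in attributes_by_emotion.items():
        pairs = [(a, b) for i, a in enumerate(attribute_sets)
                 for b in attribute_sets[i + 1:]]
        agree = sum(all(abs(a[k] - b[k]) <= 0.15 * max(abs(b[k]), 1e-9) for k in a)
                    for a, b in pairs)
        score = agree / len(pairs) if pairs else 0.0
        model[emotion] = {
            "profile": {k: sum(s[k] for s in attribute_sets) / len(attribute_sets)
                        for k in attribute_sets[0]},      # averaged speaking-style profile
            "confidence": score,
            "auto_enabled": score > threshold,            # block 310
        }
    return model

model = train_voice_model({
    "happy": [{"pitch": 182.0}, {"pitch": 178.0}, {"pitch": 185.0}, {"pitch": 180.0}],
    "sad":   [{"pitch": 120.0}, {"pitch": 260.0}],
})
print(model["happy"]["auto_enabled"])   # True: the four instances agree closely
print(model["sad"]["auto_enabled"])     # False: too little agreement so far
```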
[0055] Now, Figure 4 shows a method 400 for converting text content into audio content expressing the user’s emotional state, in the way he/she naturally expresses emotions, in real time using the pre-trained voice model. The method 400 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform specific functions or implement specific abstract data types.
[0056] The order in which the method 400 is described is not intended to be construed as a
limitation, and any number of the described method blocks can be combined in any order
to implement the method. Additionally, individual blocks may be deleted from the methods
without departing from the spirit and scope of the subject matter described.
[0057] At block 402, the method 400 may include monitoring current conversation between
a user and a recipient to capture conversation data. The conversation data comprises user
messages and recipient messages in a textual format.
[0058] At block 404, the method 400 may include applying a pretrained voice model upon
the conversation data while the user is in conversation with the recipient.
[0059] At block 406, the method 400 may include applying a Natural language processing (NLP) technique upon the conversation data to determine a set of text content indicating a set of emotional states of the user.
[0060] At block 408, the method 400 may include converting at least one of the sets of text
content into corresponding voice content expressing the emotional state of the user in the
speaking style of the user.
[0061] A description of an embodiment with several components in communication with
each other does not imply that all such components are required. On the contrary, a variety
of optional components are described to illustrate the wide variety of possible embodiments
of the invention.
[0062] When a single device or article is described herein, it will be clear that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be clear that a single device/article may be used in place of the more than one device or article, or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.
[0063] Finally, the language used in the specification has been principally selected for
readability and instructional purposes, and it may not have been selected to delineate or
circumscribe the inventive subject matter. It is therefore intended that the scope of the
invention be limited not by this detailed description, but rather by any claims that issue on
an application based hereon. Accordingly, the embodiments of the present invention are
intended to be illustrative, but not limiting, of the scope of the invention, which is set forth
in the following claims.
[0064] While various aspects and embodiments have been disclosed herein, other aspects
and embodiments will be apparent to those skilled in the art. The various aspects and
embodiments disclosed herein are for purposes of illustration and are not intended to be
limiting, with the true scope and spirit being indicated by the following claims.
[0065] The illustrated steps are set out to explain the exemplary embodiments shown, and
it should be anticipated that ongoing technological development will change the manner in
which particular functions are performed. These examples are presented herein for purposes
of illustration, and not limitation. Further, the boundaries of the functional building blocks
have been arbitrarily defined herein for the convenience of the description. Alternative
boundaries can be defined so long as the specified functions and relationships thereof are
appropriately performed. Alternatives (including equivalents, extensions, variations,
deviations, etc., of those described herein) will be apparent to persons skilled in the relevant
art(s) based on the teachings contained herein. Such alternatives fall within the scope and
spirit of the disclosed embodiments. It must also be noted that as used herein and in the
appended claims, the singular forms “a,” “an,” and “the” include plural references unless
the context clearly dictates otherwise.
[0066] Advantages of the embodiments of the present disclosure are illustrated herein:
1. Correctly presenting the emotional state of the user in the user’s speaking style while converting text content into corresponding audio content during a conversation.
2. Providing a personalized experience to recipients while conversing with the user using messaging platforms.
We Claim:
1. A method of facilitating text to audio conversation between users on a messaging platform,
the method comprising:
monitoring current conversation between a user and a recipient to capture
conversation data, wherein conversation data comprises user messages and recipient
messages in a textual format;
applying a pretrained voice model upon the conversation data while the user is in
conversation with the recipient, wherein applying the pretrained voice model comprises:
applying a Natural language processing (NLP) technique upon the user
messages to determine a set of text content indicating a set of emotional state of the
user; and
converting at least one of the sets of text content into corresponding voice
content expressing the emotional state of the user in the speaking style of the user.
2. The method as claimed in claim 1, wherein the pre-trained voice model is generated by:
registering a plurality of audio snippets corresponding to the user during a plurality of
instances, wherein each audio snippet indicates a speaking style of the user while naturally
expressing an emotion;
extracting a plurality of audio attributes from each of the plurality of audio snippets
during each of the plurality of instances;
comparing the plurality of audio attributes extracted in one instance with another
plurality of audio attributes extracted during another instance to determine a similarity
amongst the plurality of audio attributes and another plurality of audio attributes; and
generating a confidence score based on comparison, wherein the confidence score
indicates ability of the pretrained voice model to convert user messages into corresponding
voice content in the user’s speaking style in a real-time while the user is in conversation
with the recipient.
3. The method as claimed in claim 2, wherein the plurality of audio attributes extracted from
the plurality of audio snippets comprises at least one of a voice tone, a pitch, an accent, a
pronunciation style, a stress level and a pause interval.
4. The method as claimed in claim 1, wherein the pretrained voice model learns and updates
the confidence score based on comparing the plurality of audio attributes extracted in one
instance with another plurality of audio attributes.
5. The method as claimed in claim 1, further comprising:
auto-enabling the conversion of the text content in to corresponding audio content
in the user’s speaking style when the confidence score exceeds over a predetermined
threshold.
6. A system (202) for facilitating text to audio conversation between users on a messaging
platform, the system comprising:
monitoring unit configured to monitor current conversation between a user and a
recipient to capture conversation data, wherein conversation data comprises user messages
and recipient messages in a textual format;
application unit configured to apply a pretrained voice model upon the conversation
data while the user is in conversation with the recipient, wherein the application unit apply
the pretrained voice model by:
applying a Natural language processing (NLP) technique upon the user
messages to determine a set of text content indicating a set of emotional state of the
user; and
converting at least one of the sets of text content into corresponding voice
content expressing the emotional state of the user in the speaking style of the user.
7. The system as claimed in claim 6, wherein a generation unit configured to generate the pretrained voice model by:
registering a plurality of audio snippets corresponding to the user during a plurality of
instances, wherein each audio snippet indicates a speaking style of the user while naturally
expressing an emotion;
extracting a plurality of audio attributes from each of the plurality of audio snippets
during each of the plurality of instances;
comparing the plurality of audio attributes extracted in one instance with another
plurality of audio attributes extracted during another instance to determine a similarity
amongst the plurality of audio attributes and another plurality of audio attributes; and
generating a confidence score based on comparison, wherein the confidence score
indicates ability of the pretrained voice model to convert user messages into corresponding
voice content in the user’s speaking style in a real-time while the user is in conversation
with the recipient.
8. The system as claimed in claim 7, wherein the plurality of audio attributes extracted from
the plurality of audio snippets comprises at least one of a voice tone, a pitch, an accent, a
pronunciation style, a stress level and a pause interval.
9. The system as claimed in claim 6, wherein an updation unit is configured to update the
confidence score in the pretrained voice model based on comparing the plurality of audio
attributes extracted in one instance with another plurality of audio attributes.
10. The system as claimed in claim 6, wherein a conversion unit configured to:
auto-enable the conversion of the text content in to corresponding audio content in
the user’s speaking style when the confidence score exceeds over a predetermined
threshold.