Abstract: A method and electronic device for classifying and selectively rendering an audio and a video content from a multimedia are provided. The method includes identifying the audio content from the multimedia, splitting the audio content into multiple frames, extracting a set of features associated with each of the multiple frames, analyzing the frames based on the set of features and a predefined threshold, classifying the frames based on the analyzing, and rendering the audio content based on the classification. The electronic device includes an identification module for identifying the audio content from the multimedia, an extracting module for extracting the set of features associated with each of the multiple frames, a memory for storing the multiple frames and the predefined threshold, and a processor for analyzing and classifying the frames based on the set of features and a predefined threshold.
METHOD AND SYSTEM FOR CLASSIFYING AND SELECTIVELY RENDERING AUDIO AND VIDEO CONTENT FROM MULTIMEDIA
FIELD
[0001] The present disclosure generally relates to the field of multimedia communications, and more particularly to classifying and selectively rendering an audio and a video content from a multimedia.
BACKGROUND
[0002] The audio content may be divided into segments. Each segment is further divided into multiple frames. In the current scenario, the frames identifying a segment boundary are very few in number, due to which the boundary for a segment may not be identified. In the current scenario, a k-Nearest Neighbour (k-NN) algorithm is used for classifying the audio content. The k-NN algorithm can be degraded by the presence of noise and one or more irrelevant features in the audio content. Hence, the k-NN algorithm may not produce clear audio content.
[0003] In the current scenario, an audio content is classified into pure speech and non speech based on a predefined threshold. The existing methods of classifying audio content may lack accuracy. Further, in the existing scenario, the process of extracting a set of features associated with each of the multiple frames may result in a loss of information from the audio content.
[0004] In light of the foregoing discussion, there is a need for an efficient technique for classifying and selectively rendering an audio and a video content from a multimedia.
SUMMARY
[0005] Embodiments of the present disclosure described herein provide a method and an electronic device for classifying and selectively rendering an audio and a video content from a multimedia.
[0006] An example of a method for classifying and selectively rendering an audio content from a multimedia includes identifying the audio content from the multimedia. The method also includes splitting the audio content into multiple frames and extracting a set of features associated with each of the multiple frames. Further, the method includes analyzing the frames based on the set of features and a predefined threshold. Furthermore, the method includes, classifying the frames based on the analyzing, and rendering the audio content based on the classification.
[0007] An example of an electronic device for classifying and selectively rendering an audio content from a multimedia includes an identification module for identifying the audio content from the multimedia. The electronic device also includes an extracting module for extracting the set of features associated with each of the multiple frames. The electronic device further includes a memory for storing the multiple frames and the predefined threshold. Furthermore, the electronic device includes a processor for analyzing and classifying the frames based on the set of features and a predefined threshold.
BRIEF DESCRIPTION OF FIGURES
[0008] In the accompanying figures, similar reference numerals may refer to identical or functionally similar elements. These reference numerals are used in the detailed
description to illustrate various embodiments and to explain various aspects and advantages of the present disclosure.
[0009] FIG. 1 is a block diagram of an electronic device for classifying and selectively rendering an audio and a video content from a multimedia, in accordance with one embodiment;
[0010] FIG. 2 is a flow chart illustrating a method for classifying and selectively rendering an audio and a video content from a multimedia, in accordance with one embodiment;
[0011] FIG. 3 is a flow diagram illustrating a process for classifying and selectively rendering an audio content from a multimedia, in accordance with another embodiment;
[0012] FIG. 4 is a flow diagram illustrating a process for extracting a set of features associated with each of the multiple frames for training Gaussian mixture models, in accordance with one embodiment; and
[0013] FIG. 5 is a flow diagram illustrating a process for classifying the frames based on the set of features, in accordance with one embodiment.
[0014] Persons skilled in the art will appreciate that elements in the figures are illustrated for simplicity and clarity and may not have been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of various embodiments of the present disclosure.
DETAILED DESCRIPTION
[0015] It should be observed that method steps and system components have been represented by conventional symbols in the figures, showing only specific details that are relevant for an understanding of the present disclosure. Further, details that may be readily apparent to a person ordinarily skilled in the art may not have been disclosed. In the present disclosure, relational terms such as primary and secondary, first and second, and the like, may be used to distinguish one entity from another entity, without necessarily implying any actual relationship or order between such entities.
[0016] FIG. 1 is a block diagram of an electronic device 105 for classifying and selectively rendering an audio and a video content from a multimedia. An example of the electronic device 105 includes, but is not limited to, a television, a desktop computer, a server, a camera, a laptop, and a handheld device.
[0017] The electronic device 105 includes a bus 110 or other communication mechanism for communicating information. The electronic device 105 also includes a processor 115 coupled with the bus 110. The processor 115 can include an integrated electronic circuit for processing and controlling functionalities of the electronic device 105.
[0018] The processor 115 of the electronic device 105 is used for analyzing and classifying the frames based on the set of features and a predefined threshold. The electronic device 105 also includes a memory 120, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 110 for storing information which can be used by the processor 115. The memory 120 can be used for storing the multiple frames and the predefined threshold. The electronic device 105 further includes a read only memory (ROM) 125 or other static storage device coupled to the bus 110 for
storing static information for the processor 115. A storage unit 130, such as a magnetic disk or optical disk, is provided and coupled to the bus 110 for storing information.
[0019] The electronic device 105 can be coupled via the bus 110 to a display 135, such as a cathode ray tube (CRT), a liquid crystal display (LCD) or a light emitting diode (LED) display, for displaying information. An input device 140, including alphanumeric and other keys, is coupled to the bus 110 for communicating an input to the processor 115. The input device 140 can be included in the electronic device 105. Another type of user input device is a cursor control 145, such as a mouse, a trackball, or cursor direction keys, for communicating the input to the processor 115 and for controlling cursor movement on the display 135. The input device 140 can also be included in the display 135, for example, a touch screen.
[0020] Various embodiments are related to the use of the electronic device 105 for implementing the techniques described herein. In one embodiment, the techniques are performed by the processor 115 using information included in the memory 120. The information can be read into the memory 120 from another machine-readable medium, such as the storage unit 130.
[0021] The term “machine-readable medium” as used herein refers to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the electronic device 105, various machine-readable media are involved, for example, in providing information to the processor 115. The machine-readable medium can be a storage medium. Storage media include both non-volatile media and volatile media. Non-volatile media include, for example, optical or magnetic disks, such as the storage unit 130. Volatile media include dynamic memory, such as the memory 120. All such media must be tangible to enable the information carried by the media to be detected by a physical mechanism that reads the information into a machine.
[0022] Common forms of machine-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a CD-ROM or any other optical medium, punch cards, paper tape or any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, or any other memory chip or cartridge.
[0023] In another embodiment, the machine-readable medium can be a transmission medium, including coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 110. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
[0024] The electronic device 105 also includes a communication interface 150 coupled to the bus 110. The communication interface 150 provides a two-way data communication coupling to a network 155. Examples of the network 155 include, but are not limited to, a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), the Internet and a Small Area Network (SAN).
[0025] For example, the communication interface 150 can be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented. In any such implementation, the communication interface 150 sends
and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information. The communication interface 150 can be a universal serial bus port.
[0026] In some embodiments, the electronic device 105 can be connected to the storage device 160 for storing or fetching information. Examples of the storage device 160 include, but are not limited to, a flash drive, a pen drive, a hard disk or any other storage media.
[0027] The electronic device 105 also includes an identification module 165 and an extracting module 170. The identification module 165 is used for identifying an audio content from the multimedia. The extracting module 170 is used for extracting a set of features associated with each of the multiple frames.
[0028] The electronic device 105 receives a multimedia from the network 155. The identification module 165 identifies the audio content from the multimedia. The identified audio content is then split into multiple frames. Each frame is associated with a set of features. The set of features associated with each of the multiple frames is extracted by the extracting module 170. The processor 115 analyzes the frames based on the set of features and a predefined threshold. The frames are further classified into one of a pure speech, a pure music, a silence, an environmental noise, and a mixture of speech and music based on the analysis. The classified audio content is then rendered to a user through a speaker. In some embodiments, the audio content may be embedded in a video file. In such a case, the classified contents are simultaneously rendered through the display 135 and the speaker.
[0029] In some embodiments, the electronic device 105 can receive a multimedia from the storage device 160 or the memory 120.
[0030] The first set of features used for testing the audio segment for silence and environmental noise are the short time energy, the short time spectral entropy and the autocorrelation peak volume of the audio frame.
[0031] The second set of features used for analyzing the multiple frames are a normalized root mean square amplitude, a normalized root mean square variance, a low short-time energy ratio, a variance of log energy and differential log energy, a minimum value of root mean square amplitude, a variance of spectral entropy computed from the frequency range of 500 Hz to 3 kHz, and a variance of the first five Mel frequency cepstral coefficients excluding the first coefficient.
[0032] The third set of features used for classifying the audio content are a normalized zero crossing rate, a skewness of the zero-crossing rate ratio, a variance of pitch, a range of the zero crossing rate, a variance of spectral roll-off, a variance of differential spectral roll-off log energy, a variance of the 4th to 8th differential Mel frequency cepstral coefficients, and a combination thereof.
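By way of a non-limiting illustration, the following sketch shows how a few of the per-frame features named above might be computed with NumPy. The disclosure does not give exact formulas for most features, so common textbook definitions are assumed; the function name and the 500 Hz to 3 kHz band selection are taken from the description above.

```python
import numpy as np

def frame_features(frame, sample_rate=10000):
    """Illustrative per-frame features; the definitions here are
    assumptions, not prescribed by this disclosure."""
    n = len(frame)
    # Short time energy: the sum of squares of the signal samples.
    energy = np.sum(frame ** 2)
    # Root mean square amplitude.
    rms = np.sqrt(energy / n)
    # Normalized zero crossing rate.
    zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / (2.0 * n)
    # Short time spectral entropy over the 500 Hz to 3 kHz band.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    band = spectrum[(freqs >= 500.0) & (freqs <= 3000.0)]
    p = band / (np.sum(band) + 1e-12)          # normalize to a distribution
    entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy of the band
    return energy, rms, zcr, entropy
```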
[0033] FIG. 2 is a flow chart illustrating a method for classifying and selectively rendering an audio and a video content from a multimedia in accordance with one embodiment. The method for classifying and selectively rendering an audio content from a multimedia starts at step 205.
[0034] At step 210, the audio content is identified from the multimedia. For example, the multimedia can be a video clip played in an electronic device. The audio content is identified from the video clip played in the electronic device.
[0035] At step 215, the identified audio content from the multimedia is partitioned into segments. Each audio segment is further split into frames with a delay. Each of the frames is associated with a set of features. The audio content is segmented based on a Gaussian computing likelihood model.
[0036] In one embodiment, the user of the electronic device provides the value of the delay. In another embodiment, the electronic device provides a predefined delay value.
[0037] At step 220, a first set of features associated with each of the segments is extracted. The segments are tested for silence and environmental noise based on a predefined threshold and the first set of features. The first set of features used for testing the audio segment for silence and environmental noise are the short time energy, the short time spectral entropy and the autocorrelation peak volume of the audio frame.
[0038] If, after the testing, the audio content does not include silence or environmental noise, then a second set of features associated with each of the frames is extracted.
[0039] At step 225, the segments are analyzed based on another set of features and the predefined threshold. The features extracted by the extracting module are analyzed by sending them through a plurality of Gaussian mixture models for checking the set of features.
[0040] The second set of features used for analyzing the multiple frames are a normalized root mean square amplitude, normalized root mean square variance, a low
short-time energy ratio, variance of log energy, and differential log energy, minimum value of root mean square amplitude, variance of spectral entropy computed from the frequency range of 500 Hz to 3 kHz, a variance of first five Mel frequency cepstral coefficients excluding the first coefficient.
[0041] The third set of features used for classifying the audio content are a normalized zero crossing rate, a skewness of the zero-crossing rate, a variance of pitch, a range of the zero crossing rate, a variance of spectral roll-off, a variance of differential spectral roll-off, a variance of the 4th to 8th differential Mel frequency cepstral coefficients, and a combination thereof. The analyzed frames are then sent for classification.
[0042] At step 230, the frames are classified based on the analysis. The frames are classified into a pure speech, a pure music and a mixture of speech and music.
[0043] At step 235, the audio content is rendered to the user of the electronic device based on the classification.
[0044] The method stops at step 240.
[0045] FIG. 3 is a flow diagram illustrating a process for classifying and selectively rendering an audio content from a multimedia, in accordance with another embodiment.
[0046] At steps 305 and 310, an audio content identified from the multimedia is received as an input. The audio content is divided into one or more segments, with a fixed duration between two segments.
[0047] At step 315, each segment of the audio content is tested for silence based on a predefined threshold value and a set of features. The feature used for testing the silence regions in the audio content is the short time energy, defined as the sum of squares of the signal samples. The silence regions are identified from the segments by using the measure of the short time energy.
[0048] At step 320, a segment of the audio content is classified as silence if the measured short time energy value of the segment is less than the predefined threshold value. If the short time energy value of the segment is greater than the predefined threshold value, then the segment is classified as non silence. The algorithm proceeds to environmental noise detection if the segment is identified as non silence.
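A minimal sketch of the silence test in steps 315 and 320, assuming the segment is a NumPy array of samples and that the threshold value is a tuning parameter chosen elsewhere:

```python
import numpy as np

def is_silence(segment, threshold):
    # Short time energy: the sum of squares of the signal samples.
    short_time_energy = np.sum(segment ** 2)
    # Classified as silence when the energy falls below the
    # predefined threshold (step 320).
    return short_time_energy < threshold
```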
[0049] At step 325, each segment of the audio content is tested for environmental noise based on a predefined threshold value and a set of features. Each of the segments of the audio content is divided into multiple non-overlapping frames. The features used for testing the environmental noise in the audio content are the short time entropy and the autocorrelation peak values. The environmental noise can be detected from the audio content by computing an environmental feature value using the following formula:
E = √|Es × Am| + 1

where:
Es = the short time entropy,
Am = the first major peak value from the zero-lag position of the obtained autocorrelation sequence, and
E = the environmental feature value.
[0050] At step 330, a segment of the audio content is classified as environmental noise if the measured environmental feature value of the segment is less than the predefined threshold value. If the environmental feature value of the segment is greater than the predefined threshold value, then the segment is classified as non environmental noise.
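The environmental noise test of steps 325 and 330 may be sketched as follows. The peak-picking rule (the largest autocorrelation value away from zero lag) and the entropy computation are assumptions, since the disclosure gives only the combining formula:

```python
import numpy as np

def environmental_feature(frame):
    # Short time entropy Es, computed here from the power spectrum.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (np.sum(spectrum) + 1e-12)
    es = -np.sum(p * np.log2(p + 1e-12))
    # Autocorrelation sequence; keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    am = np.max(ac[1:])  # first major peak value away from the zero lag
    # E = sqrt(|Es * Am|) + 1, as in the formula above.
    return np.sqrt(np.abs(es * am)) + 1.0

# A segment is treated as environmental noise when its E value is
# below the predefined threshold (step 330).
```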
[0051] At step 335, if the segments of the audio content are detected as non silence and non environmental noise, then the features associated with each of the frames are extracted.
[0052] At step 340, the extracted features associated with each of the frames are then analyzed by subjecting them to a plurality of Gaussian mixture models. The features extracted by the extracting module are analyzed by sending them through a first Gaussian mixture model for checking the set of features.
[0053] The second set of features used for analyzing the frames are normalized root mean square amplitude, normalized root mean square variance, a low short-time energy ratio, variance of log energy and differential log energy, a minimum value of root mean square amplitude, variance of spectral entropy computed from the frequency range of 500 Hz to 3 kHz, a variance of first five Mel frequency cepstral coefficients excluding the first coefficient. A first Gaussian mixture model is used for classifying the audio content into pure speech and non speech. The non speech includes the pure music and a mixture of speech and music.
[0054] At step 345, the audio content is classified as pure speech by the first Gaussian mixture model.
[0055] At step 350, the audio content is classified as non speech by the first Gaussian mixture model. The non speech can be further classified into a pure music and a mixture of speech and music.
[0056] At step 355, the third set of features associated with each of the multiple frames is extracted for pure music and the mixture of speech and music. The third set of features used for classifying the audio content are a normalized zero crossing rate, a skewness of the zero-crossing rate ratio, a variance of pitch, a range of the zero crossing rate, a variance of spectral roll-off, a variance of differential spectral roll-off log energy, a variance of the 4th to 8th differential Mel frequency cepstral coefficients, and a combination thereof.
[0057] At step 360, the extracted features associated with each of the multiple frames are then analyzed by sending them through a second Gaussian mixture model.
[0058] At step 365, the audio content is classified as pure music by the second Gaussian mixture model.
[0059] At step 370, the audio content is classified as the mixture of speech and music by the second Gaussian mixture model.
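The two-stage cascade of steps 340 through 370 may be sketched as below, assuming four pre-trained scikit-learn Gaussian mixture models; all model and array names here are hypothetical:

```python
from sklearn.mixture import GaussianMixture  # type of the assumed pre-trained models

def classify_segment(second_set, third_set,
                     speech_gmm, nonspeech_gmm, music_gmm, mix_gmm):
    """second_set / third_set: (n_frames, n_features) arrays holding
    the second and third feature sets respectively."""
    # Stage 1: pure speech versus non speech on the second feature set.
    if speech_gmm.score(second_set) > nonspeech_gmm.score(second_set):
        return "pure speech"
    # Stage 2: pure music versus the mixture on the third feature set.
    if music_gmm.score(third_set) > mix_gmm.score(third_set):
        return "pure music"
    return "mixture of speech and music"
```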
[0060] Consider an example of a movie stored in a mobile phone. The movie includes a song. The audio content of the entire movie is identified and split into multiple frames. The set of features is extracted from the frames. Based on the features of the frames and the predefined threshold, the audio contents are analyzed by multiple Gaussian mixture models. The multiple frames are then classified into one of a pure speech, a pure music, a silence, an environmental noise, and a mixture of speech and music. If the user is
interested in extracting and watching the song from the movie, he can select the song from the classification. The user is rendered the required content without manually searching for the song.
[0061] FIG. 4 is a flow diagram illustrating a process for extracting a set of features associated with each of the multiple frames, in accordance with one embodiment.
[0062] At step 405, an audio segment is received as an input. An unwanted DC component is removed from the audio segment by subtracting its average value.
[0063] At steps 410 and 415, the audio segment without the DC component is re-sampled to a sampling rate, for example, a sampling rate equal to 10 kHz. Each of the audio segments is further divided into multiple frames with a duration between frames, for example, a duration equal to 20 ms. A frame count for the multiple frames is initialized to one.
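Steps 405 to 415 may be sketched as follows, using SciPy for re-sampling; the 10 kHz rate and the 20 ms frame duration are the example values given above:

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(segment, orig_rate, target_rate=10000, frame_ms=20):
    # Remove the unwanted DC component by subtracting the average value
    # (step 405).
    segment = segment - np.mean(segment)
    # Re-sample to the target sampling rate (step 410).
    resampled = resample_poly(segment, target_rate, orig_rate)
    # Split into non-overlapping frames of frame_ms duration (step 415).
    frame_len = int(target_rate * frame_ms / 1000)
    n_frames = len(resampled) // frame_len
    return resampled[:n_frames * frame_len].reshape(n_frames, frame_len)
```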
[0064] At step 420, a set of features associated with each of the multiple frames for pure speech and non speech is extracted. The set of features are a normalized root mean square amplitude, a normalized root mean square variance, a low short-time energy ratio, a variance of log energy and differential log energy, a minimum value of root mean square amplitude, a variance of spectral entropy computed from the frequency range of 500 Hz to 3 kHz, and a variance of the first five Mel frequency cepstral coefficients excluding the first coefficient.
[0065] In one embodiment, the set of features associated with each of the multiple frames is extracted using signal processing techniques.
[0066] At step 425, if the frame count is equal to the number of frames per segment, then the set of features associated with each of the multiple frames of the segment is computed at step 435.
[0067] If the frame count is not equal to the number of frames per segment, then the frame count is incremented at step 430 and the process returns to step 420.
[0068] At step 440, the plurality of Gaussian mixture models is trained by using the set of features extracted from each of the multiple frames of the pure speech and non speech audio content.
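Training one Gaussian mixture model per class, as in step 440, may be sketched as follows with scikit-learn; the component count is an assumed hyperparameter that the disclosure does not specify:

```python
from sklearn.mixture import GaussianMixture

def train_models(speech_features, nonspeech_features, n_components=8):
    """Each argument is an (n_frames, n_features) array of the per-frame
    feature vectors extracted at step 420."""
    speech_gmm = GaussianMixture(n_components=n_components).fit(speech_features)
    nonspeech_gmm = GaussianMixture(n_components=n_components).fit(nonspeech_features)
    return speech_gmm, nonspeech_gmm
```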
[0069] FIG. 5 is a flow diagram illustrating a process for classifying the frames based on the set of features, in accordance with one embodiment.
[0070] At step 505, an audio segment is received as an input. The unwanted DC component is removed from the audio segment by subtracting its average value.
[0071] At step 510, a mean subtracted segment is re-sampled to a particular sampling rate, for example, a sampling rate equal to 16 kHz. Each of the audio segments is further split into multiple frames with a duration between the frames.
[0072] For example, the audio content is divided into one or more audio segments of 1 sec duration. Each of the audio segments is further divided into frames of 20 ms duration.
[0073] At step 515, the set of features associated with each of the audio segments are computed.
[0074] At step 520, the audio content is segmented based on a Gaussian computing likelihood model.
[0075] At step 525, if one or more model parameters are equal to the total number of classes for speech and non speech, then the audio segment is classified into one of a pure speech, a pure music, a silence, an environmental noise, and a mixture of speech and music.
[0076] If the model parameters are not equal to the total number of classes for speech and non speech, then the model parameters are incremented. The incremented model parameters are fed to the Gaussian computing likelihood model, and the process of classifying the audio segment into one of a pure speech, a pure music, a silence, an environmental noise, and a mixture of speech and music repeats.
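The loop over model parameters described above amounts to scoring the segment under the model of each class and choosing the most likely class; a minimal sketch, assuming a dictionary of fitted per-class scikit-learn Gaussian mixture models (the dictionary and its labels are hypothetical):

```python
def select_class(segment_features, class_models):
    """class_models: assumed dict mapping a class label to a fitted
    GaussianMixture; segment_features: (n_frames, n_features) array."""
    labels = ["pure speech", "pure music", "silence",
              "environmental noise", "mixture of speech and music"]
    # Average per-frame log-likelihood of the segment under each model.
    scores = {lbl: class_models[lbl].score(segment_features)
              for lbl in labels if lbl in class_models}
    # The segment is assigned the class with the highest likelihood.
    return max(scores, key=scores.get)
```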
[0077] At step 530, the decision of classifying the audio segment into one of the pure speech, pure music, silence, environmental noise, and the mixture of speech and music is taken based on the analysis of the frames.
[0078] The user spends less time identifying the desired multimedia to be played, based on the classification of the audio segment into one of the pure speech, pure music, silence, environmental noise, and the mixture of speech and music. The audio content separated from the multimedia helps the user in identifying the desired multimedia to be played.
[0079] In the preceding specification, the present disclosure and its advantages have been described with reference to specific embodiments. However, it will be apparent to a person of ordinary skill in the art that various modifications and changes can be made,
without departing from the scope of the present disclosure, as set forth in the claims below. Accordingly, the specification and figures are to be regarded as illustrative examples of the present disclosure, rather than in a restrictive sense. All such possible modifications are intended to be included within the scope of the present disclosure.
I/We claim:
1. A method for classifying and selectively rendering an audio and a video content from a multimedia, the method comprising:
identifying the audio content from the multimedia;
splitting the audio content into multiple frames;
extracting a set of features associated with each of the multiple frames;
analyzing the frames based on the set of features and a predefined threshold;
classifying the frames based on the analyzing; and
rendering the audio content based on the classification.
2. The method of claim 1, wherein the set of features for analyzing the frames are a short time energy, a short time spectral entropy, an autocorrelation peak volume, a skewness of zero-crossing rate ratio, a range of zero crossing ratio, a variance of spectral entropy, a variance of spectral roll-off, a variance of differential log energy, a variance of first five Mel frequency cepstral coefficients, a normalized root mean square amplitude, a low short-time energy ratio, a variance of log energy, a variance of fourth to eighth differential Mel frequency cepstral coefficients, a variance of differential zero-crossing rate, and a combination thereof.
3. The method of claim 1, wherein the analyzing comprises:
checking the set of features using a plurality of Gaussian mixture models.
4. The method of claim 1, wherein the classification is based on one of a pure speech, a pure music, a silence, an environmental noise, and a mixture of speech and music.
5. The method of claim 1, wherein the rendering is based on one of the classifications.
6. The method of claim 1, wherein the splitting further comprises:
segmenting the audio content based on a Gaussian computing likelihood model.
7. A system for classifying and selectively rendering an audio and a video content from a multimedia, the system comprising an electronic device, the electronic device comprising:
an identification module for identifying the audio content from the multimedia;
an extracting module for extracting the set of features associated with each of the multiple frames;
a memory for storing the multiple frames and the predefined threshold; and
a processor for analyzing and classifying the frames based on the set of features and a predefined threshold.
8. A system for performing a method, the method as described herein and in accompanying figures.
9. A method for classifying and selectively rendering an audio and a video content from a multimedia in an electronic device, the electronic device as described herein and in accompanying figures.