Abstract: The present invention provides a system for providing web search for multimedia contents containing: i. an image analyzer; ii. an audio analyzer; and iii. a text analyzer; wherein the said system is capable of storing and searching multimedia video, audio and image content without imposing any restriction on the format of the contents; wherein the said system analyzes multimedia contents and annotates the contents by means of the said components, characterized in that the said system contains a video encoder, the output of which is used by a keyframe extractor, and the said image analyzer uses the output of the keyframe extractor.
A NOVEL SYSTEM AND METHOD PROVIDING ADVANCED WEB SEARCH FOR MULTIMEDIA
CONTENTS
FIELD OF THE INVENTION
The present invention relates to a multimedia information retrieval system based on the Semantic Web.
The system consists of three main components: an image analyzer, an audio analyzer and a text analyzer.
The invention analyzes multimedia contents and annotates them automatically by gathering information
from image analysis, audio analysis and text analysis. The system enhances the searchability of content
and provides the user with several options for searching it. The present system works for videos, images
and audio files and enables search across all kinds of multimedia files. The system does not impose any
restriction on the format of the content: videos can be in any format such as MP4, AVI, MPEG or FLV;
audio content can be in MP3 or WMA; and images can likewise be in any format such as PNG or JPEG.
Figure 1 illustrates the system design.
BACKGROUND OF THE INVENTION
The web holds a huge amount of information and the accumulated knowledge of the world, but this
information is of little use unless it is available in a format that makes it searchable and further
usable. In other words, what is needed is processed information, or knowledge. Only if the
information is processed can sense be made of the huge amount of data available; without that, it is
wasted. In the words of Tim Berners-Lee, the inventor of the Web and the Semantic Web, "Semantic Web
is a web of data that can be processed directly and indirectly by machines." In other words, the aim
of the Semantic Web is to make all information available in a format that machines can understand.
The Semantic Web enables the entire information to be represented in a graph-like structure in which
all entities are connected to each other, and the connections between these nodes represent the
relations between the entities. The Hypertext Markup Language (HTML) used to display webpages lacks
this ability. Webpages can be connected to each other through hyperlinks, but these hyperlinks carry
no extra information, i.e. why the two pages are related or what kind of functional relation holds
between them. The Semantic Web makes that a possibility: the entire information may be represented in
RDF (Resource Description Framework) or OWL (Web Ontology Language) format. The use of the Semantic
Web thus turns the entire web into one huge interconnected dataset and allows a machine to make
decisions based on the relations between the different contents.
Multimedia content has its own problem. If a user needs to search for some content, in the
absence of any technique to process the content, the user is left with no option but to scan the
multimedia content manually, which is a time-consuming process. Without a technique or methodology to
make the contents understandable by machines, the collection is no more than a pile of multimedia
files. The entire process needs to be automated, which speeds up the system and removes issues of
bias.
In today's world, searching for text on the World Wide Web has become much easier than before, but
searching for a specific type of multimedia content is still very difficult. Search engines return
pointers to multimedia contents whose user-entered descriptions match the keywords of the search
query. It is the need of the hour to focus on ways to make multimedia content searchable not just
through the description entered by the user, but by combining information from different sources,
such as the video and audio components of the multimedia file, with the existing approaches of
searching through the title of the multimedia file and the user descriptions.
On one side there is data and on the other there are users, and there is a huge disconnect between
the two: systems store information in a manner which makes little or no sense to a user. In a world
where all information is stored in bits and bytes, the system is unable to present results in a
meaningful manner. This causes a 'Semantic Gap', a problem that plagues Multimedia Information
Retrieval (MIR) systems. There is a need to combat this problem and store information in such a
manner that the system is capable of understanding and presenting the information in the manner
needed by the user.
Describing, searching and organizing content manually is a painstaking effort and demands a lot of
time on the part of the administrator involved. It is very difficult to identify the multimedia
content that one wants, and the success of any multimedia retrieval system depends on how efficient
it is in this regard: good efficiency leads to rapid adoption, while the inability to achieve it
leads to the system's downfall. It thus becomes imperative to adopt methods which help in meeting
these targets.
Thus there is a need for a fully automated system which does exactly as described above. To achieve
it, the content needs to be described semantically. These multimedia files can have low-level or
high-level descriptions. Low-level features are those such as color and shape for images, or pitch
and timbre for speech. High-level features have a semantic value associated with what the content
means to humans, such as events or genre classification. It is easier to extract low-level features
from the multimedia content and describe them in the Resource Description Framework (RDF), but the
user queries at the semantic level; low-level descriptions must therefore be mapped to the
high-level query.
The invented Semantic Web based Multimedia Information Retrieval System (SWIRL) is a web application
which takes in a multimedia file, analyzes it and generates an RDF description. This entire process
is automated and needs no manual intervention. The system analyzes the audio, the keyframes and the
description entered by the user at the time of upload. Natural language processing is used to
analyze the text. The extracted keyframes are used in image processing and also serve as a
representation of the video file, thus allowing the contents of the video file to be scanned quickly.
OBJECT OF THE INVENTION
It is an object of the invention to develop a multimedia information retrieval system consisting of
an image analyzer, an audio analyzer and a text analyzer.
It is another object of the invention to provide a system which enables search across all formats of
multimedia files.
It is yet another object of the invention to develop a multimedia information retrieval system based
on Semantic Web technology.
It is another object of the invention to provide a system that can analyze the audio, image and text
components of a multimedia file, in addition to the user-entered description and the multimedia file
title, and based upon that can annotate the multimedia contents automatically, thus generating richer
annotations and enhancing the searchability of the contents.
It is another object of the invention to provide a system enabling several search options to the user
for searching multimedia content files, such as search by text with much enhanced searchability,
search by image, search by video, search by face, and search by audio.
SUMMARY OF THE INVENTION
The above and other objects of the present invention are achieved by developing a system for
providing web search for multimedia contents containing:
i. an image analyzer;
ii. an audio analyzer; and
iii. a text analyzer
wherein the said system is capable of storing multimedia video, audio and image content without
imposing restriction on the format of the contents;
wherein said system analyzes multimedia contents and annotates the contents by means of the said
components characterized in that said system contains a video encoder the output of which is used
for a keyframe extractor and the said image analyzer uses the output of the keyframe extractor.
The present invention provides a system wherein the said image analyzer means comprises optical
character recognition (OCR), object detection, and face detection and recognition; the output of OCR
is given as input to the text analyzer, and the output of the said face detection and recognition and
object detection is used to retrieve information from DBPedia, wherein the output of processing from
DBPedia is added to an RDF file for tagging of the multimedia file, and the said face recognition
enables searching multimedia contents by the faces in their image component.
In the present invention the system generates image fingerprints of the output of the keyframe
extractor to enable searching multimedia contents by their image component and to enable searching
multimedia contents using a video clip.
In another embodiment the present invention provides an audio analyzer means having audio
fingerprinting and speech transcription, the said audio fingerprinting being used to enable searching
multimedia contents by their audio component.
In still another embodiment the present invention provides a system wherein the said text analyzer
takes input from the speech transcription, the OCR, and the information entered by the user; the said
output of the text analyzer is stored in the RDF file representing the annotation information of the
multimedia content.
In yet another embodiment the present invention provides a system wherein said system enables
multimedia search by sending text as search query.
In another embodiment the present invention provides a system wherein said system enables
multimedia search by sending an image as search query.
In still another embodiment the present invention provides a system wherein said system enables
multimedia search by sending an audio clip as search query.
In yet another embodiment the present invention provides a system wherein said system enables
multimedia search by sending the video clip as search query, wherein said system searches the
multimedia content based on the identified faces in the image given as search query.
In another embodiment the present invention provides a method for providing web search for
multimedia contents comprising the steps of:
i. submitting multimedia contents to server along with user description
ii. generating keyframes and image fingerprints from said keyframes
iii. accessing the said keyframes by said optical character recognition, said object
detection and said face detection and recognition;
iv. getting information from DBPedia for the faces detected by said face recognizer
and augmenting the RDF File
v. generating audio fingerprints using said multimedia contents
vi. generating speech transcription using said multimedia contents
vii. using said text analyzer on the said speech transcription, said OCR, user
description and augmenting the said RDF File
viii. using said RDF File for annotation of said multimedia contents
ix. getting search query on client side in the form of text and displaying search
results based on matching with said RDF File
x. getting search query on client side in the form of image and displaying search
results based on matching with said image fingerprints
xi. getting search query on client side in the form of video clip and displaying search
results based on matching with said image fingerprints
xii. getting search query on client side in the form of audio clip and displaying search
results based on matching with said audio fingerprints
xiii. getting search query on client side in the form of image and displaying search
results based on matching in said face detection and recognition.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 illustrates the system design
Figure 2 illustrates Image Analyzer's design
Figure 3 illustrates a sample response string generated by the system
Figure 4 illustrates different phases of Face Recognition process
Figure 5 illustrates Audio Analyzer's design
Figure 6 illustrates Chromaprint and AcoustId
Figure 7 illustrates search result-search by image
Figure 8 illustrates computing relevance parameter for a multimedia file
Figure 9 illustrates text analyzer operation
DESCRIPTION OF THE INVENTION
The system has three main components: the Image Analyzer, the Audio Analyzer and the Text Analyzer.
The invention analyzes the multimedia contents and annotates the contents automatically by gathering
information from image analysis, audio analysis and text analysis. This system works for all videos,
images and audio files, and enables search across all kinds of multimedia files.
The system consists of hardware components which enable it to perform the operational utility of the
present invention. As far as the input search query is concerned, the system can take text, image,
video or audio as the input. For input in textual form, standard text input devices are used. For
input in image form, the user giving the input needs a standard camera or other picture-capturing
device; similarly, the user has to use any standard video or audio capturing device for preparing an
input query of the corresponding form. On the client side, two options are provided for the end user.
First, the end user can use the client interface through any standard desktop/laptop or similar
system with Internet/network connectivity. Second, the end user can use the client interface through
any Android-enabled or BlackBerry device. The system uses hard-disk storage for the database and
other components on the server side. Any standard high-end system with a sufficient amount of
storage, RAM and computing power can act as the server, depending upon the deployment. This system
has to perform many computationally intensive operations, so CUDA-enabled computing devices are
highly beneficial. While the system is capable of running without them, it utilizes CUDA devices to
leverage the great performance they offer: CUDA enables a dramatic increase in computing performance
by harnessing the power of the graphics processing unit (GPU), and multithreading is phenomenal on
these devices. The system can use the OpenCV library for implementation of the process; many of its
functions are already ported to CUDA and run extremely fast, and for this reason many of the system's
components can run on these devices.
The aim of the system is to obtain as much information as possible about the multimedia file and the
content it describes. For this, the system needs to analyze each and every component of the
multimedia file. The multimedia content is split into two parts, the audio content and the pure video
content, which are sent to the Audio Analyzer and the Image Analyzer respectively. The third
component, the Text Analyzer, analyzes the user description. Keyframes can be extracted to serve as a
representation of the video. A continuous stream of video recording is referred to as a shot. Within
these shots there are points at which the scene changes, either because the object being recorded
moved or because the camera was moved to record a different scene. In order to represent the video,
such points must be identified; each of the resulting video segments can then be represented by one
image per segment. These images are referred to as keyframes. Videos are recorded at specific frame
rates. Using image processing algorithms, the similarity between adjacent frames, and also the
similarity over a set of frames, is computed. A threshold for similarity can be decided; as and when
the similarity falls below this point, a scene change is inferred.
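A minimal sketch of this thresholding idea is given below, written with the OpenCV Java bindings
named in the implementation section later; the 64-bin histogram-correlation measure and the
particular threshold value passed in are illustrative assumptions, not values fixed by the invention.

    import org.opencv.core.Core;
    import org.opencv.core.Mat;
    import org.opencv.core.MatOfFloat;
    import org.opencv.core.MatOfInt;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.videoio.VideoCapture;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class KeyframeExtractor {
        static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

        /** Returns one frame for each inferred scene change. */
        public static List<Mat> extract(String videoPath, double threshold) {
            VideoCapture cap = new VideoCapture(videoPath);
            List<Mat> keyframes = new ArrayList<>();
            Mat frame = new Mat(), prevHist = new Mat();
            while (cap.read(frame)) {
                Mat gray = new Mat(), hist = new Mat();
                Imgproc.cvtColor(frame, gray, Imgproc.COLOR_BGR2GRAY);
                Imgproc.calcHist(Arrays.asList(gray), new MatOfInt(0), new Mat(),
                        hist, new MatOfInt(64), new MatOfFloat(0f, 256f));
                Core.normalize(hist, hist);
                // Correlation near 1 means adjacent frames are similar;
                // a drop below the threshold is taken as a scene change.
                if (prevHist.empty()
                        || Imgproc.compareHist(prevHist, hist, Imgproc.HISTCMP_CORREL) < threshold) {
                    keyframes.add(frame.clone());
                }
                prevHist = hist;
            }
            cap.release();
            return keyframes;
        }
    }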
Since the set of keyframes represents the video file, it offers the further advantage that any
analysis to be done on the video component of the multimedia file can be done by analyzing the
keyframes themselves. Thus, object detection, face detection and identification, and other analyses
can be done on these frames: if the keyframes match the required conditions, then so must the
corresponding video in the multimedia file.
Image Analyzer
The Image Analyzer's design is illustrated in Figure 2. The keyframes can be considered a
representation of the complete video, so any text which appears in these keyframes also has to be
extracted and analyzed. The keyframes contain information in text and graphic form. The information
in text form can be extracted using Optical Character Recognition (OCR). This text helps to identify
context: for example, if the video contains a cricket match, the names of the cricket players may be
found in the keyframes. Thus, if these names are extracted, there is a possibility that the video is
about cricket, and the more words that match a specific context, the greater the likelihood that the
guessed context is right.
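The implementation section below names Asprise OCR for this step; as a minimal, hedged sketch of the
same idea, the open-source Tess4J wrapper for Tesseract is used here instead, and the data path is an
assumed install location.

    import java.io.File;
    import java.util.List;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    public class KeyframeOcr {
        /** Runs OCR over every extracted keyframe and pools the recognized text. */
        public static String textFromKeyframes(List<File> keyframeImages) throws TesseractException {
            Tesseract ocr = new Tesseract();
            ocr.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // assumed install path
            ocr.setLanguage("eng");
            StringBuilder pooled = new StringBuilder();
            for (File img : keyframeImages) {
                pooled.append(ocr.doOCR(img)).append('\n');
            }
            return pooled.toString();
        }
    }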
The second source of information in the keyframes is the graphic form. Some of the keyframes may
contain people, and these people need to be identified. This task is split into two parts: face
detection and face recognition. The system analyzes the keyframes for any faces in them. Face
detection may be done using the Viola-Jones method or using LBP (Local Binary Patterns) features,
both of which are well known to a person skilled in face detection. The second step is face
recognition. Once the faces are detected, they are matched against a database of trained faces. If
there is a hit, that information is stored in the RDF (Resource Description Framework) file
associated with the corresponding multimedia file. A recognized face can be used to further aid in
adding more information about the content from DBPedia: a query can be sent to DBPedia, and if, for
instance, the person is a celebrity, he or she will have information in DBPedia. This information is
fetched and the corresponding tags are added to the RDF file.
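A minimal sketch of such a lookup using Apache Jena, which the implementation section names for
ontology-related operations, is given below; the public DBpedia SPARQL endpoint and the dbo:abstract
property are assumptions made for illustration.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;

    public class DbpediaLookup {
        /** Fetches a short English abstract for a recognized person, e.g. "Chuck_Lorre". */
        public static String abstractFor(String resourceName) {
            String sparql =
                "PREFIX dbo: <http://dbpedia.org/ontology/> " +
                "SELECT ?abs WHERE { <http://dbpedia.org/resource/" + resourceName + "> " +
                "dbo:abstract ?abs . FILTER (lang(?abs) = 'en') } LIMIT 1";
            try (QueryExecution qe =
                     QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", sparql)) {
                ResultSet rs = qe.execSelect();
                return rs.hasNext() ? rs.next().getLiteral("abs").getString() : null;
            }
        }
    }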
If there is no hit, the detected face is stored as an image and assigned a temporary name, and the
database of detected but unrecognized faces is updated to reflect this. These faces can later be seen
and recognized by a user, who can enter the name of the person through the web interface or the
mobile apps. Once the person is identified by the user, the RDF files of the corresponding multimedia
files can be updated in the manner described earlier.
Face recognition process
When a video is uploaded, the keyframes are extracted from it. One by one, each of these keyframes is
given as input to the face detection component. The images are loaded as grayscale and the Haar
cascade is loaded. The system detects the faces in these grayscale keyframes of the video. To speed
up the detection process, the images are scaled down before the faces are identified; the
scaling-down helps because fewer pixels have to be checked. Once the faces are identified, the
appropriate regions are extracted from the original image. If faces are detected, the system extracts
the face images; if no faces are detected, the process stops here.
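A minimal sketch of this detection step with the OpenCV Java bindings follows; the 0.5 scale factor
is an illustrative assumption, and the Haar cascade (e.g. the frontal-face XML shipped with OpenCV)
is loaded by the caller.

    import org.opencv.core.*;
    import org.opencv.imgcodecs.Imgcodecs;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.objdetect.CascadeClassifier;
    import java.util.ArrayList;
    import java.util.List;

    public class FaceDetector {
        /** Detects faces on a scaled-down grayscale keyframe, then crops from the original. */
        public static List<Mat> detectFaces(String keyframePath, CascadeClassifier haar) {
            Mat original = Imgcodecs.imread(keyframePath);
            Mat gray = new Mat();
            Imgproc.cvtColor(original, gray, Imgproc.COLOR_BGR2GRAY);
            double scale = 0.5;                       // fewer pixels to scan
            Mat small = new Mat();
            Imgproc.resize(gray, small, new Size(), scale, scale, Imgproc.INTER_AREA);
            MatOfRect faces = new MatOfRect();
            haar.detectMultiScale(small, faces);
            List<Mat> crops = new ArrayList<>();
            for (Rect r : faces.toArray()) {
                // Map the detection back to full-resolution coordinates before cropping.
                Rect full = new Rect((int) (r.x / scale), (int) (r.y / scale),
                                     (int) (r.width / scale), (int) (r.height / scale));
                crops.add(new Mat(original, full));
            }
            return crops;                             // empty list: processing stops here
        }
    }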
In case faces were detected and the corresponding face images were extracted, the face recognition
process begins. The system attempts to identify the person from the face images. For each such
identification a prediction label and a distance parameter are specified. If the system is unable to
identify the person, it gives a prediction label of -1. For the other images, where an identification
is made, the distance parameter is specified along with the prediction label. Figure 3 illustrates a
sample response string generated by the system: the first value indicates the predicted label, while
the second shows the distance parameter. A lower distance value indicates higher confidence in
predicting the person correctly, the best possible distance value being 0. Figure 4 shows the face
recognition process followed by the SWIRL system.
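A minimal sketch of this prediction step is given below, assuming the LBPH recognizer from OpenCV's
contrib face module; the distance cutoff of 80, beyond which the label -1 is returned, is an assumed
value, not one fixed by the invention.

    import org.opencv.core.Mat;
    import org.opencv.face.LBPHFaceRecognizer;

    public class FaceIdentifier {
        private final LBPHFaceRecognizer recognizer = LBPHFaceRecognizer.create();

        public FaceIdentifier(String trainedModelPath) {
            recognizer.read(trainedModelPath);   // database of trained faces
            recognizer.setThreshold(80.0);       // assumed cutoff; beyond it, label is -1
        }

        /** Returns "label:distance", mirroring the response string of Figure 3. */
        public String predict(Mat grayFace) {
            int[] label = new int[1];
            double[] distance = new double[1];
            recognizer.predict(grayFace, label, distance);
            // An unknown face is reported with label -1; distance 0 is a perfect match.
            return label[0] + ":" + distance[0];
        }
    }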
The same person may be present in several different keyframes, so if the system were simply to return
the names of the persons in these keyframes, many names would be repeated many times. To combat this,
the system makes use of a hash table while searching for and recognizing faces; hash tables are
useful for checking duplicates at the time of insertion of data, so every name is returned only once.
Once this set of names is returned, the system stores the names in the database. Whenever a user
provides the name of a person as the search query, the SWIRL system presents the videos containing
that person as the query result. Another augmentation to the annotation information of multimedia
contents is through object detection in the corresponding images.
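A minimal sketch of the hash-based duplicate check, pooling the recognized names of all keyframes
into a single set:

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class NameDeduplicator {
        /** Pools recognized names across keyframes; the hash set rejects duplicates on insert. */
        public static Set<String> uniqueNames(List<List<String>> namesPerKeyframe) {
            Set<String> seen = new LinkedHashSet<>();
            for (List<String> names : namesPerKeyframe) {
                seen.addAll(names);
            }
            return seen;
        }
    }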
Image-based searching in this system is enabled by generating image fingerprints using perceptual
hash generation. For an image given as a search query, a fingerprint is generated and matched against
the fingerprints of the keyframe images stored on the server. Pointers to the multimedia files whose
corresponding keyframe fingerprints match the fingerprint of the input search image are displayed as
the search result. The system is thus capable of displaying not only the multimedia files matching
the input image exactly, but also those having similar keyframes, and the threshold of similarity can
be controlled.
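The invention does not fix a particular perceptual hash; as a minimal illustrative sketch, a 64-bit
average hash in plain Java is shown below, compared by Hamming distance (the distance cutoff used as
the similarity threshold is an assumption).

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public class ImageFingerprint {
        /** 64-bit average hash: shrink to 8x8 grayscale, threshold against the mean. */
        public static long averageHash(File image) throws IOException {
            BufferedImage src = ImageIO.read(image);
            BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
            Graphics2D g = small.createGraphics();
            g.drawImage(src, 0, 0, 8, 8, null);
            g.dispose();
            long sum = 0;
            int[] px = new int[64];
            for (int i = 0; i < 64; i++) {
                px[i] = small.getRaster().getSample(i % 8, i / 8, 0);
                sum += px[i];
            }
            long mean = sum / 64, hash = 0;
            for (int i = 0; i < 64; i++) {
                if (px[i] >= mean) hash |= 1L << i;
            }
            return hash;
        }

        /** Hamming distance; a small value (e.g. <= 10) can serve as the similarity threshold. */
        public static int distance(long a, long b) {
            return Long.bitCount(a ^ b);
        }
    }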
Audio Analyzer
Audio is a major component of any multimedia file. The Audio Analyzer's design is illustrated in
Figure 5. Two different types of analysis need to be done on the audio content. The first branch
focuses on audio fingerprint generation, which enables search for any multimedia content by just
submitting an audio file. The second branch focuses on analyzing the content that is being talked
about. The audio may contain a speech, a dialogue between a group of people, or music; if there is a
possibility that the audio contains speech or dialogue, one can go for speech-to-text conversion. The
audio's fingerprint can be generated, and the audio may then be identified by matching its
fingerprint against a music database, or by utilizing the vast information stored in online music
identification solutions such as Chromaprint and AcoustId. This is illustrated in Figure 6. In this
system's implementation, Chromaprint is used for audio fingerprinting. Chromaprint aims at converting
frequencies to their equivalent notes, so even if the audio content is encoded into other formats,
the representation remains the same; a set of filters is then applied to these notes and a
fingerprint is generated. Speech transcription in the Audio Analyzer helps in understanding what is
being talked about. The output of the speech transcription is fed to the text analyzer, which is
explained in the next section. For the implementation, the Sphinx engine developed by Carnegie Mellon
University, or a similar engine, can be used for speech transcription.
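A minimal sketch of obtaining a Chromaprint fingerprint by invoking its fpcalc command-line tool in
raw mode (assuming fpcalc is installed and on the PATH); the raw 32-bit frames feed the matching step
described under 'Search by audio' below.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Arrays;

    public class AudioFingerprint {
        /** Runs Chromaprint's fpcalc tool and parses the raw 32-bit fingerprint frames. */
        public static int[] rawFingerprint(String audioPath) throws IOException {
            Process p = new ProcessBuilder("fpcalc", "-raw", audioPath).start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    if (line.startsWith("FINGERPRINT=")) {
                        return Arrays.stream(line.substring("FINGERPRINT=".length()).split(","))
                                     .mapToInt(Integer::parseUnsignedInt).toArray();
                    }
                }
            }
            throw new IOException("fpcalc produced no fingerprint");
        }
    }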
Text Analyzer
One of the most important sources of information is the textual information entered by the user as
the title of the video and in the description of the video. This source can be the most unreliable
form of information seen so far, in the sense that the user's description can contain factual errors.
At the same time, it is valuable precisely because the user describes the content the way he or she
perceives it, so the user's opinion of the content is captured; this applies not just to video but to
any multimedia content. This information is readily available in text format. Natural language
processing can be applied to the text obtained from all three sources described earlier, and finally
the information is stored in RDF format. These RDF files associated with the various video files
contain information about the corresponding multimedia files, acting as the annotation of each file,
which is further used in the searching process.
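A minimal sketch of writing such an annotation file with Apache Jena follows; the swirl# namespace
and the property names are assumptions made for illustration, not terms fixed by the invention.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.DC;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class RdfAnnotator {
        /** Writes the pooled annotations of one multimedia file as RDF/XML. */
        public static void writeAnnotation(String fileUri, String title,
                                           String transcript, String ocrText,
                                           String outPath) throws IOException {
            String swirl = "http://example.org/swirl#";   // assumed annotation namespace
            Model model = ModelFactory.createDefaultModel();
            Resource media = model.createResource(fileUri);
            media.addProperty(DC.title, title);
            media.addProperty(model.createProperty(swirl, "speechTranscript"), transcript);
            media.addProperty(model.createProperty(swirl, "ocrText"), ocrText);
            try (FileOutputStream out = new FileOutputStream(outPath)) {
                model.write(out, "RDF/XML");
            }
        }
    }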
Information from metadata
The image or video content may also have some information stored as metadata, and the system uses
this information as well. Many smartphones store the location details along with the image; these
location details can be used to obtain the name of the location, which further adds to the knowledge
base of the system.
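The invention does not name a library for this step; a minimal sketch reading the EXIF GPS block with
the open-source metadata-extractor library is given below. The coordinates can then be reverse
geo-coded to a place name, as described later.

    import com.drew.imaging.ImageMetadataReader;
    import com.drew.lang.GeoLocation;
    import com.drew.metadata.Metadata;
    import com.drew.metadata.exif.GpsDirectory;
    import java.io.File;

    public class ExifLocation {
        /** Reads the GPS coordinates a smartphone stored in the image's EXIF block. */
        public static GeoLocation gpsOf(File image) throws Exception {
            Metadata metadata = ImageMetadataReader.readMetadata(image);
            GpsDirectory gps = metadata.getFirstDirectoryOfType(GpsDirectory.class);
            // The latitude/longitude pair can then be reverse geo-coded to a place name.
            return gps == null ? null : gps.getGeoLocation();
        }
    }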
Searching
The system offers a variety of ways for searching through the multimedia content in the system.
Depending upon the convenience of the user the appropriate way for searching can be followed.
Search by Image
This is a novel feature and is not present in other competing systems. Suppose a user has just an
image or a screenshot and wants to search the multimedia contents by submitting this image; this
feature enables that. During the video processing step, the keyframes are generated, and image
fingerprints for each of these keyframes are also generated and stored. When an image is submitted to
the system as a search query, the system generates a fingerprint for the image and matches it against
those in the system's records. The keyframes whose perceptual hashes match closely with that of the
query image are arranged in decreasing order of similarity, and the corresponding multimedia files or
images in the system are displayed in the search results. The steps are shown in Figure 7.
Search by audio
During the multimedia content processing step, the audio component of the file is extracted and an
audio hash is generated. Whenever a query audio is received, its fingerprint must be determined so
that similar audio can be searched for. Chromaprint generates fingerprints from uncompressed RAW
audio, so any audio for which a fingerprint is to be generated must first be decoded into RAW form.
This is done using FFmpeg; after the audio conversion, the file is sent to Chromaprint for audio
fingerprinting. The corresponding multimedia files whose audio component's hash matches closely with
that of the query audio file are displayed in decreasing order of similarity. In place of Chromaprint,
any other similar fingerprinting tool or technique can also be used. First, the audio fingerprint of
an existing file in the system is fetched and compared with the query audio's fingerprint. To find
the similarity between the hashes, the larger of the two hash sequences is identified. Starting from
the leftmost frame of the larger hash, the Hamming distance between the two hashes is determined; the
ratio of the match to the total length of the hash is taken as the measure of similarity. The smaller
hash is then slid towards the right, determining the Hamming distance at each point, and the minimum
Hamming distance is recorded. This process is carried out for all the audio files, and the one(s)
recording the lowest value form the closest pair; thus the audio most similar to the query audio is
determined.
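A minimal sketch of this sliding comparison over raw Chromaprint frames (see the fingerprinting
sketch above); normalizing by the mean Hamming distance per 32-bit frame is one reasonable choice,
assumed here for illustration.

    public class AudioMatcher {
        /**
         * Slides the shorter raw fingerprint over the longer one and returns the
         * minimum mean Hamming distance per 32-bit frame (0 = identical audio).
         */
        public static double minDistance(int[] a, int[] b) {
            int[] longer = a.length >= b.length ? a : b;
            int[] shorter = a.length >= b.length ? b : a;
            double best = Double.MAX_VALUE;
            for (int offset = 0; offset + shorter.length <= longer.length; offset++) {
                long bits = 0;
                for (int i = 0; i < shorter.length; i++) {
                    bits += Integer.bitCount(longer[offset + i] ^ shorter[i]);
                }
                best = Math.min(best, (double) bits / shorter.length);
            }
            return best;
        }
    }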
Search by face
In this case, the user uploads a query image to the system, and the system attempts to detect and
recognize any face(s) present in it. The system then searches for multimedia files which contain all
or most of the people present in the query image. Based on the face(s) detected, the matching
multimedia files are presented to the user. Ordering is done on the basis of the number of people who
are present in both the multimedia file and the query image: the more hits in a multimedia file, the
more relevant the file.
Search by text
This refers to the case when the user decides to search by text in the system. For the purpose of
displaying results in decreasing order of relevance, a relevance parameter is defined. The user types
the search query; the terms are searched first in the user description and title, and if there is a
hit, the relevance parameter is updated. The search is then made in the speech transcriptions, and
the parameter is updated on a hit. The process is carried out for the OCR text, then for the RDF
file, and finally a search is done in the list of people recognized in the videos. After this is
done, the final value of the relevance parameter is available. All multimedia content with a non-zero
parameter value is displayed in the results; to show the most relevant result first, the content is
displayed in decreasing order of the value of the relevance parameter. Figure 8 illustrates the
process.
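A minimal sketch of this accumulation; the per-source weights are assumptions chosen for
illustration, since the invention only requires that each hit increase the parameter.

    public class RelevanceScorer {
        /** Weighted hit counts per source; the weights are illustrative assumptions only. */
        public static int relevance(String query, String titleAndDescription,
                                    String speechTranscript, String ocrText,
                                    String rdfTags, String recognizedPeople) {
            String q = query.toLowerCase();
            int score = 0;
            if (titleAndDescription.toLowerCase().contains(q)) score += 4; // searched first
            if (speechTranscript.toLowerCase().contains(q))    score += 2;
            if (ocrText.toLowerCase().contains(q))             score += 2;
            if (rdfTags.toLowerCase().contains(q))             score += 1;
            if (recognizedPeople.toLowerCase().contains(q))    score += 3;
            return score;  // files with a non-zero score are shown, highest value first
        }
    }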
Search by video
This refers to the case when the user wants to find a similar video. The user submits a video; for
instance, the user can submit a trailer of a movie and get the movie as the result. The keyframes are
extracted from the query video, the image hashes are generated, and these are searched for in the
database. While matching, the similarity of the keyframes is important, but the ordering of the
keyframes is also important. For instance, if the 1st and 2nd keyframes are similar to the 134th and
138th keyframes of another video, then the query video is more similar to this video than to one
where the keyframes are similar to the 148th and 147th keyframes of another video: in the first case
the ordering is the same, i.e. increasing, whereas in the second scenario the order of the keyframes
of the second video is reversed.
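A minimal sketch of one way to score such ordering consistency: given the index of each query
keyframe's best match in a candidate video, count how many adjacent pairs preserve increasing order.
This particular scoring formula is an assumption for illustration, not one fixed by the invention.

    import java.util.List;

    public class SequenceScorer {
        /**
         * A run like (134, 138) keeps increasing order and scores 1.0;
         * a reversed run like (148, 147) scores 0.0.
         */
        public static double orderingScore(List<Integer> matchedIndices) {
            if (matchedIndices.size() < 2) return 1.0;
            int inOrder = 0;
            for (int i = 1; i < matchedIndices.size(); i++) {
                if (matchedIndices.get(i) > matchedIndices.get(i - 1)) inOrder++;
            }
            return (double) inOrder / (matchedIndices.size() - 1);
        }
    }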
Tools and Technologies used
The previous sections describe the complete processing of the system, the methodologies involved, and
the hardware components which enable it to perform the operational utility of the present invention.
The system as a whole, and the functionalities it provides, are either novel or improved as compared
to existing similar systems. Based upon the above description, one can develop a real-time working
system; based upon it, we have also developed one of the possible working systems. The system is
built using Servlets and Java Server Pages (JSP); besides these, JavaScript (JS), HTML 5 and CSS are
used. For implementing the different components of the project, Apache HttpClient/HttpComponents,
MySQL, Sphinx 4, Asprise OCR and FFmpeg are used. OpenCV is used for image processing operations. For
the ontology-related operations, Jena is used. For text analysis, AlchemyAPI, a cloud-based text
mining platform, is used. Tomcat handles the entire system. Apart from this, we have also developed
Android and BlackBerry based applications serving as clients, showing that the developed system can
also be accessed or deployed on Android- or BlackBerry-enabled devices. The working process of this
implemented system is given below.
Working process
The system uses database to store the details regarding the video name, the image serving as the
thumbnail for the video file, the title and description entered by the user among the other details.
FFmpeg tool is used for audio extraction. A process is created by the server. This process provides the
parameters for audio extraction to FFmpeg tool by making an exec call. The multimedia file uploaded
on the system can be in any format, in order to make it compliant with the WebM format it needs to
be encoded. Another process is created which uses FFmpeg and provides it the parameters to encode
the video to the WebM format. As soon as the audio extraction process ends, another process is created
and speech to text conversion process begins. Once the video encoding process ends, the original video
is deleted and the new WebM format compliant video is stored. After the video encoding process, the
. keyframes are extracted. OCR is done on each of these images, all the text is extracted. Once the text is
obtained from speech to text, OCR and user's description, this text is stored for retrieval later. This
entire text is sent to the AlchemyAPI's server. RDF file is obtained and stored. Also the face detection,
object detection and image hash details are stored. Similarly the audio fingerprint is also added.
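A minimal sketch of the two FFmpeg exec calls described above; the exact sampling rate, channel count
and codecs are assumptions chosen for illustration.

    import java.io.IOException;

    public class MediaTranscoder {
        /** Extracts the audio track as 16 kHz mono WAV for speech-to-text. */
        public static void extractAudio(String in, String outWav)
                throws IOException, InterruptedException {
            run("ffmpeg", "-y", "-i", in, "-vn", "-ac", "1", "-ar", "16000", outWav);
        }

        /** Re-encodes the uploaded file to the WebM format used for storage. */
        public static void encodeWebm(String in, String outWebm)
                throws IOException, InterruptedException {
            run("ffmpeg", "-y", "-i", in, "-c:v", "libvpx", "-c:a", "libvorbis", outWebm);
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("ffmpeg exited abnormally");
            }
        }
    }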
Similarly, applications through which Android and BlackBerry phone devices can act as clients have
also been developed.
The SWIRL system architecture helps it combat the issues which plague similar systems. Unlike other
systems, it does not depend on only one source of information: it utilizes all the sources of
information available, namely the video, the audio and the user description. The three different
components analyze the multimedia file and understand the content it is about, thus making the system
more efficient than others. The SWIRL system searches beyond the information the user provides; it
substantially enriches the annotations of a multimedia file. Thus, even for the conventional approach
of searching through text, the results are far better than those of existing approaches. Systems
relying only on user-entered data cannot hope to return search results for queries where direct data
is not available; SWIRL makes that a possibility.
The search-by-image feature offered by SWIRL is a novel approach developed only in SWIRL and in no
other system. Unlike other search engines, in which the user sends an image to the search engine and
the search engine returns similar images, the SWIRL system starts at the point where other systems
leave off: it not only searches for images similar to the one the user submits, but is also capable
of identifying video content in which this image appears as a scene. This is only possible because of
the SWIRL system's ability to analyze different kinds of multimedia content. The SWIRL system
operates unhindered by video, audio and image format issues, and is capable of handling the newest as
well as most of the older formats.
Similarly, audio fingerprinting techniques and speech transcription add to the arsenal of features
which the SWIRL system has to offer. The SWIRL system architecture also uses the EXIF data: it uses
reverse geo-coding to determine the location where the image or video content was captured. It is not
limited by platforms and has different components developed on different platforms, all working in
tandem. Apart from this, the system learns and trains itself, as is visible in the face recognition
part: for unrecognized faces, it asks users of the system to identify the person and trains itself to
identify the person in the future, so its knowledge is not static and evolves with time. It provides
features like text analysis, keyframe extraction, OCR, speech transcription, audio fingerprinting,
image fingerprinting, face detection and face recognition.
As per a study done by Nielsen, the most popular multimedia platforms are YouTube, Vimeo and
Metacafe. Unlike these systems, the SWIRL system does not limit searching to the description and
title entered by the user: text analysis is done, and along with that, more information is fetched
about the entities identified in the content. Thus, if the user description mentions 'Chuck Lorre'
somewhere, the text analyzer will identify Chuck Lorre to be a person and fetch more information
about him, such as 'TV Director'. Search in the speech transcription of a multimedia file is provided
by the presented SWIRL system; Vimeo and Metacafe lack this feature, and YouTube provides speech
transcription for only a limited number of multimedia files. SWIRL also offers search in the OCR
text, which none of the other systems offer, and fetching more information about entities after text
analysis of the user description and title is likewise not offered elsewhere. SWIRL thus offers
enhanced searchability for search by text, in which the user inputs the search query in the form of
text using any standard text input device such as a keyboard.
The SWIRL system offers search based on images, taking an image as the search query. The query image
can be taken by the user from any other source or can be captured using any camera or other device
capable of capturing pictures. This is a novel feature: the user can provide an image as the input,
and the videos containing the exact scene in the image, or a similar scene, are shown. The results
are displayed in decreasing order of similarity. The system contains videos, images and audio, so the
results span different multimedia types, i.e. videos and images. What is meant is that if the system
contains a scene exactly the same as, or similar to, the image provided for search, that video is
shown in the search results; apart from this, if a similar image exists in the system, it too is
shown in the result. Only the results which cross the similarity threshold are displayed. No other
system is known to have this feature; YouTube, Vimeo and Metacafe all lack it.
This system also provides searching based upon an audio-type input query. The query audio file can be
taken by the user from any other source or can be captured using any device capable of capturing
audio. The audio is submitted to the system, and the system searches for multimedia files containing
similar audio, i.e. audio files acoustically similar to the one submitted as the query. The results
are displayed in decreasing order of similarity. It should be noted that any portion of the audio
track of a multimedia file can be similar to the audio submitted for search, and the system will
still display the result. Services like SoundHound identify a song if a small clip of the audio is
submitted to them; however, these services are not capable of telling which multimedia file contains
this audio in its audio track. YouTube does use audio fingerprinting for checking whether content
uploaded by a user violates copyright, but no such search feature is available to the end user.
Neither Vimeo nor Metacafe offers this feature.
This system also provides searching based upon a face as the input query. The SWIRL system recognizes
faces for which it has been trained in a video or image. For unknown faces, it detects the faces and
stores them separately so that users of the system can provide the name of the person whose face is
displayed; as and when the name is provided, the system learns to identify that particular person.
Apart from this, the SWIRL system fetches more information about this person from DBPedia, which
further enhances searchability. YouTube, Vimeo and Metacafe do not provide such a feature. YouTube
provides an option of face blurring at the time of video upload, so one can assume that it does
detect faces, but no information regarding the recognition of persons in videos is known; the same
goes for the other available systems. Thus again the SWIRL system leads here.
The other similar existing system, SemWebVid, provides search by text based upon the user
description, text and some information beyond that, but all the other features and searching options
provided by our system are not available in it. Video Google is a web search system, but it only
provides search by OCR and does object identification, and lacks on all the other points discussed
above. Further, the present system needs relatively little storage space and is relatively fast in
searching, and hence can work on a relatively low-end computing device. The method in the SWIRL
system can also be adapted to work for multimedia content search over a simple network without
Internet connectivity, or for multimedia content search on a standalone system by the user of that
system.
We claim:
1. A system for providing web search for multimedia contents containing:
i. an image analyzer;
ii. an audio analyzer; and
iii. a text analyzer
wherein the said system is capable of storing and searching multimedia video, audio and image
content without imposing restriction on the format of the contents;
wherein said system analyzes multimedia contents and annotates the contents by means of the
said components characterized in that said system contains a video encoder the output of which
is used for a keyframe extractor and the said image analyzer uses the output of the keyframe extractor.
2. The system as claimed in claim 1, wherein said image analyzer means has an optical character
recognition (OCR), object detection, face detection and recognition, the output of OCR is given
as input to text analyzer and the output of said face detection and recognition and object
detection is used to retrieve information from DBPedia, wherein the output of processing from
DBPedia is added to an RDF file for tagging of the multimedia file and said face recognition enables
searching multimedia contents based upon the faces in its image component.
3. The system as claimed in claims 1 and 2, wherein said system generates image fingerprints of
the output of the keyframe extractor to enable searching multimedia contents by its image
component and to enable searching multimedia contents using video clip.
4. The system as claimed in claims 1 to 3, wherein said audio analyzer means has an audio
fingerprinting and speech transcription, said audio fingerprinting is used to enable
searching multimedia contents by its audio component.
5. The system as claimed in claims 1 to 4, wherein said text analyzer takes input from speech
transcription, OCR, and information entered by the uploading user, the said output of text
analyzer being stored in the RDF file representing annotation information of the multimedia
content.
6. The system as claimed in claim 1, wherein said system search multimedia content by
sending text as search query based upon the annotations provided by said RDF file.
7. The system as claimed in claim 1, wherein said system search multimedia content by
sending an image as search query.
8. The system as claimed in claim 1, wherein said system search multimedia file or audio
file containing the audio clip sent as search query.
9. The system as claimed in claim 1, wherein said system search multimedia file containing
the video clip sent as search query, wherein said system searches the multimedia content
based on the identified faces in the image given as search query.
10. A method for providing web search for multimedia contents comprising the steps of:
i. submitting multimedia contents to server along with user description
ii. generating keyframes and image fingerprints from said keyframes
iii. accessing the said keyframes by said optical character recognition, said object
detection and said face detection and recognition;
iv. getting information from DBPedia for the faces detected by said face recognizer
and augmenting the RDF File
v. generating audio fingerprints using said multimedia contents
vi. generating speech transcription using said multimedia contents
vii. using said text analyzer on the said speech transcription, said OCR, user
description and augmenting the said RDF File
viii. using said RDF File for annotation of said multimedia contents
ix. getting search query on client side in the form of text and displaying search
results based on matching with said RDF File
x. getting search query on client side in the form of image and displaying search
results based on matching with said image fingerprints
xi. getting search query on client side in the form of video clip and displaying search
results based on matching with said image fingerprints
xii. getting search query on client side in the form of audio clip and displaying search
results based on matching with said audio fingerprints
xiii. getting search query on client side in the form of image and displaying search
results based on matching in said face detection and recognition.
Dated this the 30th day of December 2013.
Juhi Srivastava
of Rightz Intellectual Property Services
Attorney for the Applicants