System And Method Of Curating Social Media Content
Abstract:
The present invention envisages a system and method of analyzing the tweets and the hyperlinked information sources for identifying the events that generate significant public interest, in order to present them in a concise and attractive manner upon a user interface. The events are filtered based on their interestingness and correlated to generate event maps, which enable semantic navigation across the events and associated sub events before they are presented on the interface.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention: SYSTEM AND METHOD OF CURATING SOCIAL MEDIA CONTENT
Applicant:
Tata Consultancy Services Limited A company Incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India.
The following specification particularly describes the invention and manner in which it is
to be performed.
TECHNICAL FIELD
[001] The present invention relates generally to Social media analytics and more
particularly to a system and method for extraction and compilation of events and sub events shared on social networking platforms.
BACKGROUND
[002] Micro-blogging services have gained enormous popularity with the public at
large for reporting new events and engaging in discussions about reported events. Several incidents in the recent past have illustrated the socio-political importance of social media sites like Twitter and Facebook, where people share news, experiences and interests. The rapid ability to reach a large audience with near-zero latency has turned these media into a veritable snapshot of the collective thoughts of the globe. Intelligence and investigative analysts are also turning to social media to gather insights about people, groups, organizations, networks and about past and future events. Social media not only reveals people's reactions to major planned events, but also contains information about upcoming, unplanned and often localized sub-events around an event. Some events generate more public interest than others and are more vigorously discussed on social media. While there has been earlier research to establish the popularity of individual discussion threads, e.g. tweets and retweets, there has been no effort to estimate public interest around the events that are discussed in the tweets.
[003] While there have been earlier efforts to analyze tweet text and metadata, so
far there has been no attempt to analyze the cited contents in the context of the tweets. The tweets often cite images, video clips, news articles and other information objects related to the subject of discussion, which provide much more information than the tweet text. Such contents are rarely associated with the major event, and seldom is any effort made to prioritize these events as per the user's priority for presentation before the user. Further, in view of millions of tweets being generated each day, identifying relevant events or subevents and tracking them is a challenging task. The noisy nature of the content makes information extraction difficult. Though a large volume of work exists on event extraction from tweets that requires consideration of the whole Twitter space, little has been done to
understand the significance of collating small sub-events with the major events to create event maps.
[004] Typically, these events are characterized by several facets, such as
Temporal, Spatial and Informational. Most of the earlier works have viewed these events as activities situated in location and time, focusing either on associating a unique time-stamp with an event or on defining events as named-entity centric. Nothing has been done to provide an action-centric definition of these events in order to generate collated event maps.
[005] Event detection from vast volumes of content has attracted the attention of
several researchers in the recent past. For example, many have considered identifying events from a continuous stream of news documents. Extraction of meaningful semantic attributes like names, time references, location etc. from noisy text was explored by J. Makkonen, H. Ahonen-Myka and M. Salmenkivi in "Simple semantics in topic detection and tracking". These attributes help in improving the quality of event extraction, but that work never explored collating each small sub-event with the major event for presentation to the user as per his choice and interest.
[006] Extracting relevant information from social media content is a recent
phenomenon. Prior art has considered Flickr tags along with other content like images, temporal and spatial tags to detect events. Even the use of clustering has been explored for event identification from tweets. Techniques for effective selection of quality event content to improve event browsing and search were proposed in the recent past. Again, using multiple social media sites for information retrieval has been extensively researched and proposed. The use of different text and author properties to judge quality of content in Yahoo! Answers has also been considered. Furthermore, graph-based techniques have been explored to extract high-quality information from social media. Presenting summarized views of tweet content using event extraction, visualization and analytics was considered in many prior literature surveys.
[007] As will be acknowledged from the aforementioned prior art, much has
been explored and exploited in the field of social media analytics. Still, neither have the factors determining the event-viewing choice of the user been considered, nor has there been any
attempt to associate these major events with smaller planned or unplanned sub-events that may be of significant interest to users.
[008] In the light of the foregoing problems and limitations, there exists a need for a
system and a method that can analyze the content and context of social media data and related content to arrive at a set of event maps arranged in order of the user's priority and interest.
OBJECTIVES OF THE INVENTION
[009] The principal object of the present invention is to provide in-context
analysis of tweet data, hyperlinked information sources like images, video clips and news articles, and other private and public repositories to discover multimedia contents related to events being discussed on social media.
[0010] Another significant object of the invention is to correlate the relevantly
similar events derived from multimodal inputs, and to enable semantic navigation across the correlated events.
[0011] It is another significant object of the present invention to prioritize the
events based on their interestingness factor for presentation.
[0012] Still another object of the present invention is to present an event map
establishing relationship between the major event and related smaller sub-events that may be significantly used to build effective predictive analytics framework.
SUMMARY
[0013] This summary is provided to introduce aspects related to system and method
of curating social media content and the aspects are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[0014] An important aspect of the present invention relates to a method for
determining interestingness factor of one or more events and associated subevents, the
method comprising:
receiving multimodal data;
processing the multimodal data for extracting a plurality of events, subevents associated with each event of the plurality of events, and a set of predetermined attributes associated with each event of the plurality of events;
generating a plurality of titles for the plurality of events;
segregating the plurality of events into event groups based upon similarity of titles present in the plurality of titles;
calculating an interestingness factor for each event group;
prioritizing the event groups based upon the interestingness factor; and
generating an event map based upon prioritizing the event groups, wherein the event map is indicative of a link between the event groups and the subevents associated with the event groups, and wherein the event map enables a user to select and navigate the event groups and the subevents;
wherein the receiving, the processing, the generating, the segregating, the calculating, the prioritizing, and the generating are performed by a computing device.
[0015] An aspect of the invention utilizes conditional random field technique for
processing the multimodal data.
[0016] In another aspect, the method upon analysis of social media content
identifies and prioritizes the events that have generated significant public interest in order to present them in a concise and attractive manner. The events are filtered based on their interestingness for presentation. The metric used by the present invention for interestingness is based on several facets of the discussions, namely information content, popularity, recentness and novelty.
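As an illustrative sketch only, the four facets named above could be combined into a single score as a weighted sum and used to order event groups. The facet values (assumed to be normalized to the range 0 to 1), the equal weights, and the names Facets, interestingness and prioritize are assumptions made for this example; the specification does not fix any particular formula.

```python
from dataclasses import dataclass


@dataclass
class Facets:
    """Hypothetical facet scores for one event group, each assumed to lie in [0, 1]."""
    information_content: float
    popularity: float
    recentness: float
    novelty: float


# Assumed equal weights; the specification does not prescribe a weighting.
WEIGHTS = {
    "information_content": 0.25,
    "popularity": 0.25,
    "recentness": 0.25,
    "novelty": 0.25,
}


def interestingness(facets, weights=WEIGHTS):
    """Weighted sum of the four facets (one possible combination rule)."""
    return sum(w * getattr(facets, name) for name, w in weights.items())


def prioritize(groups):
    """Return event-group names ordered from most to least interesting."""
    return sorted(groups, key=lambda g: interestingness(groups[g]), reverse=True)
```

A different combination rule, for example a product of facets or learned weights, would fit the same interface.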
[0017] In still another significant aspect, the present invention provides an event
processing system for determining interestingness factor of one or more events and associated subevents, the system comprising:
a handler configured to receive a stream of multimodal data comprising tweet text and information cited therein;
an engine configured to:
process the multimodal data, so received from the handler, for extraction of events and associated subevents therefrom along with a set of predetermined attributes,
construct a best suited title for each extracted event in an ordered sequence of the attributes; and
compute an interestingness factor of grouped event titles, wherein the engine interacts with a compiler to receive the grouped event titles;
the compiler configured to perform grouping of relevantly similar event titles; and
a user interface coupled to the engine to display an event map reflective of a link between the event groups and the associated subevents.
[0018] This Summary is provided to introduce a selection of concepts in a
simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0019] Additional features and advantages of the invention will be set forth in the
description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. These and other features of the present invention will become more fully apparent from the following description, or may be learned by the practice of the invention as set forth hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] The detailed description is described with reference to the accompanying
figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
[0021] Figure 1 illustrates a block diagram of system architecture, in accordance
with an embodiment of the present subject matter.
[0022] Figure 2 illustrates event title extraction as practiced in accordance with one
other embodiment of the present subject matter.
[0023] Figure 3 shows compilation of sub-events, in accordance with
an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0024] Some embodiments of this invention, illustrating all its features, will now be
discussed in detail.
[0025] The words "comprising," "having," "containing," and "including," and other
forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
[0026] It must also be noted that as used herein and in the appended claims, the
singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, the preferred systems and methods are now described.
[0027] Definitions:
Tweet: The term "tweet", as used herein, refers to any act or occurrence that is shared, updated or commented upon on various social networking platforms, and is not limited, in any form or expression, to tweets posted or updated on a specific microblogging site such as Twitter.
The important event attributes that are being considered for the purposes of present invention, are being defined as:
Subject: The subject of an event is assumed to have the same connotation as the grammatical subject of the English sentence which is analyzed for event extraction. The subject is most often a noun-phrase or a pronoun.
Action: This alludes to the main activity that is being discussed in the sentence. Actions are represented by verbs or verb phrases.
Object: The object is usually the recipient of the action.
Time: Time association to events can be either explicit or implicit. Explicit co-occurrence of
a date with an event in a tweet is very rare, and even when it occurs, it need not necessarily be the date of the event. Implicit occurrences of time arise when people mention terms like "today", "next month", "this week" etc. An explicit event time can be computed from implicit occurrences by considering the time-stamp of the tweet. Time associations can also be intervals.
Location: Location is related to place of occurrence of an event. Location also can be fairly complex to determine depending on whether it is explicitly or implicitly mentioned. Location descriptions may also be at multiple granularities without any standardized or formal notations. For example, terms like "Times Square, New York" or "the Olympics Stadium", or "the library building" or "my town" all can be indicative of locations. Some of these are absolutely deterministic in nature while others can be derived from associated contextual information.
Contextual information: Additional description provides more context to the core underlying activity. For example, in the tweet "Thorpe fails in #Olympic Bid for 200 m Freestyle", the context for failing by Thorpe is provided by the event name, i.e. 200 m Freestyle.
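The resolution of implicit time mentions described under the Time attribute can be sketched as follows. This is a minimal illustration, assuming a small fixed vocabulary of expressions, weeks that start on Monday, and a hypothetical function name resolve_time; it returns an interval, since time associations may be intervals.

```python
from datetime import datetime, timedelta


def resolve_time(term, tweet_time):
    """Map an implicit time expression to a (start, end) interval
    anchored on the tweet's time-stamp; return None if unrecognized."""
    day = tweet_time.replace(hour=0, minute=0, second=0, microsecond=0)
    term = term.lower().strip()
    if term == "today":
        return day, day + timedelta(days=1)
    if term == "this week":
        # Assumption: the week starts on Monday.
        start = day - timedelta(days=day.weekday())
        return start, start + timedelta(days=7)
    if term == "next month":
        year, month = (day.year + 1, 1) if day.month == 12 else (day.year, day.month + 1)
        start = day.replace(year=year, month=month, day=1)
        end_year, end_month = (year + 1, 1) if month == 12 else (year, month + 1)
        return start, start.replace(year=end_year, month=end_month)
    return None
```

For a tweet stamped 30 July 2012 at 15:45, "next month" resolves to the interval from 1 August 2012 up to 1 September 2012.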
[0028] In one general embodiment, the present invention aims to provide a system
and method to analyze and correlate tweet texts and related contents, including but not limited to images, video clips and news articles, available on social platforms for enhanced presentation before the end user. The information extracted from social platforms is redefined and critically analyzed to filter the events based on their interestingness, which are then prioritized for their further correlation. The interestingness, however, encompasses several other facets of discussion, including but not limited to information content, popularity, recentness and novelty. Finally, the system evolves an event map from the
information correlated based on their content similarity and their capability to generate public interest. Such event maps are derived, processed and presented in a concise and attractive manner, with event order based on the significance assigned to them.
[0029] In another embodiment, the method of event extraction involves in-context
analysis of multimodal data comprising tweet data and other hyperlinked information sources. The input data may further comprise original tweets as well as re-tweets. Most of the processing is done on the original tweets to avoid redundancy. Tweets often embed one or more URLs to point to external information sources, such as image / video sharing sites and news archives. Information related to an event is also collected by searching public and private media repositories and through Internet search engines. The news articles are processed to extract the important fields, namely the title, the theme image and a short description. The images and videos are processed through image and video processing techniques to identify duplicates. The tweets are also indexed by their hash tags.
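A minimal sketch of how embedded URLs and hash tags might be pulled out of tweet text and used as index keys is given below. The regular expressions, the lower-casing of hash tags and the function name index_tweets are assumptions for illustration only.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")


def index_tweets(tweets):
    """Build two inverted indexes from {tweet_id: text}:
    hash tag -> tweet ids, and embedded URL -> tweet ids."""
    by_hashtag = defaultdict(set)
    by_url = defaultdict(set)
    for tweet_id, text in tweets.items():
        for url in URL_RE.findall(text):
            by_url[url].add(tweet_id)
        # Remove URLs first so that URL fragments are not read as hash tags.
        for tag in HASHTAG_RE.findall(URL_RE.sub(" ", text)):
            by_hashtag[tag.lower()].add(tweet_id)
    return by_hashtag, by_url
```

The URL index is also what allows tweets citing the same image, video or article to be related to one another later in the pipeline.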
[0030] Once the tweets have been processed, events are extracted from the tweet
texts and news headlines. News headlines generally contain more well-formed text than the tweets and, whenever present, become rich sources of events. The event, for the purposes of the present invention, includes all planned and unplanned, small or big events from an incoming stream of tweets revolving around a major event. The preferred embodiment of the present invention extracts all unique activities that are reported as planned or executed on a social networking site around any major event. Assuming that the major event is known, in the absence of specific information about location or time available for the sub-events, these are assumed to be co-located with the major event. A further aim is to provide a mechanism for easy reporting and browsing of tweet content around the identified events and to provide a measure of the interest generated around the event.
[0031] The events can be generated by first-hand experience of the users or
announcements and reports in public media capable of arousing public interest. Thus, in one embodiment of the present invention, an event is characterized by the time when it is being discussed, which can be quite distinct from when it actually happened or will happen.
[0032] Referring now to Figure 1, a block diagram of overall system 100
employing different constituting modules is presented. Detailed functioning of the relevant modules is provided here below:
[0033] Handler 101: This module is primarily responsible for collecting
multimodal input inclusive of tweets and other hyperlinked information cited therein. The handler 101 further attempts to extract additional description about the event from the source tweets or hyperlinked information sources. The additional description provides context to the underlying sub-event title extracted from the tweet (explained in later sections). Event descriptions and context together provide interesting inputs about how descriptions or content around an event change over time. In an embodiment, the module 101 employs the garden-hose APIs that are provided by the Twitter micro-blogging service to collect tweets about a specific topic. The tweets are provided as a stream, which acts as an input to the system.
[0034] Engine 102: The incoming multimodal data is processed for information
extraction by preprocessing module 102(a) of the engine implementable upon the processing unit. This includes tokenization and elimination of URLs and special characters. A Natural Language Processing (NLP) parser is used for tokenization, lemmatization, POS tagging and Named Entity Recognition. Further, tweets shorter than a specific length are eliminated.
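The cleaning part of this preprocessing step can be sketched as below. The sketch covers only URL removal, special-character removal, whitespace tokenization and the length filter; lemmatization, POS tagging and Named Entity Recognition would be delegated to a parser and are not shown. The minimum length of three tokens and the function name preprocess are assumed values chosen for illustration.

```python
import re

MIN_TOKENS = 3  # assumed threshold; the specification does not give a value


def preprocess(text):
    """Strip URLs and special characters, tokenize on whitespace,
    and return None for tweets shorter than MIN_TOKENS tokens."""
    text = re.sub(r"https?://\S+", " ", text)   # eliminate URLs
    text = re.sub(r"[^\w\s#@']", " ", text)     # eliminate special characters, keep # and @ markers
    tokens = text.split()
    return tokens if len(tokens) >= MIN_TOKENS else None
```

For example, preprocess("Thorpe fails in #Olympic bid!! http://t.co/x") yields the five tokens Thorpe, fails, in, #Olympic and bid, while a one-word tweet is dropped.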
[0035] The engine 102 further comprises an indexer 102(b) based on SOLR, which is
an open source content search platform based on Apache Lucene. SOLR is used as the back-end indexing platform to maintain the tweet library. SOLR provides services like grouping of similar content, which is exploited to group exactly identical content. This ensures that all re-tweets are grouped together into a set, and only one representative tweet from each set is passed on to the event extractor module 102(c) of the engine 102.
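The effect of this grouping can be illustrated with a plain-Python stand-in. The sketch below does not use the SOLR API; it assumes that re-tweets carry a leading "RT @user:" prefix and that the first tweet seen in each set serves as the representative, both of which are assumptions made for this example.

```python
import re

# Assumed re-tweet convention: one or more leading "RT @user:" markers.
RT_PREFIX = re.compile(r"^(?:RT\s+@\w+:?\s*)+", re.IGNORECASE)


def normalize(text):
    """Drop re-tweet prefixes, lower-case and collapse whitespace."""
    return " ".join(RT_PREFIX.sub("", text).lower().split())


def representatives(tweets):
    """Group tweets with identical normalized content and return
    (one representative per group, the groups themselves)."""
    groups = {}
    for tweet in tweets:
        groups.setdefault(normalize(tweet), []).append(tweet)
    return [members[0] for members in groups.values()], groups
```

The size of each group is worth keeping, since it indicates how often the same content was repeated.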
[0036] Next, the event extractor module 102(c) processes the multimodal data,
including each unique tweet but not the re-tweets, to extract event titles. These events are characterized by fine-grained attributes, namely the subject, actor, the object, context and the action, which are subsequently intelligently organized to produce event maps. While the action is characterized by verbs or verb phrases, the subject and the object are characterized
by noun-phrases or pronouns. In a preferred embodiment, the present invention employs cascaded Conditional Random Fields (CRF) based coupled classifiers to first identify different semantic attributes of the event and then likely event titles. It shall however be noted that the use of the conditional random field technique is for exemplary purposes, and shall not be understood as limiting. The invention extends the approach of processing the multimodal data to include the Named Entity Recognition technique or any other linguistic methods. Accordingly, using any of the known processing techniques, the tweets are first processed to extract event attributes like actor, action, object, context, date, and location in the first phase. In the second phase, event titles comprised of various attributes are extracted.
[0037] The event titles are concise definitions of events. A short title is created as an
ordered sequence of event attributes, compliant to the rules of grammar of the underlying language. Event title extraction is a multi-stage task and is accomplished, in one preferred embodiment, with multiple Conditional Random Fields (CRFs).
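The assembly of a title from already-labeled attributes can be sketched as follows. The CRF labeling stages themselves are not shown; the attribute values are assumed to be given, and the class EventAttributes, the subject-action-object-context ordering for English and the function build_title are hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventAttributes:
    """Hypothetical container for attributes labeled in the first phase."""
    subject: Optional[str] = None
    action: Optional[str] = None
    obj: Optional[str] = None
    context: Optional[str] = None
    location: Optional[str] = None
    time: Optional[str] = None


# Assumed ordering for English titles: subject, action, object, context.
TITLE_ORDER = ("subject", "action", "obj", "context")


def build_title(attrs):
    """Join the available attributes in the assumed grammatical order."""
    parts = [getattr(attrs, name) for name in TITLE_ORDER]
    return " ".join(p for p in parts if p)
```

With subject "Thorpe", action "fails" and context "in 200 m Freestyle", build_title returns "Thorpe fails in 200 m Freestyle".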
[0038] It is generally held that discovering events from tweets by using an event
extraction algorithm poses a formidable challenge, because of malformed or insufficient text. The present invention associates these 'orphan' tweets, from which events could not be extracted, with the discovered events based on commonality of linked contents. If an orphan tweet contains a link to content, such as an image, a video or a news article, that is already associated with some events, the orphan tweet is associated with those events.
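The association of orphan tweets through shared links can be sketched as a lookup from linked content to events. The input shapes (a mapping from event title to linked URLs, and from orphan tweet id to linked URLs) and the function name attach_orphans are assumptions for illustration.

```python
from collections import defaultdict


def attach_orphans(event_links, orphan_links):
    """event_links: event title -> set of linked content URLs.
    orphan_links: orphan tweet id -> set of linked content URLs.
    Returns orphan tweet id -> set of events sharing at least one link."""
    url_to_events = defaultdict(set)
    for event, urls in event_links.items():
        for url in urls:
            url_to_events[url].add(event)

    attached = {}
    for tweet_id, urls in orphan_links.items():
        events = set()
        for url in urls:
            events |= url_to_events.get(url, set())
        if events:  # tweets with no shared link remain orphans
            attached[tweet_id] = events
    return attached
```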
[0039] The event resolution module 102(d) of the engine 102 further provides for
resolution of all event titles extracted by the event extractor module 102(c). Resolution module 102(d) performs the task of identifying similar events based on string similarity metrics. Different groups of tweets and re-tweets with modified ordering of words, variations in spelling, different verb forms, or word abbreviations etc. may give rise to the same or nearly identical event titles.
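As a rough sketch of how such resolution might work, the example below scores pairs of titles with two string similarity measures, token-level Jaccard similarity and character q-gram overlap, and merges pairs scoring above a threshold into equivalence classes using a union-find structure. The averaging of the two scores, the threshold of 0.6, the q-gram length of 2 and all function names are assumptions for illustration; checks on temporal overlap and shared media are omitted.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity over lower-cased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of character q-gram sets."""
    ga = {a.lower()[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b.lower()[i:i + q] for i in range(len(b) - q + 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def similarity(a, b):
    """Assumed combination rule: plain average of the two measures."""
    return (jaccard(a, b) + qgram_similarity(a, b)) / 2


def group_titles(titles, threshold=0.6):
    """Place sufficiently similar titles into the same equivalence class."""
    parent = list(range(len(titles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(titles)), 2):
        if similarity(titles[i], titles[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())
```

Comparing every pair is quadratic in the number of titles, which is tolerable here because only one representative per re-tweet set reaches this stage.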
[0040] Such minor variations in tweet texts give rise to many similar event
descriptions. Every pair of events is compared and the events are placed in an equivalence class if they are found to be sufficiently similar. The similarity is computed based on similarity of event titles as well as commonality of images and news articles. One preferred
embodiment of the invention employs a combination of Jaro-Winkler distance, the Jaccard similarity and Q-grams similarity for measuring similarity between the event titles. If the similarity score is greater than a certain predetermined threshold T1 and if the events overlap in time, they are considered identical and are merged into one. Further, if the match score is greater than a threshold T2 (T2