Abstract: Twitter messages have a potential impact on public sentiment, whether in the context of international affairs, critical decision-making processes, elections, policies, business transactions, or entertainment and sports. The analysis of user data, which is often cryptic and unprocessed, therefore presents an opportunity for scientific investigation through the application of sentiment analysis techniques. In the proposed invention, Natural Language Processing (NLP) techniques were employed to preprocess the data, encompassing cleaning operations such as removing noise and irrelevant information. Subsequently, sentiment analysis was conducted by determining the polarity and subjectivity of the user tweets. The dataset was normalised and subsequently employed in a Machine Learning technique. The data was then organised using Word Tokenization, and the classification method known as Support Vector Machine (SVM) was utilised. The K-Nearest Neighbour (KNN) classification method was employed during the second phase to perform the sentiment analysis (SA). This method was subsequently evaluated using both Kaggle and real-time datasets. The Voting classification mechanism was implemented during the final phase. 4 claims & 2 Figures
Description: Hybrid Classification Approach for Sentiment Analysis in Social Media Textual Data Stream
Field of Invention
The growing use of social media, particularly the microblogging platform Twitter, has had a significant impact on people's mindset, because anything that is shared influences decisions in some manner. Sentiment is an expression of feelings that manifests in actions and often influences the behaviour of others. The present study therefore concerns how to leverage these sentiments computationally, a task called Sentiment Analysis (SA). Sentiments are analyzed using polarity-based techniques such as Machine Learning (ML) and Deep Learning. Although both techniques are efficient, it is established that when ML techniques are applied, the SA model must be trained with high accuracy. Because unclean data reduces the accuracy of model training, the present study is undertaken to achieve high accuracy in sentiment analysis, for which the model needs clean data.
Background of the Invention
The present undertaking acknowledges the increasing use of social media and how rapidly it has been influencing people's sentiments on various matters. Although users' opinions are diverse, volatile and vulnerable on every topic, retrieving knowledge from the data remains critical. Along with information about the websites users have visited, understanding that content carries important sentiments across several platforms has become essential for gauging public opinion on a given subject. As a result, a prominent sentiment analysis technique is categorizing the polarity of a text. Texts may be neutral, negative or positive in tone, but they generally refer to a range of emotions from happiness to melancholy. The word "sentiment" designates a topic that may be subjective or objective, practical or imagined, and it blurs the line between positive and negative subjects. The strategy is grounded in text analysis, since sentiment analysis depends on the propagation of myths or gossip. It involves determining the subjectivity of a belief and the outcome of a tweet; it is therefore concerned with classifying an individual's opinion into different classes according to data size and type of document. Different schemes are applied for sentiment analysis, drawing on a variety of NLP (Natural Language Processing) and ML (Machine Learning) techniques to extract sufficient features and classify text into suitable polarity labels (CN108733653B).
Opinion mining is a crucial application of NLP in addressing the interplay between human and machine language. Opinion mining is the building of a system to gather and organize data from various social network sites, including blogs, online discussion boards, social media and web-based surveys. Since people make decisions based on the available resources, and social media helps them to some extent, business houses gain an opportunity to understand buyers' attitudes, appraisals, sentiments and opinions. Similarly, policymakers and politicians make adjustments when dealing with public issues. It is now possible to create new applications using real-world data, with special emphasis on sorting, finding or analyzing text through knowledge discovery techniques, specifically on Twitter. Twitter is used to post messages (tweets) across social media and blog sites, where new issues arise from the distinctive traits of tweets, which also influence the diverse areas and methodologies discussed in sentiment analysis (US10127522B2).
Chhaya Chauhan et al. used text analysis, NLP, text preprocessing and stemming in their study to investigate complex data. To understand what people were saying, they used a variety of computational tools and procedures. The study found that multiple strategies are used to assess the sentiment of a text or sentence, because the internet contains a large collection of natural language from which the desired outcomes must be obtained. The authors created an authentic review by employing several algorithms and strategies to extract a feature-by-feature description of the item. They suggested that future work could concentrate on higher-level NLP tasks, where the best techniques or tools would be used to produce more precise findings when only the keywords were kept in the dataset and the system removed all other terms.
Zahra Rezaei et al. relied on Twitter to comprehend the content of brief messages and how it affected others, in particular because users tweet often and at a high rate. To enhance SA performance, the researchers applied filter and wrapper techniques to obtain the discriminating features of the Twitter data. They also employed the Hoeffding tree and McDiarmid tree algorithms. The McDiarmid tree method performed better, but it was noted that, because of the large volume of Twitter data, SA needed to process it quickly.
Mohammed H. Abd El-Jawad et al. created a hybrid model that used text mining and neural networks to characterize sentiments, utilizing a variety of deep learning and machine learning algorithms. Over a million tweets from five fields were collected and evaluated in the study's dataset. After training and testing on 75% and 25% of the dataset, respectively, the model was found to surpass earlier methods in accuracy. The authors also suggested further research, notably on Arabic tweets, concentrating on combining sentiments and text for sentiment analysis.
Summary of the Invention
Research in Sentiment Analysis typically assumes that the entire document under analysis expresses a reliable sentiment polarity held by a single holder towards an object. This assumption may or may not be correct for tweets collected from popular websites like Kaggle and Twitter, which may have positive, negative or neutral impacts. Even if tweets are interpretable thanks to Natural Language Processing (NLP), users are still unlikely to take in all of the information presented. Because of this, Sentiment Analysis (SA) includes a wide range of methods for the automated creation, modification and evaluation of ordinary spoken language. Pre-processing, feature extraction and classification are all steps in SA. Subsequently, the Twitter API is used to collect real-time datasets because of the noise reduction it provides by design. The Olympic 2021 Twitter dataset from the Kaggle website includes both unlabelled and labelled tweets. Data was cleaned using natural language processing methods, and sentiments were determined by analysing the polarity and subjectivity of user tweets. Support Vector Machine (SVM) classification was applied after the normalised dataset had been tokenized for organisation by the machine learning algorithm. K-Nearest Neighbour (KNN) classification was used for the SA in the second stage, with results verified on Kaggle and real-time datasets. The Voting classification approach was used in the final stage.
Brief Description of Drawings
Figure 1: Sentiment Analysis Process
Figure 2: Architecture of the Research Model
Detailed Description of the Invention
First, the data is extracted from Kaggle and from the well-known microblogging site Twitter. The sentiment analysis process is shown in Figure 1. The data captured from Twitter is in raw form, noisy and rough; it requires cleaning and must be prepared for analysis. This is a vital step, because the quality of the data determines the quality of the results. The Twitter dataset is pre-processed in several operations because it contains special characters, emojis and the like. Other operations include removing fraudulent tweets and tweets with fewer than 3 characters in order to improve the format. Then the n-gram algorithm is deployed to analyze the tweets lexically for processing the input. The data is processed to clean the input dataset and transform it into a precise form for extracting the attributes. The steps of pre-processing are Cleaning, Data Transformation, Tokenization and Word Cloud.
After pre-processing is complete, feature extraction is performed. The feature extraction algorithm takes the pre-processed data as input and assigns weights to the keywords in preparation for classification. The sample has a variety of properties, and an efficient algorithm is used to extract them from the prepared data. Furthermore, negative and positive polarity is evaluated to handle users who posted replicated tweets. For data retrieval, the major attributes of the content are represented in a form suitable for ML techniques. The input attributes are computed as a feature vector, which assists in classifying the data. The n-gram model is used to extract the features from the data. Data verification is a process in which the various kinds of data are checked for accuracy and inconsistencies once the transfer from one source to another is complete, so that they support the processes in the new framework.
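The cleaning operations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the function names `clean_tweet` and `preprocess`, the regular expressions, and the treatment of exact duplicates as a stand-in for replicated tweets are all hypothetical. Detection of fraudulent tweets is not shown.

```python
import re

# Hypothetical patterns; a production system may need broader rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")  # special characters, emojis, '#', punctuation


def clean_tweet(text):
    """Lower-case a tweet and strip URLs, mentions and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def preprocess(tweets, min_chars=3):
    """Clean tweets, then drop exact duplicates and tweets under min_chars."""
    seen = set()
    kept = []
    for raw in tweets:
        cleaned = clean_tweet(raw)
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


if __name__ == "__main__":
    sample = ["Great WIN for @TeamX!! https://t.co/abc #Olympics", "Hi", "great win for olympics"]
    print(preprocess(sample))
```

The threshold of 3 characters follows the text; everything else in the sketch is an assumption chosen for illustration.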
When the classes in the training dataset are unequally distributed, the dataset must be adjusted to an equal distribution, which is called balanced data. The class imbalance may range from a slight bias to a severe imbalance in which there is one example in the minority class for a very large number of examples in the majority class. The Synthetic Minority Over-sampling Technique (SMOTE) is then applied. SMOTE is a machine learning method that addresses the problems that arise when an imbalanced dataset is used. To illustrate how the method works, consider training data with s samples and f features in the feature space. SMOTE augments the data by creating synthetic data points based on the original data points.
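The idea behind SMOTE can be illustrated with a short, standard-library-only sketch. It assumes that each minority sample is a tuple of f numeric features and that a synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote` and its parameters are hypothetical, and the sketch is not the implementation used in the invention.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours.

    Assumes at least two minority samples, each a tuple of numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
    print(smote(minority, n_synthetic=4))
```

To balance a dataset with this sketch, one would request as many synthetic points as the difference between the majority and minority class sizes and append them to the minority class.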
After the above stages are complete, the Voting Classifier is applied. It is an ML algorithm that is trained on a hybrid of several techniques and predicts the output class as the class with the highest probability of being selected. It aggregates the outputs of every technique passed to the voting system and predicts the output class on the basis of the majority of votes. Rather than building separate models and measuring the accuracy of each, a single model is developed that is trained by these techniques and predicts the output from their combined majority vote for every output class. Supervised learning is an effective method for classifying data. The classification algorithm is trained so that unknown data can be predicted. The attributes are extracted using the K-Nearest Neighbour (KNN) technique. This technique uses K-Means Clustering (KMC) to define the centroid points, which are then used to compute the Euclidean distance (ED). Points that are similar are assigned to the same class. KNN is a supervised learning (SL) technique based on the similarity between a new case and the available cases: it places the new case into the category most similar to it among the existing classes. It stores all of the existing data and categorizes new data on the basis of similarity, so that novel data is classified into the appropriate category. The algorithm is well suited to classification because it is non-parametric, that is, it makes no assumptions about the underlying data. It is also known as a lazy learner, because it stores the dataset rather than learning from the training set immediately and acts on the dataset only at classification time. During training the algorithm simply stores the dataset; when new data is collected, it is placed in the class that is most similar to it.
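The two ideas in this paragraph, nearest-neighbour classification by Euclidean distance and majority voting over several classifiers, can be sketched as follows. This is an illustrative sketch only: it uses plain KNN without the K-Means centroid step mentioned above, the toy feature vectors are made up, and the threshold rule is a hypothetical stand-in for a trained SVM. The names `knn_predict` and `hard_vote` are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def hard_vote(classifiers, x):
    """Return the class predicted by the majority of the given classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors, made up for illustration.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["negative"] * 3 + ["positive"] * 3

classifiers = [
    lambda x: knn_predict(train_X, train_y, x, k=1),
    lambda x: knn_predict(train_X, train_y, x, k=3),
    # Hypothetical threshold rule standing in for a trained SVM.
    lambda x: "positive" if sum(x) > 1.0 else "negative",
]

if __name__ == "__main__":
    print(hard_vote(classifiers, (0.95, 1.0)))
    print(hard_vote(classifiers, (0.1, 0.1)))
```

Using an odd number of classifiers, as here, avoids tied votes in the two-class case; with three or more classes a tie-breaking rule would have to be chosen.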
The stages of sentiment analysis are shown in Figure 2 and are explained as follows.
[Figure 2 block labels: Collection of Corpus, Downloaded Tweets using Twitter API, Collected data, Pre-processing, Tokenization, Stop Words Removal, POS tagging, Opinion Word Extraction, Opinion Word Labelling, Polarity Determining, Lexicon approach, Machine Learning, Result Evaluation, Accuracy Calculation]
The first stage is corpus collection. The major task of SA is to gather the labelled datasets. The text posts are captured through the Twitter API. Thereafter, a dataset with 3 categories of reviews, negative, positive and neutral, is generated by integrating these posts.
The second stage is data pre-processing. Noise is the chief limitation of datasets obtained from Twitter. Twitter data is available in the form of simple text, hashtags, etc. This stage prepares the Twitter data before the attributes are extracted and the data is classified. Several steps are applied to eradicate noise from the Twitter dataset, which includes retweets, copies, points, and tweets containing only a single URL. This noise contributes nothing to classification accuracy and is therefore removed. The text is converted to lower case and other methods, namely PoS tagging etc., are applied.
After that, tokenization is performed. The discrete words or terms are called tokens, and this process splits a string of text into tokens.
After tokenization, stop word removal is performed. Stop words are abundant in all human languages. They are eliminated to remove low-level information from the text, so that more attention is given to significant information. Removing them also reduces the dimensionality of the dataset and shortens the training time, since the training phase then works with fewer tokens.
Finally, part-of-speech (POS) tagging is performed. Adjectives, adverbs and certain nouns are tagged because they carry subjectivity and emotion. Dependency trees are used to create the syntactic dependency patterns.
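The tokenization and stop word removal steps of the pre-processing stage can be sketched as below. The sketch is a minimal illustration under assumptions: the stop-word list is a short hypothetical sample rather than a complete list, the token pattern is one of many reasonable choices, and the function names are not taken from the source.

```python
import re

# Illustrative stop-word sample; a real system would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of",
              "and", "in", "for", "on", "it", "this"}

# A token is a run of letters, digits or apostrophes (an assumed definition).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Convert text to lower case and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little information on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    tokens = tokenize("The opening ceremony was amazing")
    print(tokens)
    print(remove_stop_words(tokens))
```

Removing the stop words leaves fewer tokens per tweet, which is the reduction in dimensionality and training time that the text describes.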
In the next phase, Opinion Word Extraction is performed. The Twitter language framework comprises a huge number of attributes, and the feature space is reduced by using only some of them. The method begins by extracting all unigrams and bigrams from the corpus. To illustrate, all unigrams and bigrams with a frequency of 5 or above are retained as candidate features. Generally, both are chosen to analyze sentiments at word level, and trigrams are employed to extend them. The next goal is to compute the frequency of every candidate attribute that was found. The term frequency is used to generate the feature vector for each tweet as: ({word1: frequency1, word2: frequency2, …}, "polarity"). After Opinion Word Extraction is complete, Opinion Word Labelling is performed. The extracted words are matched against a dictionary of positive and negative words, held in 2 files, and a polarity is assigned to the tweet on that basis. A tweet containing a positive word or hashtag is denoted with +1, and a tweet containing a negative word is denoted with -1. Finally, Polarity Determining is done, which focuses on determining the polarity of a tweet. Two techniques, the lexicon-based method and ML, are employed to estimate the polarity once the SA data has been pre-processed.
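The extraction and labelling steps described above can be sketched as follows. This is a hedged illustration: the function names, the tiny positive and negative word sets, and the rule of returning 0 when positive and negative matches cancel out are assumptions that do not come from the source. The minimum frequency of 5 and the labels +1 and -1 follow the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_features(tokenized_tweets, min_freq=5):
    """Keep unigrams and bigrams that occur at least min_freq times in the corpus."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(ngrams(tokens, 1))
        counts.update(ngrams(tokens, 2))
    return {gram for gram, count in counts.items() if count >= min_freq}


def feature_vector(tokens, candidates):
    """Term-frequency vector of one tweet, restricted to the candidate features."""
    grams = ngrams(tokens, 1) + ngrams(tokens, 2)
    return dict(Counter(g for g in grams if g in candidates))


# Hypothetical lexicon; the source describes two dictionary files instead.
POSITIVE = {"great", "win", "love"}
NEGATIVE = {"bad", "lose", "hate"}


def lexicon_polarity(tokens):
    """Return +1, -1 or 0 from counts of positive and negative lexicon words."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1 if score > 0 else -1 if score < 0 else 0


if __name__ == "__main__":
    corpus = [["great", "win"]] * 5 + [["bad", "day"]]
    candidates = candidate_features(corpus)
    tweet = ["great", "win", "today"]
    print((feature_vector(tweet, candidates), lexicon_polarity(tweet)))
```

The printed pair has the same shape as the feature vector given in the text: a dictionary of term frequencies followed by the polarity label.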
Claims: The scope of the invention is defined by the following claims:
1. The Design of Hybrid Classification Approach for Sentiment Analysis in Social Media Textual Data Stream, comprising the steps of:
a) Designing a technique to analyse the tweets that are gathered from different websites.
b) Adopting a method for feature extraction, then balancing the data and solving the issues related to oversampling.
c) Designing an architecture that describes the processing of the tweets step by step.
2. The Design of Hybrid Classification Approach for Sentiment Analysis in Social Media Textual Data Stream as claimed in claim 1, wherein a Synthetic Minority Over-sampling Technique approach is designed that forms the relations among the tweets and balances the oversampled data.
3. The Design of Hybrid Classification Approach for Sentiment Analysis in Social Media Textual Data Stream as claimed in claim 1, wherein the design led to the construction of a principal component analysis technique.
4. The Design of Hybrid Classification Approach for Sentiment Analysis in Social Media Textual Data Stream as claimed in claim 1, wherein a method of hybrid classification is adopted using the Voting Classification method with a combination of multiple classifiers.
| # | Name | Date |
|---|---|---|
| 1 | 202341065915-REQUEST FOR EARLY PUBLICATION(FORM-9) [30-09-2023(online)].pdf | 2023-09-30 |
| 2 | 202341065915-FORM FOR STARTUP [30-09-2023(online)].pdf | 2023-09-30 |
| 3 | 202341065915-FORM FOR SMALL ENTITY(FORM-28) [30-09-2023(online)].pdf | 2023-09-30 |
| 4 | 202341065915-FORM 1 [30-09-2023(online)].pdf | 2023-09-30 |
| 5 | 202341065915-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [30-09-2023(online)].pdf | 2023-09-30 |
| 6 | 202341065915-EVIDENCE FOR REGISTRATION UNDER SSI [30-09-2023(online)].pdf | 2023-09-30 |
| 7 | 202341065915-EDUCATIONAL INSTITUTION(S) [30-09-2023(online)].pdf | 2023-09-30 |
| 8 | 202341065915-DRAWINGS [30-09-2023(online)].pdf | 2023-09-30 |
| 9 | 202341065915-COMPLETE SPECIFICATION [30-09-2023(online)].pdf | 2023-09-30 |
| 10 | 202341065915-FORM-9 [28-10-2023(online)].pdf | 2023-10-28 |