Sentiment Analysis of Twitter Data (2024)

Twitter Introduction

Recent years have witnessed the rapid growth of social media platforms in which users can publish their individual thoughts and opinions (e.g., Facebook, Twitter, Google+ and several blogs). The rise in popularity of social media has changed the world wide web from a static repository to a dynamic forum for anyone to voice their opinion across the globe. This new dimension of User Generated Content opens up a new and dynamic source of insight to individuals, organizations and governments.

Social network sites, or platforms, are defined as web-based services that allow individuals to:

  • Construct a public or semi-public profile within a bounded system.
  • Articulate a list of other users with whom they share a connection.
  • View and traverse their list of connections and those made by others within the system.

The nature and nomenclature of these connections may vary from site to site.

This package, saotd, is focused on utilizing Twitter data due to its widespread global acceptance. Harvested data, analyzed for sentiment, can provide powerful insight into a population. This insight can assist organizations by letting them better understand their target population. This package allows a user to acquire tweets using the public Twitter Application Programming Interface (API).

The saotd package is broken down into five different phases:

  • Acquire
  • Explore
  • Topic Analysis
  • Sentiment Calculation
  • Visualization

The saotd package workflow is shown in the image below, which follows an analysis from the Twitter API through to a complete result.

Sentiment Analysis of Twitter Data (1)

Packages

library(saotd)
library(dplyr)
library(stringr)
library(knitr)

Acquire

To explore the data manipulation functions of saotd we will use the built-in dataset saotd::raw_tweets.

However, if you want to acquire your own tweets, you will first have to:

  1. Create a Twitter account or sign in to an existing account.

  2. Use your Twitter login to sign in to Twitter Developers.

  3. Navigate to My Applications.

  4. Fill out the new application form.

  5. Create access token.

    • Record twitter access keys and tokens

With these steps complete you now have access to the twitter API.

To acquire your own dataset of tweets you can use the saotd::tweet_acquire function and insert your consumer key, consumer secret key, access token and access secret key gained from the Twitter Developers page. You will also need to select the #hashtags you are interested in and the number of tweets requested per #hashtag.

# Placeholder credentials. The call below reads them with Sys.getenv(), so they
# must also be available as environment variables under the same names.
consumer_api_key <- "XXXXXXXXXXXXXXXXXXXXXXXXX"
consumer_api_secret_key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

hashtags <- c("#job", "#Friday", "#fail", "#icecream", "#random", "#kitten", "#airline")

tweets <- tweet_acquire(
  twitter_app = "twitter_app",
  consumer_api_key = Sys.getenv('consumer_api_key'),
  consumer_api_secret_key = Sys.getenv('consumer_api_secret_key'),
  access_token = Sys.getenv('access_token'),
  access_token_secret = Sys.getenv('access_token_secret'),
  query = "#icecream",
  num_tweets = 100,
  distinct = TRUE)
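Because the call above reads its credentials with Sys.getenv(), it can help to confirm that those environment variables are set before contacting the API. The following is a minimal base R sketch; the helper name missing_credentials is hypothetical and is not part of saotd.

```r
# Names of the environment variables read by the tweet_acquire() call above.
credential_vars <- c(
  "consumer_api_key",
  "consumer_api_secret_key",
  "access_token",
  "access_token_secret"
)

# Hypothetical helper: return the names of variables that are unset or empty.
missing_credentials <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing <- missing_credentials(credential_vars)
if (length(missing) > 0) {
  message("Set these environment variables first: ",
          paste(missing, collapse = ", "))
}
```

One common choice is to keep such values in a user-level .Renviron file so that they never appear in the analysis script itself.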

Explore

You can acquire your own data or use the dataset included with the package. We will be using the included data raw_tweets. This dataset was acquired from a Twitter US Airline Sentiment Kaggle competition, from December 2017. The dataset contains 14,487 tweets from 6 different hashtags (2,604 x #American, 2,220 x #Delta, 2,420 x #Southwest, 3,822 x #United, 2,913 x #US Airways, 504 x #Virgin America).

set.seed(4321)
data("raw_tweets")

TD <- raw_tweets %>%
  dplyr::sample_n(size = 5000, replace = TRUE)

The first tweet of the dataset is: “@SouthwestAir I filled in the form on the website too. Darn it all. I guess I’ll just have to cross my fingers.”, and when it is cleaned and tidied it becomes:

TD_Tidy <- saotd::tweet_tidy(DataFrame = TD)

TD_Tidy$Token[1:9] %>%
  knitr::kable("html")
x
southwestair
filled
form
website
darn
guess
ill
cross
fingers

The cleaning process removes “@”, “#” and “RT” symbols, web links, punctuation, emojis, and stop words (“the”, “of”, etc.).
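The cleaning steps listed above can be illustrated with a small base R sketch. This is not the implementation used by saotd::tweet_tidy; the function clean_tweet and the short stop word list are assumptions made purely for illustration.

```r
# Illustrative stop word list; a real analysis would use a full lexicon.
stop_words <- c("i", "in", "the", "on", "too", "it", "all",
                "just", "have", "to", "my", "of")

# Hypothetical helper: clean one tweet and split it into tokens.
clean_tweet <- function(text, stop_words) {
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # web links
  text <- gsub("\\bRT\\b", " ", text, perl = TRUE)       # retweet marker
  text <- tolower(text)
  text <- gsub("[^a-z0-9 ]", "", text)                   # @, #, punctuation, emojis
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

tweet <- "@SouthwestAir I filled in the form on the website too. Darn it all. I guess I'll just have to cross my fingers."
clean_tweet(tweet, stop_words)
```

Applied to the example tweet, this sketch yields the same nine tokens as the table above, from southwestair through fingers.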

We will now investigate Uni-Grams, Bi-Grams and Tri-Grams.

saotd::unigram(DataFrame = TD) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Uni-Grams")
## Selecting by n

Twitter data Uni-Grams

word            n
united          1454
flight          1314
usairways       1073
americanair     930
southwestair    860
jetblue         813
cancelled       380
service         319
time            288
im              270

saotd::bigram(DataFrame = TD) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Bi-Grams")

Twitter data Bi-Grams

word1        word2          n
customer     service        198
cancelled    flightled      178
late         flight         85
cancelled    flighted       80
late         flightr        52
cancelled    flight         49
2            hours          40
usairways    americanair    38
3            hours          34
flight       booking        31

saotd::trigram(DataFrame = TD) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Tri-Grams")

Twitter data Tri-Grams

word1           word2        word3        n
NA              NA           NA           54
cancelled       flightled    flight       20
flight          cancelled    flightled    17
worst           customer     service      16
poor            customer     service      10
customer        service      rep          8
hours           late         flightr      8
southwestair    flight       cancelled    8
cancelled       flighted     flight       7
cancelled       flightled    flights      7
flight          cancelled    flighted     7
hours           late         flight       7
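To make the idea behind these tables concrete, here is a minimal base R sketch of counting n-grams in a single vector of tokens. The helper count_ngrams is hypothetical; the saotd functions operate on a whole data frame of tweets and may differ in detail.

```r
# Hypothetical helper: count the n-grams in a vector of tokens.
count_ngrams <- function(tokens, n) {
  stopifnot(length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # Most frequent n-grams first.
  sort(table(grams), decreasing = TRUE)
}

# Made-up token sequence for illustration.
tokens <- c("late", "flight", "cancelled", "flight", "late", "flight")
count_ngrams(tokens, 1)  # Uni-Grams
count_ngrams(tokens, 2)  # Bi-Grams
count_ngrams(tokens, 3)  # Tri-Grams
```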

Now that we have the Uni-Grams we can see that cancelled and flight are referring to cancelled flight and may be a good set of words to merge into a single term. Additionally, terms such as pet and pets could also be merged to observe more uniqueness in the data.

TD_Merge <- merge_terms(
  DataFrame = TD,
  term = "cancelled flight",
  term_replacement = "cancelled_flight")
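Conceptually, merging terms is a text substitution applied before tokenizing. The base R sketch below illustrates that idea only; merge_phrase is a hypothetical helper, and it is an assumption that saotd::merge_terms behaves in a similar, case-insensitive way.

```r
# Hypothetical helper: replace every occurrence of a phrase with a single term.
# Note that `term` is interpreted as a regular expression by gsub().
merge_phrase <- function(text, term, term_replacement) {
  gsub(term, term_replacement, text, ignore.case = TRUE)
}

# Made-up tweets for illustration.
tweets <- c("My Cancelled Flight was rebooked",
            "cancelled flight again, no refund")
merge_phrase(tweets, "cancelled flight", "cancelled_flight")
```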

Now that the terms have been merged, the new N-Grams are re-computed.

saotd::unigram(DataFrame = TD_Merge) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Uni-Grams")

Twitter data Uni-Grams

word            n
united          1454
flight          1265
usairways       1073
americanair     930
southwestair    860
jetblue         813
service         319
time            288
im              270
customer        263

saotd::bigram(DataFrame = TD_Merge) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Bi-Grams")

Twitter data Bi-Grams

word1        word2          n
customer     service        198
late         flight         85
late         flightr        52
2            hours          40
usairways    americanair    38
3            hours          34
flight       booking        31
gate         agent          29
united       im             26
usairways    flight         23

saotd::trigram(DataFrame = TD_Merge) %>%
  dplyr::top_n(10) %>%
  knitr::kable("html", caption = "Twitter data Tri-Grams")

Twitter data Tri-Grams

word1           word2      word3                 n
NA              NA         NA                    54
worst           customer   service               16
poor            customer   service               10
customer        service    rep                   8
hours           late       flightr               8
hours           late       flight                7
30              min        late                  6
centlatinasciilatinasciilatinasciicent           6
customer        service    desk                  6
jetblue         flight     delayed               6
min             late       flight                6
southwestair    flight     cancelledflightled    6

Now we can look at Bi-Gram Networks.

TD_Bigram <- saotd::bigram(DataFrame = TD_Merge)

saotd::bigram_network(
  BiGramDataFrame = TD_Bigram,
  number = 30,
  layout = "fr",
  edge_color = "blue",
  node_color = "black",
  node_size = 3,
  set_seed = 1234)

Sentiment Analysis of Twitter Data (2)

Additionally we can observe the Correlation Network.

TD_Corr <- saotd::word_corr(
  DataFrameTidy = TD_Tidy,
  number = 100,
  sort = TRUE)

saotd::word_corr_network(
  WordCorr = TD_Corr,
  Correlation = .1,
  layout = "fr",
  edge_color = "blue",
  node_color = "black",
  node_size = 1)

Sentiment Analysis of Twitter Data (3)
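One common way to measure how strongly two words are associated is the Pearson correlation between their presence indicators across tweets. Whether saotd::word_corr computes exactly this quantity is an assumption here; the base R sketch below only illustrates the idea, using made-up tweets and a hypothetical helper word_presence_cor.

```r
# Toy tokenized tweets (made up for illustration).
tweets <- list(
  c("customer", "service", "bad"),
  c("customer", "service"),
  c("flight", "late"),
  c("flight", "bad")
)

# Hypothetical helper: correlation of word presence across tweets.
word_presence_cor <- function(tweets) {
  vocab <- sort(unique(unlist(tweets)))
  # One row per tweet, one column per word; 1 if the word occurs in the tweet.
  presence <- t(vapply(
    tweets,
    function(tokens) as.numeric(vocab %in% tokens),
    numeric(length(vocab))
  ))
  colnames(presence) <- vocab
  cor(presence)
}

round(word_presence_cor(tweets), 2)
```

In this toy data "customer" and "service" always appear together, so their correlation is 1, while "customer" and "flight" never share a tweet and correlate at -1.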

Sentiment Calculation

Now that the data has been explored we will need to compute the sentiment scores for the hashtags.

TD_Scores <- saotd::tweet_scores(
  DataFrameTidy = TD_Tidy,
  HT_Topic = "hashtag")
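The score tables later in this document report negative, positive and TweetSentimentScore columns, where the score equals the positive count minus the negative count. A minimal base R sketch of that arithmetic follows. The tiny word lists and the helper score_tokens are assumptions made for illustration; they are not the lexicon or the code used by saotd.

```r
# Toy lexicon (made up); a real analysis would use a full sentiment lexicon.
positive_words <- c("great", "love", "thanks")
negative_words <- c("lost", "worst", "delayed")

# Hypothetical helper: count positive and negative tokens and take the difference.
score_tokens <- function(tokens, positive_words, negative_words) {
  positive <- sum(tokens %in% positive_words)
  negative <- sum(tokens %in% negative_words)
  c(negative = negative, positive = positive, score = positive - negative)
}

score_tokens(c("great", "flight", "lost", "love"), positive_words, negative_words)
```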

With the scores computed we can then observe the positive and negative words within the dataset.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy,
  num_words = 10)
## Selecting by n

Sentiment Analysis of Twitter Data (4)

As an example we can see that the negative term “fail” is dwarfing all other responses. If we would like to remove “fail” we can easily do it.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy,
  num_words = 10,
  filterword = "fail")
## Selecting by n

Sentiment Analysis of Twitter Data (5)

We can see the most positive hashtag tweets within the dataset.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @Ameri… Bing   American 2015-02-2… polp…        0       12                  12
## 2 @South… Bing   Southwe… 2015-02-1… waln…        0       10                  10
## 3 @South… Bing   Southwe… 2015-02-2… Nico…        0        9                   9
## 4 @South… Bing   Southwe… 2015-02-2… Walt…        0        9                   9
## 5 @unite… Bing   United   2015-02-2… Core…        0        9                   9
## 6 @JetBl… Bing   Delta    2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative hashtag tweets within the dataset.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @JetBl… Bing   Delta    2015-02-1… Grac…       10        0                 -10
## 2 @USAir… Bing   US Airw… 2015-02-1… thec…        9        0                  -9
## 3 @USAir… Bing   US Airw… 2015-02-2… lj_v…        9        0                  -9
## 4 @JetBl… Bing   Delta    2015-02-1… Cure…        8        0                  -8
## 5 @South… Bing   Southwe… 2015-02-2… Dead…        8        0                  -8
## 6 @unite… Bing   United   2015-02-2… mace…        8        0                  -8
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Furthermore, if we wanted to observe the most positive or negative scores associated with a specific hashtag, we could also do that.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag",
  HT_Topic_Selection = "United")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @unite… Bing   United   2015-02-2… Core…        0        9                   9
## 2 @unite… Bing   United   2015-02-1… vmnk…        0        6                   6
## 3 @unite… Bing   United   2015-02-1… sash…        0        4                   4
## 4 @unite… Bing   United   2015-02-1… SFWW…        0        4                   4
## 5 @unite… Bing   United   2015-02-1… mcho…        2        6                   4
## 6 @unite… Bing   United   2015-02-1… Greg…        0        4                   4
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>
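The selection logic behind these tables can be sketched in base R as an optional filter followed by a sort. The data frame below is made up and top_scores is a hypothetical helper; only the column names hashtags and TweetSentimentScore are taken from the output above.

```r
# Made-up scores for illustration.
scores <- data.frame(
  hashtags = c("United", "Delta", "United", "Southwest"),
  TweetSentimentScore = c(9, -10, 4, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n highest-scoring rows, optionally for one hashtag.
top_scores <- function(scores, n = 6, selection = NULL) {
  if (!is.null(selection)) {
    scores <- scores[scores$hashtags == selection, , drop = FALSE]
  }
  ranked <- scores[order(scores$TweetSentimentScore, decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

top_scores(scores, n = 2)
top_scores(scores, selection = "United")
```

Sorting in increasing order instead would give the most negative tweets.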

Topic Analysis

If we were interested in conducting a topic analysis on the tweets we would then determine the number of latent topics within the tweet data.

saotd::number_topics(
  DataFrame = TD,
  num_cores = 4L,
  min_clusters = 2,
  max_clusters = 12,
  skip = 1,
  set_seed = 1234)

Sentiment Analysis of Twitter Data (6)

The number of topics plot shows that between 5 and 7 latent topics reside within the dataset. For this example we could select between 5 and 7 topics to categorize this data. In this case 5 topics will be selected to continue the analysis.

TD_Topics <- saotd::tweet_topics(
  DataFrame = TD,
  clusters = 5,
  method = "Gibbs",
  set_seed = 1234,
  num_terms = 10)

In a markdown product the topics table does not print clearly, unlike when it is printed in the console. However, the words associated with each topic can be observed in the table below.

Number  Topic 1    Topic 2    Topic 3      Topic 4        Topic 5
1       united     usairways  americanair  southwestair   flight
2       service    time       usairways    jetblue        cancelled
3       customer   plane      amp          im             hours
4       dont       gate       hold         virginamerica  flights
5       bag        jetblue    call         guys           2
6       check      hour       phone        fly            delayed
7       luggage    waiting    wait         airline        flightled
8       dm         delay      ive          flying         late
9       lost       people     cange        seat           3
10      worst      minutes    day          love           weather

One of the challenges of using a topic model is selecting the correct number of topics. As we can see in the above table, we went from 6 hashtags to 5 different topics.

While this may not be the best example to use, we will continue the topic modeling example. We would first want to rename the topics into something that would make sense. In this case Topic 1 could be luggage, Topic 2 could be gate_delay, Topic 3 could be customer_service, Topic 4 could be enjoy, and Topic 5 could be other_delay. These topics were chosen by observing the words associated with each topic. This selection could be different depending on experience and a deeper understanding of the topics.

We would then want to rename the topics in the dataframe.

TD_Topics <- TD_Topics %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^1$", "luggage")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^2$", "gate_delay")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^3$", "customer_service")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^4$", "enjoy")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^5$", "other_delay"))
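The same renaming can also be expressed as a lookup in a named vector, which keeps the mapping in one place. This is an alternative sketch in base R, not the approach used in the code above; relabel_topics is a hypothetical helper, and the labels are the ones chosen in the text.

```r
# Mapping from topic number to the label chosen above.
topic_labels <- c(
  "1" = "luggage",
  "2" = "gate_delay",
  "3" = "customer_service",
  "4" = "enjoy",
  "5" = "other_delay"
)

# Hypothetical helper: translate topic numbers into their labels.
relabel_topics <- function(topic) {
  unname(topic_labels[as.character(topic)])
}

relabel_topics(c(1, 5, 3))
```

A topic number that is missing from the mapping comes back as NA, which makes an incomplete mapping easy to spot.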

Next we would want to tidy and then score the new topic dataset.

TD_Topics_Tidy <- saotd::tweet_tidy(DataFrame = TD_Topics)

TD_Topics_Scores <- saotd::tweet_scores(
  DataFrameTidy = TD_Topics_Tidy,
  HT_Topic = "topic")

We can see the most positive topic tweets within the data set.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @JetBlue … Bing   enjoy 2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative topic tweets within the dataset.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @JetBlue … Bing   enjoy 2015-02-1… Grac…       10        0                 -10
## 2 @USAirway… Bing   gate… 2015-02-1… thec…        9        0                  -9
## 3 @USAirway… Bing   cust… 2015-02-2… lj_v…        9        0                  -9
## 4 @JetBlue … Bing   enjoy 2015-02-1… Cure…        8        0                  -8
## 5 @Southwes… Bing   cust… 2015-02-2… Dead…        8        0                  -8
## 6 @united y… Bing   cust… 2015-02-2… mace…        8        0                  -8
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Furthermore, if we wanted to observe the most positive or negative scores associated with a specific topic, we could also do that.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic",
  HT_Topic_Selection = "luggage")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @Southwes… Bing   lugg… 2015-02-2… woaw…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Visualizations

Hashtags

Now we will begin visualizing the hashtag data. The distribution of the sentiment scores can be found in the plot below.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Scores,
  color = "black",
  fill = "white")

Sentiment Analysis of Twitter Data (7)

Additionally, if we wanted to see the score distributions for each hashtag, we can find them below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag",
  bin_width = 1,
  color = "black",
  fill = "white")

Sentiment Analysis of Twitter Data (8)

We can also observe the hashtag distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

Sentiment Analysis of Twitter Data (9)

Also as a Violin plot. The chevrons in each violin plot denote the median of the data and provide a quick reference point to see if a hashtag is generally positive or negative. For example the “random” hashtag has a generally negative sentiment, whereas the “kitten” hashtag has a generally positive sentiment.

saotd::tweet_violin(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

Sentiment Analysis of Twitter Data (10)
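The medians marked in the violin plot can also be checked numerically. The base R sketch below uses made-up scores; only the column names hashtags and TweetSentimentScore are taken from the score tables earlier in this document.

```r
# Made-up scores for illustration.
scores <- data.frame(
  hashtags = c("United", "United", "United", "Delta", "Delta", "Delta"),
  TweetSentimentScore = c(-2, -1, 3, 1, 2, -4)
)

# Median sentiment score per hashtag.
group_medians <- tapply(scores$TweetSentimentScore, scores$hashtags, median)
group_medians

# Label each hashtag by the sign of its median.
ifelse(group_medians > 0, "positive",
       ifelse(group_medians < 0, "negative", "neutral"))
```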

One of the more interesting ways to visualize the Twitter data is to observe the change in sentiment over time. This dataset was acquired on a single day and therefore some of the hashtags did not overlap days. However, some did, and we can see the change in sentiment scores through time.

saotd::tweet_time(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

Sentiment Analysis of Twitter Data (11)
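A change in sentiment over time can be summarized by averaging the scores per day and per hashtag. Whether saotd::tweet_time aggregates in exactly this way is an assumption; the base R sketch below uses made-up data with the date, hashtags and TweetSentimentScore columns seen in the score tables.

```r
# Made-up scores for illustration.
scores <- data.frame(
  date = as.Date(c("2015-02-17", "2015-02-17", "2015-02-18", "2015-02-18")),
  hashtags = c("United", "United", "United", "United"),
  TweetSentimentScore = c(-2, 0, 1, 3)
)

# Mean sentiment score for each hashtag on each day.
daily <- aggregate(TweetSentimentScore ~ date + hashtags, data = scores, FUN = mean)
daily
```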

Finally, if a Twitter user has not disabled georeferencing data, the location of the tweet can be observed. However, in many cases this may not be very insightful because of the lack of data.

Topics

Now we will begin visualizing the topic data. The distribution of the sentiment scores can be found in the plot below.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Topics_Scores,
  color = "black",
  fill = "white")

Sentiment Analysis of Twitter Data (12)

Additionally, if we wanted to see the score distributions for each topic, we can find them below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic",
  bin_width = 1,
  color = "black",
  fill = "white")

Sentiment Analysis of Twitter Data (13)

We can also observe the topic distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

Sentiment Analysis of Twitter Data (14)

Also as a Violin plot. The chevrons in each violin plot denote the median of the data and provide a quick reference point to see if a hashtag is generally positive or negative. For example the “random” hashtag has a generally negative sentiment, whereas the “kitten” hashtag has a generally positive sentiment.

saotd::tweet_violin(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

Sentiment Analysis of Twitter Data (15)

One of the more interesting ways to visualize the Twitter data is to observe the change in sentiment over time. This dataset was acquired on a single day and therefore some of the hashtags did not overlap days. However, some did, and we can see the change in sentiment scores through time.

saotd::tweet_time(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

Sentiment Analysis of Twitter Data (16)


