twitter-sentiment.ipynb

gedankenstuecke


This notebook uses data from a Twitter archive to run a simple sentiment analysis and to track emoji usage over time using Python.


Tags & Data Sources

sentiment analysis sentiment Twitter Archive Analyzer


Last updated 6 years ago

Doing simple sentiment analyses on a Twitter archive

Here we explore how the Personal Data Notebooks can be used to get additional information out of a full Twitter archive. To use this notebook you need to have uploaded a Twitter archive into your Open Humans account through

In a first step we again start by loading all the needed modules.

After this follows a long list of function declarations of the form def function():. Let's take these for granted for now. I basically copied and pasted them from the code of the Twitter Archive Analyzer. I only removed the local time zone calculations, as we won't need them to get started and installing what they depend on would take a rather long time.

Once our packages are installed and we have declared all of our functions, we can start requesting the data from Open Humans. Using our access_token, which we get from os.environ.get('OH_ACCESS_TOKEN'), we get a JSON object that contains all of our data sources.

We then loop over all of them to identify which one is the zipped Twitter archive and save its URL. Once that's done we can just call the master function create_main_dataframe, which was declared above, to create a pandas dataframe out of it.
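The lookup described above might look roughly like the following sketch. The shape of the JSON object (a `data` list whose entries carry `source`, `basename` and `download_url` keys), the matching rule and the helper name `find_twitter_archive_url` are assumptions made for illustration, not taken from the notebook's code. The example payload is invented.

```python
import os


def find_twitter_archive_url(member):
    """Return the download URL of the first zipped Twitter archive, or None."""
    for data_file in member.get("data", []):
        # Assumed keys: "source", "basename", "download_url".
        source = (data_file.get("source") or "").lower()
        name = (data_file.get("basename") or "").lower()
        if "twitter" in source and name.endswith(".zip"):
            return data_file.get("download_url")
    return None


# The token comes from the environment, as described in the text.
# It is only needed for the real HTTP request, which is not shown here.
access_token = os.environ.get("OH_ACCESS_TOKEN")

# Invented stand-in for the JSON object returned by Open Humans.
example_member = {
    "data": [
        {"source": "fitbit", "basename": "fitbit.json",
         "download_url": "https://example.org/files/fitbit.json"},
        {"source": "twitter-archive", "basename": "twitter-archive.zip",
         "download_url": "https://example.org/files/twitter-archive.zip"},
    ]
}

twitter_data_url = find_twitter_archive_url(example_member)
print(twitter_data_url)
```

Keeping the search in a small function that only looks at the already-parsed JSON makes it easy to try out without a network connection.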

reading files
downloading files
reading index
iterate over individual files

Now we have a dataframe called twitter_data which contains a lot of metadata, along with all the tweets in the column twitter_data['text'].

| | hashtag | latitude | longitude | media | reply_name | reply_user_name | retweet_name | retweet_user_name | text | url |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018-05-11 19:58:25+00:00 | NaN | NaN | NaN | NaN | Mama Hörnchen ♿️ | MamsellChaos | None | None | @MamsellChaos oh, and ... | 1.0 |
| 2018-05-11 19:38:09+00:00 | 1.0 | NaN | NaN | NaN | None | None | None | None | The collection of personal data analysis noteb... | 1.0 |
| 2018-05-11 19:19:56+00:00 | NaN | NaN | NaN | NaN | Mama Hörnchen ♿️ | MamsellChaos | None | None | @MamsellChaos und ich update dich sobald der i... | NaN |
| 2018-05-11 19:19:39+00:00 | NaN | NaN | NaN | NaN | Mama Hörnchen ♿️ | MamsellChaos | None | None | @MamsellChaos fuer R kann ich | 1.0 |
| 2018-05-11 19:10:14+00:00 | NaN | NaN | NaN | NaN | Mama Hörnchen ♿️ | MamsellChaos | None | None | @MamsellChaos Ah, der Fitbit Data Import funkt... | NaN |

Let's now add the polarity and subjectivity of each tweet with textblob. The polarity ranges from -1 (extremely negative) through 0 (neutral) to +1 (extremely positive). The subjectivity ranges from 0 to 1, with larger numbers meaning the text is more subjective.

We add these numbers to our dataframe and for a start just remove all the 0 values, as these can also indicate a lack of data/classifications.
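As a minimal sketch of this step: `TextBlob(text).sentiment` returns a polarity and a subjectivity value, and rows with a 0 are then dropped. The helper names `textblob_scores` and `score_tweets` are hypothetical. Dropping a row when either value is 0 is an assumption, as the notebook may filter each measure separately.

```python
def textblob_scores(text):
    """Score one text with TextBlob's default analyzer (requires the textblob package)."""
    from textblob import TextBlob  # imported lazily so the filtering logic runs without it

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def score_tweets(texts, scorer=textblob_scores):
    """Return (text, polarity, subjectivity) rows, skipping rows with a 0 score.

    Assumption: a 0 in either measure is treated as "no classification".
    """
    rows = []
    for text in texts:
        polarity, subjectivity = scorer(text)
        if polarity == 0 or subjectivity == 0:
            continue
        rows.append((text, polarity, subjectivity))
    return rows


# Usage with TextBlob installed:
# rows = score_tweets(["I love this!", "See you at 5."])
```

Passing the scorer in as an argument keeps the zero-filtering easy to check with made-up scores.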

Let's now aggregate these values per day instead of looking at individual tweets. For both polarity and subjectivity we calculate the daily maximum, minimum and mean values along with the standard deviation.

In a next step we further smooth out these values by applying a 30-day rolling average to remove the impact of daily fluctuations.
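The daily aggregation and the 30-day smoothing could be sketched with pandas as below. The scores are invented, and `min_periods=1` is my choice so that the first days already produce a value. The notebook's own code may group and smooth differently.

```python
import pandas as pd

# Invented per-tweet polarity scores, indexed by tweet timestamp.
scores = pd.DataFrame(
    {"polarity": [0.5, -0.5, 1.0, 0.2]},
    index=pd.to_datetime(
        ["2018-05-10 08:00", "2018-05-10 20:00",
         "2018-05-11 09:00", "2018-05-12 10:00"],
        utc=True,
    ),
)

# One row per calendar day with the four summary statistics.
daily = scores["polarity"].resample("D").agg(["max", "min", "mean", "std"])

# Rolling average over 30 daily rows; min_periods=1 is an assumption
# that lets shorter windows at the start still yield a value.
smoothed = daily.rolling(window=30, min_periods=1).mean()

print(daily)
print(smoothed)
```

Note that `resample("D")` also creates rows for days without any tweets. Their statistics are NaN, and the rolling mean skips them.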

We can now pack all of this into two new dataframes for the subsequent plotting.


Let's start with the polarity. Looking at the mean, my tweets seem to be pretty neutral overall. Looking at the maximum/minimum polarity we see that positive & negative polarity are more or less balanced. Put in other words: For each mean-spirited tweet there's one full of praise ;-)


Looking at the subjectivity is a bit more interesting: It seems my tweets have grown less subjective over time. There are plenty of possible reasons: My active political career coming to a finish, growing a larger audience and thus tweeting more responsibly, or just plain growing old.

I guess we'll never know. Unless one of you has a good idea of how to use the Twitter archives to investigate this further. If you do: Hit me up on Twitter @gedankenstuecke.

Emoji usage

Let's have a look at the emoji usage next. To do this we loaded the emoji package at the top. It has a dictionary of (many) emoji, but not of all of them. Especially newer ones are bound to be absent, as are the multi-character flag emoji like 🇮🇱 🇭🇰 🇬🇷. But for a first look this list should be good enough. We write a small number_of_emoji function that, as expected, returns the number of emoji found in a single tweet. We can then apply this function to our twitter_data dataframe.
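A function like number_of_emoji could be sketched as follows. This assumes it checks one character (code point) at a time against a lookup, which would explain why multi-character flags are missed. `KNOWN_EMOJI` is a tiny hand-picked stand-in for the emoji package's dictionary, not the real thing.

```python
# A flag is two regional-indicator code points, not one character.
FLAG_GREECE = "\U0001F1EC\U0001F1F7"

# Tiny stand-in for the emoji package's dictionary (assumption: the real one is far larger).
# The flag is included on purpose to show that a per-character lookup still misses it.
KNOWN_EMOJI = {"😂", "😍", "🎉", "👍", FLAG_GREECE}


def number_of_emoji(text, known=KNOWN_EMOJI):
    """Count the single characters of `text` that appear in the lookup."""
    return sum(1 for char in text if char in known)


print(number_of_emoji("Done 🎉🎉 😂"))  # 3
print(number_of_emoji(FLAG_GREECE))     # 0: neither half of the flag is a known emoji on its own
```

On a dataframe this would typically be used as `twitter_data['text'].apply(number_of_emoji)`.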

We can now sum up over the number of emoji per day and then again apply a rolling average to minimize the influence of daily fluctuations. Once that's done we can make our plots 🎉


Let's also have a look what the most common emoji are. We iterate over all the tweets, and count the occurrences of each emoji. Ultimately we only look at those that appear at least 5 times.

😂 593
😍 265
🎉 217
💖 207
👍 155
😉 153
✈ 146
😊 121
😱 76
🐶 75
™ 69
♀ 53
🙏 41
😢 40
☺ 37
🤷 35
😭 34
✔ 34
🤔 33
👋 28
❤ 27
😘 22
🚗 22
🍆 21
🍄 21
⭐ 19
☕ 19
💩 19
😇 18
🍾 18
😔 18
🔥 18
📚 17
📊 16
🍩 16
😎 16
😴 16
🌟 16
💃 16
✅ 15
🐦 15
🍻 15
🌈 15
👌 14
🐼 14
🔬 14
👏 13
© 13
🍪 12
😀 12
🎊 11
🎈 11
🍦 11
💘 11
💓 11
☀ 10
🤦 10
🍌 10
🍰 10
🎄 10
🐧 10
🐢 10
💕 10
♥ 10
💜 9
🤓 9
🏃 9
🍨 9
🐳 9
🎸 9
💉 9
👀 9
✨ 8
🍿 8
😳 8
🍺 8
📉 8
😒 8
😞 8
🏳 8
🍜 8
😜 8
💤 8
✊ 8
🔮 8
☔ 7
🌍 7
🦄 7
🤞 7
🐻 7
🐿 7
🌊 7
🚂 7
🐌 7
😛 7
🍧 7
🙄 6
🚨 6
👇 6
🐱 6
💰 6
🐍 6
😥 6
👨 6
👓 6
🐛 6
🚙 6
🌋 6
🐴 6
☑ 6
🤘 5
🙃 5
🐈 5
👩 5
🎁 5
🍫 5
🍸 5
📈 5
🍉 5
💥 5
💸 5
🔨 5
🐰 5
🌱 5
💍 5
😋 5
🌲 5
👉 5
😕 5
💨 5
⚡ 5
🎯 4
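The counting behind a list like the one above can be sketched with `collections.Counter`. The helper name `common_emoji`, the lookup set and the sample tweets are invented for illustration. The per-character check is the same assumption as for number_of_emoji.

```python
from collections import Counter


def common_emoji(tweets, known, min_count=5):
    """Count known emoji over all tweets and keep those seen at least `min_count` times."""
    counts = Counter(
        char for tweet in tweets for char in tweet if char in known
    )
    # most_common() sorts by descending count.
    return [(char, n) for char, n in counts.most_common() if n >= min_count]


sample_tweets = ["😂😂😂", "😂😂 🎉", "🎉 ok", "👍"]
for char, n in common_emoji(sample_tweets, known={"😂", "🎉", "👍"}, min_count=2):
    print(char, n)
```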

Using this list of most common emoji I manually categorized some of them into sub-categories to see how these categories vary over time:

We can now apply a small function with these sub-categories to get the number of emoji for each tweet/category combination:
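A per-category counter could look like the sketch below. The category names and their members are hypothetical examples, not the manual grouping used in the notebook.

```python
# Hypothetical sub-categories; the notebook's manual grouping is not shown in the text.
CATEGORIES = {
    "joy": {"😂", "😊", "🎉"},
    "love": {"😍", "💖"},
}


def emoji_per_category(text, categories=CATEGORIES):
    """Return a dict mapping each category name to its emoji count in `text`."""
    return {
        name: sum(1 for char in text if char in members)
        for name, members in categories.items()
    }


print(emoji_per_category("So happy 😂🎉 💖"))
```

Returning one dict per tweet makes it straightforward to expand the result into one dataframe column per category.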

Let's get the daily sum of emoji for each of the categories and minimize fluctuations with a 90-day rolling average:
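With one column per category, the daily sum and the 90-day smoothing might be written as in this pandas sketch. The counts and column names are invented, and `min_periods=1` is my choice rather than something the text specifies.

```python
import pandas as pd

# Invented per-tweet emoji counts, one column per (hypothetical) category.
per_tweet = pd.DataFrame(
    {"joy": [2, 0, 1], "love": [0, 1, 0]},
    index=pd.to_datetime(
        ["2018-05-10 08:00", "2018-05-10 21:30", "2018-05-12 12:00"],
        utc=True,
    ),
)

# Total emoji per day and category; days without tweets sum to 0.
daily_sum = per_tweet.resample("D").sum()

# Rolling average over 90 daily rows (min_periods=1 is an assumption).
smoothed = daily_sum.rolling(window=90, min_periods=1).mean()

print(smoothed)
```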

Plotting the emoji usage we can see that I'm becoming a much more joyful person over time 😂 Though I should probably check on my loving tweets. 😉


