twitter-archive-text-mining-R.ipynb
This notebook uses the programming language R and data from a Twitter archive to explore how your tweeting habits — the words, hashtags, reply partners, and emoji you use — change over time.
This notebook performs some Twitter archive analysis based on a chapter of Text Mining with R. To use it, you need to have uploaded a Twitter archive into your Open Humans account through http://twarxiv.org. Initially you'll see data I've supplied; as you run each command, it will be replaced with your own data.
To start, let's install/load all the required packages:
library(purrr)
library(stringr)
library(tidytext)
library(widyr)
library(httr)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
Let's now get our access token and request our personal user object, which contains the download links for all of our data files:
access_token <- Sys.getenv("OH_ACCESS_TOKEN")
url <- paste("https://www.openhumans.org/api/direct-sharing/project/exchange-member/?access_token=",access_token,sep="")
resp <- GET(url)
user <- content(resp, "parsed")
If you want to look at the data sources you have on Open Humans, uncomment the line below by removing the #:
# user$data
Let's now find the download URL for the Twitter archive from all files:
for (data_source in user$data){
# "direct-sharing-70" is the source ID of the TwArxiv Twitter archive
if (data_source$source == "direct-sharing-70"){
twitter_archive_url <- data_source$download_url
}
}
We can now create a temporary file that will contain the whole zipped Twitter archive; from this we can then unzip and read the tweets.csv file:
temp <- tempfile()
download.file(twitter_archive_url,temp,method='wget')
#unzip(temp, list=TRUE) # this would list all files in the zip archive
data <- read_csv(unz(temp, "tweets.csv"))
Now let's convert the timestamps into a proper format and plot a simple histogram of tweets over time:
tweets <- mutate(data,timestamp = ymd_hms(timestamp))
ggplot(tweets, aes(x = timestamp)) +
geom_histogram(position = "identity", bins = 20, show.legend = FALSE) + theme_minimal()
We can now 'tokenize' (that is, break up into words) the tweet texts, which will make it easier to work with them. This also allows us to easily calculate word frequencies in the next step. This stage also passes words through a stopwords filter. Stopwords are those words which are frequently used (such as 'the', 'we', 'and', and 'I') but provide very little information, and it's common to filter them out during a textual analysis. Stopwords are language dependent, so you may want to change the language default. You can read more about the stopwords function here: https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/stopwords
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
filter(!str_detect(text, "^RT")) %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
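The stop_words dataset filtered on above is English-only. If your tweets are mostly in another language, one option — a sketch assuming tidytext >= 0.1.6, which wraps the stopwords package — is get_stopwords():

```r
library(tidytext)

# get_stopwords() returns a tibble with columns `word` and `lexicon`;
# `language` takes two-letter ISO codes, e.g. "de" for German
german_stops <- get_stopwords(language = "de")
head(german_stops$word)

# in the tokenization pipeline above you could then filter with:
# filter(!word %in% german_stops$word)
```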
frequency <- tidy_tweets %>%
count(word, sort = TRUE)
frequency$freq <- frequency$n / sum(frequency$n)
So, what are the top words that you've used? Mine are the German equivalents of I, all the articles (der, die, das), along with not and is.
head(frequency)
Let's now group the tweets into "old" ones (pre-2013) and more recent ones (2013 or newer) to see whether the topics you tweet about have changed. After grouping you can calculate the word ratios and look at the words most characteristic of each period:
cutoff_date <- as.Date("2013-01-01")
You can easily adapt the cutoff above to your own needs. Just replace 2013-01-01 with your own date in YYYY-MM-DD format.
library(tidyr)
tidy_tweets$date_group <- ifelse(tidy_tweets$timestamp < cutoff_date,"past","today")
word_ratios <- tidy_tweets %>%
filter(!str_detect(word, "^@")) %>%
count(word, date_group) %>%
group_by(word) %>% # group per word so the frequency filter below applies word by word
filter(sum(n) >= 10) %>%
ungroup() %>%
spread(date_group, n, fill = 0) %>%
mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
mutate(logratio = log(past / today)) %>%
arrange(desc(logratio))
word_ratios %>%
arrange(abs(logratio))
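To make the logratio column concrete, here is the same computation on a toy table with hypothetical counts for two words:

```r
library(dplyr)
library(tidyr)

# Hypothetical per-period counts for two words
toy <- tibble::tribble(
  ~word,      ~date_group, ~n,
  "archives", "past",       5,
  "archives", "today",      6,
  "piraten",  "past",      40,
  "piraten",  "today",      2
) %>%
  spread(date_group, n, fill = 0) %>%                 # one column per period
  mutate_if(is.numeric, ~ (. + 1) / sum(. + 1)) %>%   # add-one smoothing, normalize per period
  mutate(logratio = log(past / today))                # log odds ratio

toy
# piraten gets a large positive logratio (mostly used in the past),
# archives a negative one (relatively more common today).
```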
We can see that some things, e.g. archives, haven't changed much in frequency, as shown by a log odds ratio close to zero. Words with large positive values were more frequent in the past, while words with large negative values are more common in recent tweets. To make this more intuitive, we can now plot the top 15 words with the largest positive and negative log odds ratios:
word_ratios %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()
What we see in my data below: my activity in the Pirate Party (and living in the state of North Rhine-Westphalia, NRW for short) is clearly in the past; tweets about them occur mainly before 2013, as demonstrated by hashtags like #piraten, #lptnrw, #lmvnrw, #nrw etc. My activity in the Open Science world is clearly still going strong in comparison: #opencon, #mozfest, bosc201*, #csvconf etc. are overrepresented. When you run the analysis on your data, what trends do you see? Next, let's see what would happen if we excluded hashtags.
word_ratios %>%
filter(!str_detect(word, "^#")) %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()
In the analysis of my data, we see some changes: the older words are now somewhat associated with the German language/dialects (moin, kriegt, bildung, anstatt, neuer) and top-level domains that have fallen out of popularity (ly, fm), while the newer trends include words like yay, travels, happily, worries and travel destinations (portland, berkeley, iceland, zurich), fitting the increased travel compared to earlier times, along with topics like lichens, ggplot2, markov, and modern-ish inventions. emoji, anyone? 😂
How do your trends compare with what you expect? And now let's do the whole thing again, this time looking only at the people I replied to in the past compared to now:
tidy_tweets %>%
filter(str_detect(word, "^@")) %>%
filter(!str_detect(word, "^@ny")) %>%
count(word, date_group) %>%
group_by(word) %>% # group per word so the frequency filter below applies word by word
filter(sum(n) >= 10) %>%
ungroup() %>%
spread(date_group, n, fill = 0) %>%
mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
mutate(logratio = log(past / today)) %>%
arrange(desc(logratio)) %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()
In my data, I notice that the people from my past are largely other Pirate Party members. People from the present heavily feature the Open* & Quantified Self crowd at large (e.g. @o_guest, @sujaik, @protohedgehog, @kevinschawinski, @kaiblin, @eramirez), including some of the awesome people that run/ran Open Humans with me to make this possible (👋 @beaugunderson, @madprime, @betatim).
As a next step we can look into frequent pairings of emoji with individual words. For this we extract the emoji from all tweet texts and associate them with words, ignoring stop words, URLs etc.
emoji_tweets <- tweets %>%
filter(!str_detect(text, "^RT")) %>%
filter(!str_detect(text, "^@")) %>%
# the character ranges below match UTF-16 surrogate pairs, which is how
# emoji outside the Basic Multilingual Plane are encoded
filter(str_detect(text, "[\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+")) %>%
mutate(Emoji = str_extract_all(text,
"[\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+")) %>%
select(tweet_id, timestamp, Emoji, text)
emoji_tweets <- emoji_tweets %>%
select(-Emoji) %>%
unnest_tokens(word, text) %>%
left_join(emoji_tweets) %>%
mutate(Emoji = map_chr(Emoji, ~ ifelse(length(.x) > 0, .x[[1]], ""))) %>%
mutate(word = str_replace_all(word, "’", "'")) %>%
filter(!(Emoji %in% c("", "-"))) %>%
filter(!(word %in% c("t.co", "http",'https'))) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
emo_twids <- emoji_tweets %>%
select(tweet_id, Emoji) %>%
distinct() %>%
rename(word = Emoji)
emoji_tweets %>%
select(tweet_id, word) %>%
bind_rows(emo_twids) %>%
pairwise_count(word, tweet_id, sort = TRUE) %>%
filter(item1 %in% unique(emoji_tweets$Emoji)) %>%
group_by(item1) %>%
slice(1:2) %>%
ungroup() %>%
filter(nchar(item2) > 2) %>%
arrange(desc(n)) %>% head(n=10)
Looking at the top 10 emoji in my data shows that for me, #mozfest is more 😍, while #opencon is more 😂. And there are two words, fra and lhr, where the emoji can't be rendered by R. It would be ✈️, which associates with the IATA codes for Frankfurt airport and London Heathrow.
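If an emoji prints only as a placeholder, you can still identify it from its Unicode codepoints. A small sketch, using the airplane emoji mentioned above (U+2708 followed by the emoji presentation selector U+FE0F):

```r
# "\u2708\ufe0f" is ✈️: U+2708 AIRPLANE plus U+FE0F (variation selector)
plane <- "\u2708\ufe0f"

# utf8ToInt() returns the integer codepoints; sprintf() formats them
sprintf("U+%04X", utf8ToInt(plane))
# → "U+2708" "U+FE0F"
```

You can then look the codepoints up in the Unicode charts to see which emoji your environment failed to render.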
Check out your own emoji trends for clues into how your emoji use changes with context.