Recipe 5 Extracting a Retweet’s Origins

5.1 Problem

You want to extract the originating source from a retweet.

5.2 Solution

If the tweet’s retweet_count field is greater than 0, pull the referenced screen names out of the tweet’s mentions_screen_name field; you can also parse the text of the tweet with a regular expression.

5.3 Discussion

Twitter is pretty darn good about utilizing the data on its platform. There aren’t many cases nowadays when you need to parse RT or via in hand-crafted retweets, but it’s good to have the tools in your arsenal when needed. We can pick out all the retweets from #rstats (warning: it’s a retweet-heavy hashtag), and who they refer to, by using retweet_count and also by matching a regular expression (regex) against the tweet text and extracting the data that way.

First, the modern, API-centric way:

library(rtweet)
library(tidyverse)
rstats <- search_tweets("#rstats", n=500)
glimpse(rstats)
## Observations: 500
## Variables: 42
## $ status_id              <chr> "948262548570177536", "9482624955583692...
## $ created_at             <dttm> 2018-01-02 18:39:44, 2018-01-02 18:39:...
## $ user_id                <chr> "22462234", "22462234", "3367336625", "...
## $ screen_name            <chr> "bffo", "bffo", "HeathrTurnr", "_RCharl...
## $ text                   <chr> "RT @sellorm: Introducing the Field Gui...
## $ source                 <chr> "Twitter for iPhone", "Twitter for iPho...
## $ reply_to_status_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ reply_to_user_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ reply_to_screen_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ is_quote               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ is_retweet             <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TR...
## $ favorite_count         <int> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ retweet_count          <int> 112, 25, 2, 1, 1, 25, 8, 0, 112, 4, 112...
## $ hashtags               <list> ["rstats", "rstats", "rstats", <"rstat...
## $ symbols                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_url               <list> ["blog.sellorm.com/2018/01/01/fie\u202...
## $ urls_t.co              <list> ["https://t.co/Hfrs1fi74u", NA, NA, NA...
## $ urls_expanded_url      <list> ["http://blog.sellorm.com/2018/01/01/f...
## $ media_url              <list> [NA, NA, NA, NA, "http://pbs.twimg.com...
## $ media_t.co             <list> [NA, NA, NA, NA, "https://t.co/HpqJm7L...
## $ media_expanded_url     <list> [NA, NA, NA, NA, "https://twitter.com/...
## $ media_type             <list> [NA, NA, NA, NA, "photo", NA, NA, "pho...
## $ ext_media_url          <list> [NA, NA, NA, NA, "http://pbs.twimg.com...
## $ ext_media_t.co         <list> [NA, NA, NA, NA, "https://t.co/HpqJm7L...
## $ ext_media_expanded_url <list> [NA, NA, NA, NA, "https://twitter.com/...
## $ ext_media_type         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ mentions_user_id       <list> ["14351134", "2167059661", <"567537377...
## $ mentions_screen_name   <list> ["sellorm", "JennyBryan", <"KKulma", "...
## $ lang                   <chr> "en", "en", "en", "en", "en", "en", "en...
## $ quoted_status_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ retweet_status_id      <chr> "947909537859809281", "9479691476058849...
## $ retweet_text           <chr> "Introducing the Field Guide to the #rs...
## $ place_url              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_type             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ country                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ country_code           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ geo_coords             <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA...
## $ coords_coords          <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA...
## $ bbox_coords            <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA...
# keep tweets that have been retweeted, then expand the mentions list-column
filter(rstats, retweet_count > 0) %>%
  select(text, mentions_screen_name, retweet_count) %>%
  mutate(text = substr(text, 1, 30)) %>%
  unnest(mentions_screen_name)
## # A tibble: 505 x 3
##                              text retweet_count mentions_screen_name
##                             <chr>         <int>                <chr>
##  1 RT @sellorm: Introducing the F           112              sellorm
##  2 RT @JennyBryan: This overview             25           JennyBryan
##  3 RT @KKulma: Help us improve an             2               KKulma
##  4 RT @KKulma: Help us improve an             2        RLadiesLondon
##  5 RT @ma_salmon: My first #rstat             1            ma_salmon
##  6 RT @ma_salmon: My first #rstat             1         Thoughtfulnz
##  7 My first #rstats post of the y             1         Thoughtfulnz
##  8 RT @edzerpebesma: New #rspatia            25         edzerpebesma
##  9 RT @dataandme: R-code to find              8            dataandme
## 10 RT @dataandme: R-code to find              8              LVaudor
## # ... with 495 more rows

The text column was truncated for display. If you run the snippet and examine the result, you will see that it identifies the retweets and that the first screen name is usually the main reference; you also get all of the other screen names mentioned in the original tweet for free.
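Row 7 of the output above is not a retweet at all: retweet_count > 0 also matches original tweets that others have retweeted. The sketch below contrasts that filter with one on the is_retweet column seen in the glimpse() output. It uses a small hand-built tibble (hypothetical data that only mimics a few of the rtweet columns) so it runs without calling the API, and it assumes is_retweet is TRUE only for retweets.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for two rows of search_tweets() output
toy <- tibble(
  text = c(
    "RT @sellorm: Introducing the Field Guide",
    "My first #rstats post of the year"
  ),
  is_retweet = c(TRUE, FALSE),
  retweet_count = c(112L, 1L),
  mentions_screen_name = list("sellorm", "Thoughtfulnz")
)

by_count <- filter(toy, retweet_count > 0)  # keeps both rows
by_flag  <- filter(toy, is_retweet)         # keeps only the retweet

# expand the mentions list-column for the flagged retweets
origins <- by_flag %>%
  select(text, retweet_count, mentions_screen_name) %>%
  unnest(mentions_screen_name)
```

Which filter is appropriate depends on the question: retweet_count > 0 answers "which tweets got retweeted", while is_retweet answers "which tweets are themselves retweets".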

Here’s the brute-force way. A regular expression is used that matches the vast majority of retweet formats. The pattern looks for them, then extracts the first screen name found:

# regex mod from https://stackoverflow.com/questions/655903/python-regular-expression-for-retweets
rt_regex <- "(RT|via)((?:[[:blank:]:]\\W*@\\w+)+)"

# keep tweets whose text matches, then capture the mention(s) after RT/via
filter(rstats, str_detect(text, rt_regex)) %>%
  select(text, mentions_screen_name, retweet_count) %>%
  mutate(extracted = str_match(text, rt_regex)[, 3]) %>%
  mutate(text = substr(text, 1, 30)) %>%
  unnest(mentions_screen_name)
## # A tibble: 445 x 4
##                                         text retweet_count
##                                        <chr>         <int>
##  1            RT @rweekly_org: https://t.co/            23
##  2            RT @AnalyticsVidhya: What are              4
##  3 "RT @dataandme: Still a fave \U0001f4fd "             1
##  4            RT @sellorm: Introducing the F           108
##  5            RT @rstudio: Word Embeddings w             4
##  6            RT @sellorm: Introducing the F           108
##  7            RT @AnalyticsVidhya: What are              4
##  8            RT @StatGarrett: Thank you @co             3
##  9            RT @StatGarrett: Thank you @co             3
## 10            RT @StatGarrett: Thank you @co             3
## # ... with 435 more rows, and 2 more variables: extracted <chr>,
## #   mentions_screen_name <chr>

You should try the above snippets on other hashtags, as there will be cases where the regex picks up retweets that Twitter has failed to capture.
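To see what the pattern captures without hitting the API, here is a minimal sketch that runs it over three hand-written strings (hypothetical examples, not real tweets). The names rt_pattern, sample_text, and handles are illustrative. The third column of the str_match() result is the pattern's second capture group, which still carries the leading blank and @, so a follow-up str_extract() pulls out the bare handle.

```r
library(stringr)

rt_pattern <- "(RT|via)((?:[[:blank:]:]\\W*@\\w+)+)"

# Hypothetical tweet texts: a classic RT, a "via" credit, and a plain tweet
sample_text <- c(
  "RT @sellorm: Introducing the Field Guide",
  "Great overview of purrr (via @JennyBryan)",
  "My first #rstats post of the year"
)

m <- str_match(sample_text, rt_pattern)

# m[, 1] is the full match, m[, 2] is "RT" or "via",
# m[, 3] is the mention text, e.g. " @sellorm"
handles <- str_extract(m[, 3], "\\w+")
handles
## [1] "sellorm"    "JennyBryan" NA
```

The pattern is case-sensitive and unanchored, so it may also fire on an uppercase RT at the end of an unrelated word; treat its results as candidates rather than confirmed retweets.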

5.4 See Also