Recipe 5 Extracting a Retweet’s Origins

5.1 Problem

You want to extract the originating source from a retweet.

5.2 Solution

If the tweet’s retweet_count field is greater than 0, pull the referenced screen names out of the tweet’s mentions_screen_name field; you can also parse the text of the tweet with a regular expression.

5.3 Discussion

Twitter is pretty darn good about ~~weaponizing~~ utilizing the data on its platform. There aren’t many cases nowadays when you need to parse RT or via in hand-crafted retweets, but it’s good to have the tools in your arsenal when needed. We can pick out all the retweets from #rstats (warning: it’s a retweet-heavy hashtag), along with who they refer to, in two ways: by checking retweet_count, and by matching a regular expression (regex) against the tweet text and extracting the data that way.

First, the modern, API-centric way:

library(rtweet)
library(tidyverse)

rstats <- search_tweets("#rstats", n=500)

glimpse(rstats)

## Observations: 500
## Variables: 87
## $ user_id                 <chr> "2199421213", "744962021121089538", "8...
## $ status_id               <chr> "994211013644898304", "994211023715491...
## $ created_at              <dttm> 2018-05-09 13:42:31, 2018-05-09 13:42...
## $ screen_name             <chr> "JunsongW", "jwiggi21", "RGLoreto", "J...
## $ text                    <chr> "RT @dataandme: ICYMI, a classic how-t...
## $ source                  <chr> "Twitter Lite", "Twitter Lite", "Twitt...
## $ display_text_width      <dbl> 144, 222, 140, 140, 93, 144, 140, 194,...
## $ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ is_quote                <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ is_retweet              <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ favorite_count          <int> 0, 2, 0, 0, 0, 0, 0, 36, 0, 0, 1, 0, 0...
## $ retweet_count           <int> 30, 0, 54, 14, 34, 14, 10, 10, 14, 18,...
## $ hashtags                <list> ["rstats", "rstats", NA, <"BigData", ...
## $ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ urls_url                <list> ["buff.ly/2GuxS2K", "twitter.com/BES_...
## $ urls_t.co               <list> ["https://t.co/0c0kwsFQIb", "https://...
## $ urls_expanded_url       <list> ["https://buff.ly/2GuxS2K", "https://...
## $ media_url               <list> [NA, NA, NA, NA, NA, NA, NA, "http://...
## $ media_t.co              <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ media_expanded_url      <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ media_type              <list> [NA, NA, NA, NA, NA, NA, NA, "photo",...
## $ ext_media_url           <list> [NA, NA, NA, NA, NA, NA, NA, "http://...
## $ ext_media_t.co          <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ ext_media_expanded_url  <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mentions_user_id        <list> [<"3230388598", "2802843016">, NA, "2...
## $ mentions_screen_name    <list> [<"dataandme", "MarkusGesmann">, NA, ...
## $ lang                    <chr> "en", "en", "en", "en", "in", "en", "e...
## $ quoted_status_id        <chr> NA, "994186582012977152", NA, NA, NA, ...
## $ quoted_text             <chr> NA, "3 days left. Tickets still availa...
## $ quoted_created_at       <dttm> NA, 2018-05-09 12:05:26, NA, NA, NA, ...
## $ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_favorite_count   <int> NA, 5, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_retweet_count    <int> NA, 8, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_user_id          <chr> NA, "1064189490", NA, NA, NA, NA, NA, ...
## $ quoted_screen_name      <chr> NA, "BES_QE_SIG", NA, NA, NA, NA, NA, ...
## $ quoted_name             <chr> NA, "BES Quant. Ecology", NA, NA, NA, ...
## $ quoted_followers_count  <int> NA, 2482, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_friends_count    <int> NA, 884, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_statuses_count   <int> NA, 1646, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_location         <chr> NA, "United Kingdom", NA, NA, NA, NA, ...
## $ quoted_description      <chr> NA, "The Quantitative Ecology group of...
## $ quoted_verified         <lgl> NA, FALSE, NA, NA, NA, NA, NA, NA, NA,...
## $ retweet_status_id       <chr> "993925006991208448", NA, "99397477452...
## $ retweet_text            <chr> "ICYMI, a classic how-to w/ code &amp;...
## $ retweet_created_at      <dttm> 2018-05-08 18:46:02, NA, 2018-05-08 2...
## $ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ retweet_favorite_count  <int> 121, NA, 135, 3, 139, 21, 67, NA, 21, ...
## $ retweet_retweet_count   <int> 30, NA, 54, 14, 34, 14, 10, NA, 14, 18...
## $ retweet_user_id         <chr> "3230388598", NA, "2544177302", "42630...
## $ retweet_screen_name     <chr> "dataandme", NA, "patilindrajeets", "g...
## $ retweet_name            <chr> "Mara Averick", NA, "Indrajeet Patil",...
## $ retweet_followers_count <int> 24093, NA, 1769, 40544, 58409, 5813, 2...
## $ retweet_friends_count   <int> 2947, NA, 831, 21794, 36, 1328, 2947, ...
## $ retweet_statuses_count  <int> 26660, NA, 4823, 38809, 17358, 4386, 2...
## $ retweet_location        <chr> "Massachusetts", NA, "Cambridge, MA", ...
## $ retweet_description     <chr> "tidyverse dev advocate @rstudio #rsta...
## $ retweet_verified        <lgl> FALSE, NA, FALSE, FALSE, FALSE, FALSE,...
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ geo_coords              <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <N...
## $ name                    <chr> "Junsong Wang", "Jodie Wiggins", "Raqu...
## $ location                <chr> "People's Republic of China", "Stillwa...
## $ description             <chr> "", "Wife, mom, evolutionary ecologist...
## $ url                     <chr> NA, "https://t.co/0rrln5NIEU", NA, NA,...
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ followers_count         <int> 81, 334, 327, 407, 81, 788, 81, 2192, ...
## $ friends_count           <int> 647, 446, 197, 54, 647, 1191, 647, 599...
## $ listed_count            <int> 54, 19, 10, 29, 54, 36, 54, 209, 14, 5...
## $ statuses_count          <int> 7597, 2896, 599, 26143, 7597, 9763, 75...
## $ favourites_count        <int> 9, 2887, 1891, 8109, 9, 2135, 9, 4328,...
## $ account_created_at      <dttm> 2013-11-17 12:11:18, 2016-06-20 18:36...
## $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ profile_url             <chr> NA, "https://t.co/0rrln5NIEU", NA, NA,...
## $ profile_expanded_url    <chr> NA, "http://jodiewiggins.wix.com/mysit...
## $ account_lang            <chr> "en", "en", "en", "ja", "en", "en", "e...
## $ profile_banner_url      <chr> NA, "https://pbs.twimg.com/profile_ban...
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/th...
## $ profile_image_url       <chr> "http://abs.twimg.com/sticky/default_p...

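# Keep tweets with at least one retweet, then expand the list-column
# of mentioned screen names to one row per mention
filter(rstats, retweet_count > 0) %>% 
  select(text, mentions_screen_name, retweet_count) %>% 
  mutate(text = substr(text, 1, 30)) %>% 
  unnest(mentions_screen_name)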

## # A tibble: 711 x 3
##    text                             retweet_count mentions_screen_name
##    <chr>                                    <int> <chr>               
##  1 RT @dataandme: ICYMI, a classi              30 dataandme           
##  2 RT @dataandme: ICYMI, a classi              30 MarkusGesmann       
##  3 RT @patilindrajeets: The R pac              54 patilindrajeets     
##  4 "RT @gp_pulipaka: Learn How to "            14 gp_pulipaka         
##  5 RT @Rbloggers: Tidying messy E              34 Rbloggers           
##  6 RT @EcographyJourna: SIDER: an              14 EcographyJourna     
##  7 RT @dataandme: .@grrrck contin              10 dataandme           
##  8 RT @dataandme: .@grrrck contin              10 grrrck              
##  9 Just published an update to th              10 <NA>                
## 10 RT @EcographyJourna: SIDER: an              14 EcographyJourna     
## # ... with 701 more rows

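The text column was truncated for display. Run the snippet and examine the full result: it identifies the retweets, and the first screen name is usually the original author, but you also get every other screen name mentioned in the original tweet for free. Note that retweet_count > 0 also keeps original tweets that other people have retweeted (row 9 above), which is why mentions_screen_name can be NA.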
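The glimpse() output above also lists is_retweet and retweet_screen_name columns, which point at the originating account more directly than retweet_count does. The following is a minimal sketch of that idea, not part of the original recipe: sample_tweets is a hand-built stand-in for rstats that reuses the first three rows shown in the glimpse() output, and the names sample_tweets, origins, retweeter and origin are illustrative only.

```r
library(tibble)
library(dplyr)

# Stand-in for `rstats`: first three rows from the glimpse() output above
sample_tweets <- tibble(
  screen_name         = c("JunsongW", "jwiggi21", "RGLoreto"),
  is_retweet          = c(TRUE, FALSE, TRUE),
  retweet_count       = c(30L, 0L, 54L),
  retweet_screen_name = c("dataandme", NA, "patilindrajeets")
)

# Keep only native retweets and pair each retweeter with the original author
origins <- sample_tweets %>%
  filter(is_retweet) %>%
  select(retweeter = screen_name, origin = retweet_screen_name, retweet_count)

origins
```

Swapping sample_tweets for the real rstats data frame should work the same way, assuming your version of rtweet returns these column names.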

Here’s the brute-force way. The regular expression below matches the vast majority of hand-crafted retweet formats: it looks for RT or via and then extracts the run of one or more @screen_names that immediately follows:

# regex mod from https://stackoverflow.com/questions/655903/python-regular-expression-for-retweets
# "RT" or "via", followed by one or more @screen_names
rt_regex <- "(RT|via)((?:[[:blank:]:]\\W*@\\w+)+)"

filter(rstats, str_detect(text, rt_regex)) %>% 
  select(text, mentions_screen_name, retweet_count) %>% 
  mutate(extracted = str_match(text, rt_regex)[,3]) %>%  # second capture group: the @mention run
  mutate(text = substr(text, 1, 30)) %>% 
  unnest(mentions_screen_name)

## # A tibble: 614 x 4
##    text                 retweet_count extracted         mentions_screen_n…
##    <chr>                        <int> <chr>             <chr>             
##  1 RT @dataandme: ICYM…            30 " @dataandme"     dataandme         
##  2 RT @dataandme: ICYM…            30 " @dataandme"     MarkusGesmann     
##  3 RT @patilindrajeets…            54 " @patilindrajee… patilindrajeets   
##  4 "RT @gp_pulipaka: L…            14 " @gp_pulipaka"   gp_pulipaka       
##  5 RT @Rbloggers: Tidy…            34 " @Rbloggers"     Rbloggers         
##  6 RT @EcographyJourna…            14 " @EcographyJour… EcographyJourna   
##  7 RT @dataandme: .@gr…            10 " @dataandme: .@… dataandme         
##  8 RT @dataandme: .@gr…            10 " @dataandme: .@… grrrck            
##  9 RT @EcographyJourna…            14 " @EcographyJour… EcographyJourna   
## 10 RT @Rbloggers: Open…            18 " @Rbloggers"     Rbloggers         
## # ... with 604 more rows
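The extracted column keeps the leading space and the @, and it can hold more than one screen name when mentions are chained (rows 7 and 8). If you only want the first screen name, one option is a narrower pattern with a single capture group. This is a sketch under assumptions: the pattern is a simplification of the one above, and texts holds short made-up strings rather than real API results.

```r
library(stringr)

# Illustrative tweet texts (hypothetical, not taken from the API)
texts <- c(
  "RT @dataandme: ICYMI",
  "Handy tip via @Rbloggers",
  "Just published an update"
)

# "RT" or "via", a blank or colon, optional punctuation, then capture
# the screen name that follows the first "@"
first_origin <- str_match(texts, "(?:RT|via)[[:blank:]:]\\W*@(\\w+)")[, 2]

first_origin
## [1] "dataandme" "Rbloggers" NA
```

Texts without an RT or via marker come back as NA, so the result can be filtered with !is.na().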

Try the snippets above on other hashtags; there will be cases where the regex picks up retweets that Twitter has failed to capture.
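One way to surface those cases is to keep only the tweets that match the regex but are not flagged by the API. The sketch below assumes the is_retweet column from the glimpse() output is the API's retweet flag; sample_tweets and manual_rts are hypothetical names, and the rows are made up for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

# Same pattern as the brute-force snippet above
rt_pattern <- "(RT|via)((?:[[:blank:]:]\\W*@\\w+)+)"

# Hypothetical sample: one native retweet, one hand-crafted "via", one plain tweet
sample_tweets <- tibble(
  text       = c("RT @dataandme: ICYMI",
                 "Handy tip via @Rbloggers",
                 "Just published an update"),
  is_retweet = c(TRUE, FALSE, FALSE)
)

# Tweets that look like retweets to the regex but not to the API
manual_rts <- sample_tweets %>%
  filter(!is_retweet, str_detect(text, rt_pattern))

manual_rts
```

On the sample data only the "via" tweet survives the filter; on real data, expect some false positives, since any text containing RT or via followed by a mention will match.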

5.4 See Also

Twitter’s official documentation on what happens to retweets when the original tweets are deleted