Recipe 5 Extracting a Retweet’s Origins
5.1 Problem
You want to extract the originating source from a retweet.
5.2 Solution
If the tweet’s retweet_count field is greater than 0, extract name out of the tweet’s user field; also parse the text of the tweet with a regular expression.
5.3 Discussion
Twitter is pretty darn good about utilizing the data on its platform. There aren’t many cases nowadays when you need to parse RT or via in hand-crafted retweets, but it’s good to have the tools in your arsenal when needed. We can pick out all the retweets from #rstats (warning: it’s a retweet-heavy hashtag) and who they refer to by using the retweet_count field, and also by matching a regular expression (regex) against the tweet text and extracting data that way.
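First, the modern, API-centric way. The listing that follows is glimpse() output for rstats, a data frame of 500 tweets from the #rstats hashtag: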
## Observations: 500
## Variables: 87
## $ user_id <chr> "2199421213", "744962021121089538", "8...
## $ status_id <chr> "994211013644898304", "994211023715491...
## $ created_at <dttm> 2018-05-09 13:42:31, 2018-05-09 13:42...
## $ screen_name <chr> "JunsongW", "jwiggi21", "RGLoreto", "J...
## $ text <chr> "RT @dataandme: ICYMI, a classic how-t...
## $ source <chr> "Twitter Lite", "Twitter Lite", "Twitt...
## $ display_text_width <dbl> 144, 222, 140, 140, 93, 144, 140, 194,...
## $ reply_to_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ is_quote <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ is_retweet <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ favorite_count <int> 0, 2, 0, 0, 0, 0, 0, 36, 0, 0, 1, 0, 0...
## $ retweet_count <int> 30, 0, 54, 14, 34, 14, 10, 10, 14, 18,...
## $ hashtags <list> ["rstats", "rstats", NA, <"BigData", ...
## $ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ urls_url <list> ["buff.ly/2GuxS2K", "twitter.com/BES_...
## $ urls_t.co <list> ["https://t.co/0c0kwsFQIb", "https://...
## $ urls_expanded_url <list> ["https://buff.ly/2GuxS2K", "https://...
## $ media_url <list> [NA, NA, NA, NA, NA, NA, NA, "http://...
## $ media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ media_type <list> [NA, NA, NA, NA, NA, NA, NA, "photo",...
## $ ext_media_url <list> [NA, NA, NA, NA, NA, NA, NA, "http://...
## $ ext_media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ ext_media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, "https:/...
## $ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mentions_user_id <list> [<"3230388598", "2802843016">, NA, "2...
## $ mentions_screen_name <list> [<"dataandme", "MarkusGesmann">, NA, ...
## $ lang <chr> "en", "en", "en", "en", "in", "en", "e...
## $ quoted_status_id <chr> NA, "994186582012977152", NA, NA, NA, ...
## $ quoted_text <chr> NA, "3 days left. Tickets still availa...
## $ quoted_created_at <dttm> NA, 2018-05-09 12:05:26, NA, NA, NA, ...
## $ quoted_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_favorite_count <int> NA, 5, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_retweet_count <int> NA, 8, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_user_id <chr> NA, "1064189490", NA, NA, NA, NA, NA, ...
## $ quoted_screen_name <chr> NA, "BES_QE_SIG", NA, NA, NA, NA, NA, ...
## $ quoted_name <chr> NA, "BES Quant. Ecology", NA, NA, NA, ...
## $ quoted_followers_count <int> NA, 2482, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_friends_count <int> NA, 884, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_statuses_count <int> NA, 1646, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_location <chr> NA, "United Kingdom", NA, NA, NA, NA, ...
## $ quoted_description <chr> NA, "The Quantitative Ecology group of...
## $ quoted_verified <lgl> NA, FALSE, NA, NA, NA, NA, NA, NA, NA,...
## $ retweet_status_id <chr> "993925006991208448", NA, "99397477452...
## $ retweet_text <chr> "ICYMI, a classic how-to w/ code &...
## $ retweet_created_at <dttm> 2018-05-08 18:46:02, NA, 2018-05-08 2...
## $ retweet_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ retweet_favorite_count <int> 121, NA, 135, 3, 139, 21, 67, NA, 21, ...
## $ retweet_retweet_count <int> 30, NA, 54, 14, 34, 14, 10, NA, 14, 18...
## $ retweet_user_id <chr> "3230388598", NA, "2544177302", "42630...
## $ retweet_screen_name <chr> "dataandme", NA, "patilindrajeets", "g...
## $ retweet_name <chr> "Mara Averick", NA, "Indrajeet Patil",...
## $ retweet_followers_count <int> 24093, NA, 1769, 40544, 58409, 5813, 2...
## $ retweet_friends_count <int> 2947, NA, 831, 21794, 36, 1328, 2947, ...
## $ retweet_statuses_count <int> 26660, NA, 4823, 38809, 17358, 4386, 2...
## $ retweet_location <chr> "Massachusetts", NA, "Cambridge, MA", ...
## $ retweet_description <chr> "tidyverse dev advocate @rstudio #rsta...
## $ retweet_verified <lgl> FALSE, NA, FALSE, FALSE, FALSE, FALSE,...
## $ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ geo_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ bbox_coords <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <N...
## $ name <chr> "Junsong Wang", "Jodie Wiggins", "Raqu...
## $ location <chr> "People's Republic of China", "Stillwa...
## $ description <chr> "", "Wife, mom, evolutionary ecologist...
## $ url <chr> NA, "https://t.co/0rrln5NIEU", NA, NA,...
## $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ followers_count <int> 81, 334, 327, 407, 81, 788, 81, 2192, ...
## $ friends_count <int> 647, 446, 197, 54, 647, 1191, 647, 599...
## $ listed_count <int> 54, 19, 10, 29, 54, 36, 54, 209, 14, 5...
## $ statuses_count <int> 7597, 2896, 599, 26143, 7597, 9763, 75...
## $ favourites_count <int> 9, 2887, 1891, 8109, 9, 2135, 9, 4328,...
## $ account_created_at <dttm> 2013-11-17 12:11:18, 2016-06-20 18:36...
## $ verified <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ profile_url <chr> NA, "https://t.co/0rrln5NIEU", NA, NA,...
## $ profile_expanded_url <chr> NA, "http://jodiewiggins.wix.com/mysit...
## $ account_lang <chr> "en", "en", "en", "ja", "en", "en", "e...
## $ profile_banner_url <chr> NA, "https://pbs.twimg.com/profile_ban...
## $ profile_background_url <chr> "http://abs.twimg.com/images/themes/th...
## $ profile_image_url <chr> "http://abs.twimg.com/sticky/default_p...
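library(tidyverse)

# keep tweets with a nonzero retweet count, then emit one row per mentioned screen name
filter(rstats, retweet_count > 0) %>%
  select(text, mentions_screen_name, retweet_count) %>%
  mutate(text = substr(text, 1, 30)) %>%
  unnest(mentions_screen_name)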
## # A tibble: 711 x 3
## text retweet_count mentions_screen_name
## <chr> <int> <chr>
## 1 RT @dataandme: ICYMI, a classi 30 dataandme
## 2 RT @dataandme: ICYMI, a classi 30 MarkusGesmann
## 3 RT @patilindrajeets: The R pac 54 patilindrajeets
## 4 "RT @gp_pulipaka: Learn How to " 14 gp_pulipaka
## 5 RT @Rbloggers: Tidying messy E 34 Rbloggers
## 6 RT @EcographyJourna: SIDER: an 14 EcographyJourna
## 7 RT @dataandme: .@grrrck contin 10 dataandme
## 8 RT @dataandme: .@grrrck contin 10 grrrck
## 9 Just published an update to th 10 <NA>
## 10 RT @EcographyJourna: SIDER: an 14 EcographyJourna
## # ... with 701 more rows
The text column was pared down for display brevity. If you run that code snippet, you can see that it identifies the retweets and that the first screen name is usually the main reference, but you get all of the screen names from the original tweet for free.
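The glimpse() listing also shows is_retweet and retweet_screen_name columns, which point at the original author more directly than a retweet_count filter does (row 9 of the output above has a retweet count of 10 but does not appear to be a retweet). A minimal sketch, assuming those columns are populated as in the listing; rstats_toy is a hand-built stand-in with three rows modeled on the output above, not real API results:

```r
library(dplyr)

# Hypothetical toy data modeled on the output above; the is_retweet value for
# the third row is an assumption based on its missing mention.
rstats_toy <- tibble(
  text = c(
    "RT @dataandme: ICYMI, a classi",
    "RT @patilindrajeets: The R pac",
    "Just published an update to th"
  ),
  is_retweet = c(TRUE, TRUE, FALSE),
  retweet_count = c(30L, 54L, 10L),
  retweet_screen_name = c("dataandme", "patilindrajeets", NA)
)

# Keep only true retweets and read the originating account from its own column
origins <- rstats_toy %>%
  filter(is_retweet) %>%
  select(text, retweet_screen_name, retweet_count)

origins
```

Swapping rstats_toy for the real rstats data frame should give one row per retweet, with the original author in retweet_screen_name and no need to unnest the mentions.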
Here’s the brute-force way. A regular expression is used that matches the vast majority of retweet formats. The pattern looks for an RT or via marker, then extracts the run of screen names that follows it:
# regex mod from https://stackoverflow.com/questions/655903/python-regular-expression-for-retweets
rt_pattern <- "(RT|via)((?:[[:blank:]:]\\W*@\\w+)+)"

filter(rstats, str_detect(text, rt_pattern)) %>%
  select(text, mentions_screen_name, retweet_count) %>%
  # column 3 of the match matrix is the second capture group: the @mentions after RT/via
  mutate(extracted = str_match(text, rt_pattern)[, 3]) %>%
  mutate(text = substr(text, 1, 30)) %>%
  unnest(mentions_screen_name)
## # A tibble: 614 x 4
## text retweet_count extracted mentions_screen_n…
## <chr> <int> <chr> <chr>
## 1 RT @dataandme: ICYM… 30 " @dataandme" dataandme
## 2 RT @dataandme: ICYM… 30 " @dataandme" MarkusGesmann
## 3 RT @patilindrajeets… 54 " @patilindrajee… patilindrajeets
## 4 "RT @gp_pulipaka: L… 14 " @gp_pulipaka" gp_pulipaka
## 5 RT @Rbloggers: Tidy… 34 " @Rbloggers" Rbloggers
## 6 RT @EcographyJourna… 14 " @EcographyJour… EcographyJourna
## 7 RT @dataandme: .@gr… 10 " @dataandme: .@… dataandme
## 8 RT @dataandme: .@gr… 10 " @dataandme: .@… grrrck
## 9 RT @EcographyJourna… 14 " @EcographyJour… EcographyJourna
## 10 RT @Rbloggers: Open… 18 " @Rbloggers" Rbloggers
## # ... with 604 more rows
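The extracted column above keeps the leading space and, in rows 7 and 8, runs on into a second mention. If only the first screen name is wanted, one option is to capture it on its own. This is a sketch under assumptions, not the recipe's pattern: the regex is a trimmed variant of the one above, the sample strings are hypothetical, and example_user is a made-up screen name:

```r
library(stringr)

# Hypothetical sample texts: the first and third follow shapes seen in the
# output above; the "via" tweet and its screen name are invented for illustration.
texts <- c(
  "RT @dataandme: .@grrrck contin",
  "Handy tidying tips via @example_user",
  "Just published an update to th"
)

# Non-capturing (?:RT|via), one blank or colon, optional punctuation, then
# capture only the first screen name, without the leading "@"
first_rt_name <- str_match(texts, "\\b(?:RT|via)[[:blank:]:]\\W*@(\\w+)")[, 2]

first_rt_name
```

Texts with no RT or via marker yield NA, so the result can be compared against mentions_screen_name or retweet_screen_name to spot hand-crafted retweets.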
You should try the above snippets on other hashtags, as there will be cases where the regex picks up retweets that Twitter has failed to capture.
5.4 See Also
- Twitter official documentation on what happens to retweets when origin tweets are deleted