17 Day 15: Names

17.1 Technologies/Techniques

Scraping data with R for use in a map context
Creating a silhouette of Maine for use in making a Maine-shaped word cloud
Prepping final data for use in another context

17.2 Data Source: Baby Names

The U.S. Social Security Administration maintains a database of historically popular baby names by state and has put it online⁵³. While there may be a data file for this out there somewhere, I just threw together a quick scraper for it rather than spend time googling.

library(sf)
library(rvest)
library(hrbrthemes)
library(curlconverter)
library(tidyverse)

This is similar to what we’ve done in a previous challenge. We make a custom scraping function from a captured cURL request and then grab and cache the data.

if (!file.exists(here::here("data/maine-names.rds"))) {

  cURL <- "curl 'https://www.ssa.gov/cgi-bin/namesbystate.cgi' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' -H 'DNT: 1' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.36 Safari/537.36' -H 'Sec-Fetch-User: ?1' -H 'Origin: https://www.ssa.gov' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Sec-Fetch-Site: same-origin' -H 'Sec-Fetch-Mode: navigate' -H 'Referer: https://www.ssa.gov/cgi-bin/namesbystate.cgi' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.9,la;q=0.8' -H 'Cookie: TS014661b0=01cfd667a591a309251d155252645f7ba8d3e92ca313a0cc50797331d66b9eca4cda5b1bf09262fc2dd204124dd994ed097d2b0147; TS01838516=017e2f91c304e1561be5e212ee69533011530d7386dacfbaf9306ecd64d33947273a6a054c612f05721de9c9378d41fde9c08cf713' --data 'state=ME&year=2017' --compressed"

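  # Pass the captured command explicitly; a bare straighten() would read the clipboard instead.
  # (req is not used below; get_names() issues a trimmed-down POST by hand.)
  straighten(cURL) %>%
    make_req() -> req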

  get_names <- function(yr = 2018) {

    httr::POST(
      url = "https://www.ssa.gov/cgi-bin/namesbystate.cgi",
      body = list(
        state = "ME",
        year = as.character(yr)
      ),
      encode = "form"
    ) -> res

    # Parse the response body as HTML
    out <- httr::content(res, as = "parsed", encoding = "UTF-8")

    # The names table is picked out by its bordercolor attribute
    html_node(out, xpath = ".//table[@bordercolor = '#aaabbb']") %>%
      html_table(header = TRUE, trim = TRUE) %>%
      as_tibble() %>%
      janitor::clean_names() %>%
      mutate(year = yr)

  }

  maine_names <- map_df(1960:2018, get_names)

  saveRDS(maine_names, here::here("data/maine-names.rds"))

}

maine_names <- readRDS(here::here("data/maine-names.rds"))

glimpse(maine_names)
## Observations: 5,900
## Variables: 6
## $ rank              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ male_name         <chr> "David", "Michael", "Robert", "James", "John", "Mar…
## $ number_of_males   <int> 590, 582, 405, 390, 378, 362, 344, 243, 235, 212, 2…
## $ female_name       <chr> "Susan", "Linda", "Brenda", "Karen", "Donna", "Lisa…
## $ number_of_females <int> 280, 227, 224, 221, 220, 218, 210, 205, 166, 163, 1…
## $ year              <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 196…
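
The check-then-cache idiom above can be pulled out into a small reusable helper. This is only a minimal sketch, not part of the original code: `cached_rds()` and its arguments are hypothetical names, and the toy `fake_scrape()` function stands in for the real scraper so the example runs offline.

```r
# Hypothetical helper: run `compute` only when `path` does not exist yet,
# then always read the result back from disk.
cached_rds <- function(path, compute) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    saveRDS(compute(), path)
  }
  readRDS(path)
}

# Toy stand-in for the scraper; it counts how often it is called.
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  data.frame(rank = 1:2, male_name = c("David", "Michael"))
}

cache_file <- file.path(tempdir(), "cache-demo", "names.rds")
unlink(cache_file) # start from a clean slate for the demo

first  <- cached_rds(cache_file, fake_scrape) # computes and saves
second <- cached_rds(cache_file, fake_scrape) # reads the cached copy
```

After both calls `calls` is still 1, since the second call never touches the "scraper". Deleting the cache file is the way to force a fresh scrape.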

17.3 Prepping Our Word Cloud Map and Data

We’re going to kinda cheat and use WordArt.com⁵⁴ for our final product since they have a nicer word cloud generator than anything available in R or Python (I include the latter since we can easily use icky Python from R when necessary).

We’re going to use one of the more interesting features of WordArt.com and have the word cloud take the shape of an input silhouette image. In our case, this is Maine. We’re just making a solid black Maine image and writing it out:

if (!file.exists(here::here("img/me-silhouette.png"))) {

  st_read(here::here("data/me-counties.json")) %>%
    st_set_crs(4326) -> maine

  ggplot() +
    geom_sf(data = maine, fill = "black", color = "black") +
    coord_sf(datum = NA) +
    theme_ipsum_es(grid="") +
    theme(axis.text = element_blank()) -> gg

  ggsave(here::here("img/me-silhouette.png"), plot = gg, width = 500/72, height = 500/72, dpi = 96)
  
}
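
One thing to keep in mind with the `ggsave()` call above: the output size in pixels is the width in inches times `dpi`, so `500/72` inches at 96 dpi comes out wider than 500 pixels. Here is a quick sketch of the arithmetic, in case an exactly 500-pixel image is wanted. `px_size()` is a hypothetical helper, not a ggplot2 function.

```r
# Hypothetical helper: pixel dimension = inches * dots per inch
px_size <- function(width_in, dpi) width_in * dpi

as_saved   <- px_size(500 / 72, 96) # about 666.7, so roughly 667 px wide
match_dpi  <- px_size(500 / 96, 96) # 500 px: divide by the dpi actually used
match_inch <- px_size(500 / 72, 72) # 500 px: or set dpi to match the divisor
```

Either variant should give a 500 by 500 image. The larger file is probably fine for a silhouette, since only the shape matters.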

The WordArt.com site lets us specify colors as well as frequency, so we can use a different palette for each of the two SSA categories (male and female names) and compute each color from the name's total count across all the years we scraped:

maine_names %>%
  {
    bind_rows(
      select(., name = male_name, ct = number_of_males) %>%
        count(name, wt = ct, name = "ct") %>%
        mutate(color = scales::brewer_pal(palette = "PuBu")(9)[cut(ct, 9)]),
      select(., name = female_name, ct = number_of_females) %>%
        count(name, wt = ct, name = "ct") %>%
        mutate(color = scales::brewer_pal(palette="OrRd")(9)[cut(ct, 9)])
    ) %>%
      arrange(desc(ct))
  } %>%
  write_delim(here::here("data/wordart.csv"), delim=";", col_names = FALSE) -> wc_df

glimpse(wc_df)
## Observations: 780
## Variables: 3
## $ name  <chr> "Michael", "Christopher", "David", "James", "Matthew", "John", …
## $ ct    <int> 13421, 9232, 9140, 8306, 8205, 7927, 7520, 6879, 6399, 6205, 60…
## $ color <chr> "#023858", "#0570B0", "#0570B0", "#3690C0", "#3690C0", "#3690C0…
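
The color assignment works because indexing a character vector with the factor returned by `cut()` uses the factor's integer bin codes. Here is a toy sketch with made-up counts and a three-entry placeholder palette (neither comes from the SSA data):

```r
counts <- c(5, 20, 50, 95, 100)
pal    <- c("light", "mid", "dark") # placeholders for real hex colors

bins <- cut(counts, 3)     # three equal-width bins spanning range(counts)
bin_codes <- as.integer(bins) # 1 1 2 3 3

colors <- pal[bins]        # the factor's codes pick the palette entry
colors                     # "light" "light" "mid" "dark" "dark"
```

Because the bins are equal width, a skewed set of counts will put most names in the lightest bins and only a handful in the darkest. Quantile-based breaks would be one alternative if a more even spread of colors is wanted.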

17.4 Drawing the Map

This is really “upload the PNG and CSV to WordArt.com and play with remaining aesthetics”. It’s great fun!
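
Before uploading, it can help to peek at the file and confirm it is semicolon-delimited with no header row, which is what the `write_delim()` call above asks for. Here is a small base R sketch that uses the first three rows from the `glimpse()` output and writes to a temporary file rather than the real `data/wordart.csv`:

```r
demo <- data.frame(
  name  = c("Michael", "Christopher", "David"),
  ct    = c(13421L, 9232L, 9140L),
  color = c("#023858", "#0570B0", "#0570B0")
)

tmp <- tempfile(fileext = ".csv")

# Base R stand-in for the write_delim() call: ';' separator, no header, no quotes
write.table(demo, tmp, sep = ";", col.names = FALSE, row.names = FALSE, quote = FALSE)

rows_txt <- readLines(tmp)
rows_txt[1] # "Michael;13421;#023858"
```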

17.5 In Review

We used R for data gathering, data prep, and making a silhouette for use in other programs. While I try to do as much as I can in R, sometimes you need to go outside of R to get things just the way you want them.

17.6 Try This At Home

Try this with your state data!

There are a couple of word cloud packages for R that can get close to this finished product. Give them a go and see if the effort to get to a similar end result justifies the time spent.