Drop #332 (2023-09-11): The AWK Programming Language @ 35 & CLI Sparkline Bars
(This was originally published on hrbrmstr's Daily Drop.)
For folks who did not catch the Bonus Drop (free for all!) over the weekend, but are still somewhat concerned over Google's rollout of their laughable "Privacy Sandbox" (it is anything but that), you can use this checker I built to see if your Chrome configs need some tweaking.
No TL;DR today, as we're taking a deep dive into two combined topics, and there's going to be quite a bit to go through; plus, some of the bash code examples have fun bits in them that may be useful in other contexts.
The AWK Programming Language @ 35 & CLI Sparkline Bars
We're combining coverage of two resources in this section since I want to use spark (GH) to spice up some of the AWK examples. Spark does one thing and does it pretty well: make sparkline bar charts at the CLI. It's 100% bash, so it's super lightweight and runs everywhere. You'll need to install it if you're going to run the examples on your own.
The second edition of The AWK Programming Language comes out at the end of September. It's been thirty-five years since the first edition was released, which makes me feel less bad about the nine years it's been since Data-Driven Security came out. AWK itself is nearly 50 years old.
Fundamentally, AWK "just" scans text input files and splits each input line into fields automatically, leaving you to process those fields with AWK's somewhat arcane-yet-fairly-easy-to-grok processing language.
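As a tiny refresher (my example, not from the book), that automatic field splitting means a one-liner can pluck fields out of whitespace-delimited text with zero parsing code:

```shell
# Print the 1st and 3rd whitespace-separated fields of each line;
# awk does all the splitting for us (hypothetical sample data).
printf 'alpha 1 10\nbeta 2 20\n' | awk '{ print $1, $3 }'
# alpha 10
# beta 20
```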
While the AWK ecosystem has certainly evolved over the years, a big change in the one, true awk is direct support for CSV files. Yes, AWK finally groks CSV, and each CSV column gets put into the corresponding ordered input field. The awk binary that ships with my macOS and Ubuntu systems does not have the --csv option ("yet", I guess). If you head to the link in the first sentence of this paragraph, clone the repo, check out the csv branch, run make, and then mv a.out cawk, you can follow along with the upcoming examples.
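If you'd rather see those steps spelled out as commands, here's a sketch; I'm assuming the onetrueawk GitHub repo, and the csv branch may have been merged or renamed by the time you read this:

```shell
# Hedged sketch of building the CSV-capable awk from source;
# repo URL and branch name are assumptions that may have changed.
git clone https://github.com/onetrueawk/awk.git
cd awk
git checkout csv
make             # produces a.out
mv a.out cawk    # rename so it won't shadow the system awk
```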
With this change to the AWK program, the second edition has an entire chapter on exploratory data analysis (EDA) where the authors (all three original authors!) walk through some use cases.
We'll do a bit of the same here, with a World Bank "tourism" file I grabbed in 2021. We'll stick the filename in a variable to make the examples shorter:
# https://rud.is/dl/world-bank-tourism-arrivals-2000-2020.csv
$ DATAFILE="${HOME}/Data/world-bank-tourism-arrivals-2000-2020.csv"
$ head "${DATAFILE}"
year,country,arrivals,lat,lng
2000,Albania,317000,41.3317,19.8172
2001,Albania,354000,41.3317,19.8172
2002,Albania,470000,41.3317,19.8172
2003,Albania,557000,41.3317,19.8172
2004,Albania,645000,41.3317,19.8172
2005,Albania,748000,41.3317,19.8172
2006,Albania,937000,41.3317,19.8172
2007,Albania,1127000,41.3317,19.8172
2008,Albania,1420000,41.3317,19.8172
(The dataset is a decent reminder of how bad 2020 was for the human race. #NeverForget #CovidIsNotDoneWithUs)
One suggested use of AWK (in combo with other *nix utils) is file structure validation. The authors go into more detail than I will here, but it's stupid easy to make sure all records in a CSV file have the same number of columns:
$ ./cawk --csv 'NR > 1 { print NF }' "${DATAFILE}" | uniq
5
We need to include NR > 1 because, while AWK now knows the CSV format, it won't exclude the header (if there is one) by default.
To prove it truly groks CSV, let's see the first three countries in the file:
$ ./cawk --csv 'NR > 1 { print $2 }' "${DATAFILE}" | uniq | head -3
Albania
Algeria
American Samoa
It does! But it's fairly clear the authors don't actually do a ton of formal, reproducible EDA, since they continue to use the $NUMBER syntax for column references, which every decent data scientist knows is not great. We can do better; this is a more readable version of the command we just ran:
$ ./cawk --csv -v country="2" \
    'NR > 1 { print $country }' "${DATAFILE}"
The -v option lets us map variables to values, so we have a (janky) way to get column names back.
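The same trick works in any stock awk if you settle for naive comma splitting via -F, (no quoted-field handling, unlike --csv). A quick hypothetical sketch:

```shell
# Map a column name to its position with -v; -F, splits on bare
# commas, so this breaks on quoted fields that contain commas.
printf 'year,country,arrivals\n2020,Albania,123\n' | \
  awk -F, -v country="2" 'NR > 1 { print $country }'
# Albania
```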
Let's pick a random country from the countries that have records for 2020 (you'll get different results):
$ ./cawk --csv \
-v year="1" \
-v country="2" \
'NR > 1 && $year == "2020" { print $country }' "${DATAFILE}" | \
shuf -n 1
Singapore
Let's use spark to see if there was a stark drop-off in tourism arrivals for Singapore:
$ ./cawk --csv \
-v country="2" \
-v arrivals="3" \
'NR > 1 && $country == "Singapore" { print $arrivals }' "${DATAFILE}" | \
xargs spark
(sparkline output: Singapore's yearly arrivals, ending with the 2020 collapse)
As expected, tourism was super hurt at the start of the pandemic.
AWK's a fully baked language, so we can do some data ops with it, like computing the total tourism influx to Singapore for all the years in the data file:
$ ./cawk --csv \
-v country="2" \
-v arrivals="3" \
'$country == "Singapore" \
{ arrivals_cum_sum += $arrivals } \
END \
{ print arrivals_cum_sum } \
' "${DATAFILE}"
245411500
You can write full-on programs in AWK, so it can do much of what you may be used to in Python, R, Perl, etc. I'm not sure your team would appreciate that, though, since AWK is not really the stats cruncher in any modern data science stack.
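To give a feel for that, here's a small hypothetical sketch of AWK-as-a-language: a user-defined function plus an END block, runnable in any stock awk:

```shell
# AWK supports user-defined functions, printf, and END blocks,
# which covers a surprising amount of quick aggregation work.
printf '10\n20\n30\n' | awk '
  function mean(total, count) { return total / count }
  { sum += $1; n++ }
  END { printf "mean=%.1f\n", mean(sum, n) }'
# mean=20.0
```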
Let's do one more example. First, we'll save off the list of countries that have records in 2020:
$ ./cawk --csv \
-v year="1" \
-v country="2" \
'NR > 1 && $year == "2020" { print $country }' \
"${DATAFILE}" > /tmp/2020-countries
Now, we'll use that list to work on only countries with 2020 entries.
The NR==FNR line creates an associative array from /tmp/2020-countries, which we can use to test for inclusion, and we use another associative array to keep the grouped values together (the second line). The last line iterates over that array and prints out country:#### #### #### #### …. We then rely on some core bash idioms to get our CLI dashboard:
$ ./cawk --csv \
-v year="1" \
-v country="2" \
-v arrivals="3" \
'NR==FNR { countries2020[$1]++; next } $country in countries2020 \
{ arrivals_by[$country] = arrivals_by[$country] $arrivals " " } END \
{ for (country in arrivals_by) print country ":" arrivals_by[country] } \
' \
/tmp/2020-countries "${DATAFILE}" | \
head -20 | \
while IFS=: read country arrivals; do
  echo -e "${country}\t$(echo ${arrivals} | xargs spark)"
done | \
column -t -s $'\t'
(output: twenty rows, one per country (Cote d'Ivoire, Denmark, Belize, Namibia, St. Lucia, Liechtenstein, Estonia, Mongolia, Spain, Andorra, Togo, Indonesia, Montenegro, New Zealand, Japan, El Salvador, Brunei Darussalam, Belgium, Bolivia, and Greece), each followed by its arrivals sparkline)
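The NR==FNR idiom in that pipeline is worth internalizing on its own. While awk reads the first file, NR and FNR are equal, so we build a lookup set and skip to the next line; on the second file they differ, so only the membership test runs. A minimal stand-alone version (hypothetical files, stock awk):

```shell
# Build a set from file one, then filter file two with it.
printf 'beta\n' > /tmp/keep.txt
printf 'alpha 1\nbeta 2\ngamma 3\n' > /tmp/data.txt
awk 'NR==FNR { keep[$1]++; next } $1 in keep' /tmp/keep.txt /tmp/data.txt
# beta 2
```

(A pattern with no action, like `$1 in keep`, prints matching lines by default.)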
We'll cover one more spark example (one that also uses AWK's new CSV powers), but before we do that, I encourage everyone to grab the book, read the EDA chapter, and keep Appendix A (the AWK reference manual) around. It's chock full of useful snippets that will make your CLI life easier and help you out in a pinch.
We can use AWK's CSV parsing capabilities and spark to see when it might rain over the coming hours, with a little help from our old pal Tomorrow.io. ip-api.com lets you grab your IP geolocation information sans key and in CSV format, which we can parse with AWK.
# Get what ip-api thinks is our lat/lng (it's very very wrong for me).
# And, we are also pretending ip-api doesn't have the field= query parameter.
# And, yes, we could have just used JSON and jq.
latlng=$(curl -s http://ip-api.com/csv/ | ./cawk --csv '{ print $8 "," $9 }')
# Get the forecast for that location
fcast=$(curl --silent --header 'accept: application/json' \
"https://api.tomorrow.io/v4/weather/forecast?location=${latlng}&apikey=${TOMORROWIO_API_KEY}")
# Get a graph of precipitation % chance over the coming hours
echo "${fcast}" | jq ".timelines.hourly[].values.precipitationProbability" | xargs spark
(sparkline output: hourly precipitation-probability chart for the forecast window)
Both the new-and-improved AWK and spark are fun and useful tools that make doing some CLI work a bit more engaging and speedy.
FIN
I highly doubt AWK will unseat any of the popular CLI data science tools any time soon. And I have no idea when/if the CSV support will come baked into distros or package updates. But it's easy to compile, is self-contained, and requires fewer resources and dependencies than, say, R or Python. It might not be a bad idea to use it plus some other CLI tools to do some data validation before production scripts run, or before you dig into a new dataset.
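As a hypothetical sketch of that pre-flight idea, even a stock awk (naive comma splitting, so quoted fields will fool it) can gate a pipeline on every row having the same field count:

```shell
# Exit non-zero if any row's field count differs from the header's;
# check_csv is a made-up helper name, not an existing tool.
check_csv() {
  awk -F, 'NR == 1 { want = NF } NF != want { bad = 1 } END { exit bad }' "$1"
}
printf 'a,b\n1,2\n3\n' > /tmp/ragged.csv
if check_csv /tmp/ragged.csv; then echo "OK"; else echo "ragged rows"; fi
# ragged rows
```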
I've put the full second edition table of contents online and am looking forward to replacing my dead-tree copy of the first edition with the updated one.