Chopped
A TidyTuesday project analyzing Chopped chef contestant location data.
This week we are looking at data from the Food Network’s cooking show “Chopped”. I decided to focus more on the data analysis portion of the project and just use a simple bar chart for my viz.
I noticed that some contestant records in the data set contained the hometown or home “base” of the Chef. The overwhelming majority are from New York City, no doubt because of the massive number of restaurants and also because the show is filmed there.
So that fact did not seem that interesting for an analysis. But I was curious which city had the most cooks on the show, per capita. In other words, which city is ‘punching above their weight’? I had a hunch that it might be Portland, Maine, which apparently has the most restaurants per square mile in the country.
Brief sidenote. My wife and I were married in Portland, Maine and our head chef, Matt Ginn, was actually on Chopped! (Room for ’Shrooms episode) If you ever go to Portland, please try Evo. It is truly outstanding.
OK now I’m getting hungry. Let’s dive in.
Setup
# library(tidyverse)
# library(extrafont)
# library(ggthemes)
# library(tidyverse)
# library(knitr)
# library(kableExtra)
Load Data
I am trying something new here where I save the data locally and store in content/blog folder instead of pinging github everytime I rebuild the site with blogdown. Seems friendlier to do this, at the least.
# chopped <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-25/chopped.tsv')
# write_csv(chopped, "chopped.csv")
chopped <- read_csv("chopped.csv")
head(chopped)
## # A tibble: 6 x 21
## season season_episode series_episode episode_rating episode_name episode_notes
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 1 9.2 Octopus, Du… This is the …
## 2 1 2 2 8.8 Tofu, Blueb… This is the …
## 3 1 3 3 8.9 Avocado, Ta… <NA>
## 4 1 4 4 8.5 Banana, Col… In the appet…
## 5 1 5 5 8.8 Yucca, Wate… <NA>
## 6 1 6 6 8.5 Canned Peac… <NA>
## # … with 15 more variables: air_date <chr>, judge1 <chr>, judge2 <chr>,
## # judge3 <chr>, appetizer <chr>, entree <chr>, dessert <chr>,
## # contestant1 <chr>, contestant1_info <chr>, contestant2 <chr>,
## # contestant2_info <chr>, contestant3 <chr>, contestant3_info <chr>,
## # contestant4 <chr>, contestant4_info <chr>
Manipulate Data
Here I attempt to stack the contestant info, using the tidy long form instead of wide. The data are stored as four columns, like contestant1:contestant4, and then each contestant has some info about them, in columns like contestant_info1:contestant_info4. The location data I want is in that info column.
I could not figure out how to do this in one pivot_longer command, which in theory would have solved this problem. Please chime in if you see how to do this. Instead, I created two dataframes, one with the contestant name data and one with the contestant info data, and binded the columns together. I kept the episode info in case I needed to filter out certain episodes, like kid/teen centric, celebrity charities, or tournaments. I was unable to set the names of the new columns within the bind_cols function, which seemed odd. I was able to handle using rename easily enough.
contestants <- chopped %>%
select(episode_name, episode_notes, contestant1:contestant4_info)
info <- select(contestants, starts_with("episode"), ends_with("_info")) %>%
pivot_longer(cols = contains("contestant"))
conts <- select(contestants, starts_with("episode"), !ends_with("_info")) %>%
pivot_longer(cols = contains("contestant")) %>%
bind_cols(select(info, value)) %>%
rename("contestant_name" = "value...4", "contestant_info" = "value...5")
Clean the Data
First step here is to filter out the childrens, using regex “^age”, as that is the prefix for every minor contestant. I only want to look at professional chefs. Turns out only the professional chef contestants have their city name included, so this saved me some steps.
TIL: You can use str_squish to get rid of whitespace in the middle of the string. Similar to str_trim. This fixed location data which had an extra space: ex.(“New York NY”).
Next I had to parse the location data from the info. The challenge here is that some entries have “contestant x is from New York, NY”. This is great, I could just use regex to extract everything after the from. BUT other entries just have “contestant x San Fransisco, CA”. So I would need some clever regex to handle both cases. I honestly couldn’t figure out how to do that. So I hatched a plan B.
Using the maps package, I could pull in a list of verified US city names, formatted, and with population data, which I need anyway! Perfect! One issue is that even though this package has 1005 entries for US cities, it is missing a few because the Chopped data uses boroughs like Brooklyn, NY and Queens, NY instead of lumping them into New York City. If anything, I am actually undercounting the number of cooks from the NYC region! So one big caveat here is that my data is very likely missing suburbs or nearby towns to many other cities.
To get the location data, I need to iterate through each city of us.cities$name, and see if there is a match in the contestant_info field. I struggled for a while with map and str_extract, but couldn’t get it to work. A kind soul at reddit eventually set me up with working code. In order to make this work, I needed to discard NA (no matches). I don’t think I ever would have figured that out. The data is stored as a list in each record, so then I just have to unnest, drop NAs, and de-duplicate the contestant names so I’m not double-counting anyone who was on the show more than once.
cities <- maps::us.cities$name
clean <- conts %>%
filter(!str_detect(contestant_info, "^age")) %>%
mutate(contestant_info = str_squish(contestant_info))
extract_cities <- function(x, cities) {
y <- discard(str_extract(x, coll(cities)), is.na)
if(length(y) == 0L) y <- y[!is.na(y)]
return(y)
}
clean$location <- map(clean$contestant_info, extract_cities, cities)
clean <- clean %>%
unnest(location) %>%
drop_na(location) %>%
distinct(contestant_name, location, .keep_all = TRUE)
Here I create a counting table and calculate the ‘cooks per capita’ metric.
uscities <- tibble(maps::us.cities)
clean2 <- clean %>%
count(location, name = "cook_count", sort=TRUE) %>%
left_join(select(uscities, name, pop), by = c("location" = "name")) %>%
mutate(per_capita = cook_count / pop * 1000) %>%
arrange(-per_capita)
Plotting
The fct_reorder function is pretty much essential to use with sideways bar charts like this. Hat tip to David Robinson who uses it often. I try to do something kind of fancy with this bar chart, where I show the number of cooks per city in a bubble using geom_label before the actual bar chart which shows ‘cooks per capita’. I doubt I used the best method to space this out. This is a common issue I have with ggplot. I need to spend more time understanding how the plot is displayed, and how to effectively traverse the plot in order to place grobs (graphical objects) properly.
clean2 %>%
top_n(per_capita, n=20) %>%
mutate(location = fct_reorder(location, per_capita)) %>%
ggplot(aes(location, per_capita)) +
geom_label(aes(y = 0, label=cook_count),
color="darkgrey",
hjust = .7,
nudge_y = -.008,
label.r = unit(.5, "lines")) +
geom_col(aes(fill=location), show.legend = FALSE) +
coord_flip() +
labs(
x = '',
y = '',
title = 'Portland, Maine With Most Cooks Per Capita',
subtitle = "But only 13 cooks, compared to NYC's 516",
caption = 'Joseph Pope | @joepope44'
) +
ggthemes::theme_fivethirtyeight() +
theme(
text = element_text(family = "Garamond"),
axis.title.y = element_text(size = 14),
plot.title.position = "plot",
plot.title = element_text(hjust = .5),
plot.subtitle = element_text(hjust = .5)
) +
scale_fill_viridis_d(option = "E", direction = -1)

All in all, fairly happy with how this came out. Sure enough, Portland, Maine is at the top.
