Data Science :: Analytics :: Visualizations

By Joseph Pope

Coffee Ratings

TidyTuesday project modelling coffee ratings

Joseph Pope

19-Minute Read

Giant Cup O' Joe

Intro

Today I am going to explore Coffee Ratings for this week’s Tidy Tuesday challenge. The data comes from the Coffee Quality Database. The great thing about Tidy Tuesday is that we can do whatever we want with the data. There are no rules or directions!

This data is very interesting with many features and looks ripe for a predictive model. I can use the various ratings provided to predict either a quantitative method, like what the ‘total_cup_score’ (Overall rating) is, or a qualitative measure, like what species of bean (Arabica or Robusta) or perhaps the country of origin.

Predicting scores strikes me as a classic regression problem, best tackled by linear models or random forests, and classification could be solved with logistic regression, Naive Bayes, SVM or even neural networks. I’ve NEVER used tidymodels before, so let’s give it a go, make some mistakes, and see what we can learn.

But first, some brief background of our data from the TidyTuesday repo:

There is data for both Arabica and Robusta beans, across many countries and professionally rated on a 0-100 scale. All sorts of scoring/ratings for things like acidity, sweetness, fragrance, balance, etc - may be useful for either separating into visualizations/categories or for modeling/recommenders.

Wikipedia on Coffee Beans:

The two most economically important varieties of coffee plant are the Arabica and the Robusta; ~60% of the coffee produced worldwide is Arabica and ~40% is Robusta. Arabica beans consist of 0.8–1.4% caffeine and Robusta beans consist of 1.7–4% caffeine.

Setup

library(tidytuesdayR)
library(tidyverse)
## ── Attaching packages ──────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2purrr   0.3.4
## ✓ tibble  3.0.1dplyr   1.0.0
## ✓ tidyr   1.1.0stringr 1.4.0
## ✓ readr   1.3.1forcats 0.5.0
## ── Conflicts ─────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ─────────────────── tidymodels 0.1.0 ──
## ✓ broom     0.5.6rsample   0.0.6 
## ✓ dials     0.0.6tune      0.1.0 
## ✓ infer     0.5.1workflows 0.1.1 
## ✓ parsnip   0.1.1yardstick 0.0.6 
## ✓ recipes   0.1.12
## ── Conflicts ────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x dials::margin()   masks ggplot2::margin()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
knitr::opts_chunk$set(
    fig.height = 5,
    fig.width = 8,
    message = FALSE,
    warning = FALSE,
    dpi = 180,
    include = TRUE
)

Load the data

tuesdata <- tidytuesdayR::tt_load(2020, week = 28)
coffee_ratings <- tuesdata$coffee_ratings

EDA

Now that we have our data and let’s dive in. Let’s review the summary statistics with skimr.

coffee_ratings %>% 
  skimr::skim()
Table 1: Data summary
Name Piped data
Number of rows 1339
Number of columns 43
_______________________
Column type frequency:
character 24
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 170 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.09 3.50 0 81.08 82.50 83.67 90.58 ▁▁▁▁▇
number_of_bags 0 1.00 154.18 129.99 0 14.00 175.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.38 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1.00 7.52 0.40 0 7.33 7.58 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1.00 7.40 0.40 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1.00 7.54 0.38 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
body 0 1.00 7.52 0.37 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1.00 7.52 0.41 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
uniformity 0 1.00 9.83 0.55 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.76 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.62 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.50 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1.00 0.09 0.05 0 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1750.71 8669.44 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.35 8668.81 1 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
altitude_mean_meters 230 0.83 1775.03 8668.63 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁

Check Initial Assumptions

I originally wanted to predict “total_cup_points”. However, after some review it appears that this field is simply the sum of ten other ratings, from aroma to cupper_points. What I really want to predict is Cupper Points. This is the reviewer’s (aka the Cupper’s) personal overall rating for the cup of joe. I can use this as dependant variable in a regression analysis. This should help me identify which features (aroma, acidity, elevation, country of origin?) influence the final rating the most.

Potential Features

Before we dig further into the data, let’s speculate what might factors go into a superior coffee bean. If coffee is like wine, then perhaps region (lat, long, altitude, preciptation, average hours of sunlight, etc) makes a big difference. Perhaps certain companies utilize best practices and farming techniques. Maybe certain processing methods are better than others. We are given the testing date and the expiration date, so we can add a feature that measures how fresh the beans are. These features and more could be really important or totally useless. I have no idea! Let’s see what the data has to say about it.

The wiki in the readme suggests that coffee beans are split 60/40 worldwide, by Arabica and Robusta. However, for this analysis the split is much more imbalanced. I thought it would be very interesting to build an Arabica vs. Robusta classifier but with such an imbalanced set I don’t think upsampling or other methods will be enough to build a robust model.

table(coffee_ratings$species)
## 
## Arabica Robusta 
##    1311      28

Let’s take a peak at the distribution of country of origin field. It is mentioned in the articles that Ethiopia wins top prize, however most of the coffees are from Central and South America.

coffee_ratings %>% 
  group_by(country_of_origin) %>% 
  tally() %>% 
  top_n(., 10) %>% 
  arrange(desc(n))
## # A tibble: 10 x 2
##    country_of_origin                n
##    <chr>                        <int>
##  1 Mexico                         236
##  2 Colombia                       183
##  3 Guatemala                      181
##  4 Brazil                         132
##  5 Taiwan                          75
##  6 United States (Hawaii)          73
##  7 Honduras                        53
##  8 Costa Rica                      51
##  9 Ethiopia                        44
## 10 Tanzania, United Republic Of    40

Let’s pull this list of top 10 countries into a list to reference later.

top_countries <- coffee_ratings %>% 
  group_by(country_of_origin) %>% 
  tally() %>% 
  top_n(., 10) %>% 
  arrange(desc(n)) %>% 
  pull(country_of_origin)

Out of curiousity, let’s see the count of coffee bean color. This was news to me, but apparently coffee beans are mostly blue or green in color when fresh. It is only after roasting that the color darkens to that familiar, uh, “coffee-brown” color. If you learn nothing else today, you can at least learn this.

The Coffee Bean Spectrum

The Coffee Bean Spectrum

table(coffee_ratings$color)
## 
##   Blue-Green Bluish-Green        Green         None 
##           85          114          870           52

Robusta is seen below with a higher modal peak, suggesting the species, known for its higher caffeine levels, is rated higher than Arabica. This could be due to a small sample size though.

coffee_ratings %>% 
  mutate(id = row_number()) %>% 
  select(id, species, cupper_points) %>% 
  ggplot(aes(cupper_points, fill = species)) +
  geom_density(alpha = 0.1)

  # geom_histogram(position = "identity", alpha = 0.7, show.legend = FALSE)

TIDYMODELS

I have never used tidymodels, so let’s see how this goes. Let’s try to tackle the regression model first. From the skim package, we know we have no missing values or junky data for this. Nice!

First thing to do is narrow our data and make it ripe for modeling. TidyTuesday mentions that Yorgos Askalidis already determined that altitude, processing_method and color have no bearing to the cupping score, and I think that makes sense. I will omit those features for now. That leaves us with species, country of origin and several ratings and measurements of the coffee samples. I am going to take the top ten countries only (out of 36 in the data) and lump the rest into “Other” with forcats. This is simply to speed up the processing without losing too much information gain for the feature.

Note - I went back after trying the models a few times and removed ‘species’. In this smaller, filtered dataset there are only 3 records with Robusta coffee beans. It caused me some issues, and with only 3 members of the minority class, I decided to just remove it completely.

coffee <- coffee_ratings %>%
  mutate(id = row_number()) %>% # add ID field, convert NA to Other
  select(id, country_of_origin, variety, aroma:moisture) %>% # select features
  mutate(country_of_origin = case_when(
    country_of_origin %in% top_countries ~ country_of_origin,
    TRUE ~ "Other"
  )) %>%
  mutate_if(is.character, as.factor) %>% # convert country and variety to factors
  filter(cupper_points != 0) %>% # remove 0 point scores as junk data
  drop_na() # drop NA for country and variety fields. only affects 200 records

Split the data

Note - I initially tried to stratify by species, even though there aren’t many Robusta coffee samples in the data. I ended up removing this due to the severe class imbalance. It may have been worth the effort to upsample Robusta and then stratify by species.

set.seed(1234)

coffee_split <- coffee %>% 
  initial_split()

coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)

Naive linear model with no scaling or pre-processing

lm_spec <- linear_reg() %>%
  set_engine(engine = "lm")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
lm_fit <- lm_spec %>%
  fit(cupper_points ~ .,
    data = coffee_train
  )

lm_fit
## parsnip model object
## 
## Fit time:  10ms 
## 
## Call:
## stats::lm(formula = formula, data = data)
## 
## Coefficients:
##                                   (Intercept)  
##                                      2.501509  
##                                            id  
##                                     -0.000309  
##                     country_of_originColombia  
##                                     -0.099026  
##                   country_of_originCosta Rica  
##                                     -0.020488  
##                     country_of_originEthiopia  
##                                      0.029321  
##                    country_of_originGuatemala  
##                                     -0.045107  
##                     country_of_originHonduras  
##                                     -0.013921  
##                       country_of_originMexico  
##                                      0.030019  
##                        country_of_originOther  
##                                      0.003251  
##                       country_of_originTaiwan  
##                                     -0.134839  
## country_of_originTanzania, United Republic Of  
##                                      0.052325  
##       country_of_originUnited States (Hawaii)  
##                                     -0.316031  
##                          varietyBlue Mountain  
##                                     -0.234205  
##                                varietyBourbon  
##                                     -0.004215  
##                                varietyCatimor  
##                                      0.018467  
##                                 varietyCatuai  
##                                      0.015110  
##                                varietyCaturra  
##                                      0.036397  
##                    varietyEthiopian Heirlooms  
##                                     -0.094942  
##                  varietyEthiopian Yirgacheffe  
##                                      0.011110  
##                                  varietyGesha  
##                                      0.046588  
##                          varietyHawaiian Kona  
##                                      0.296594  
##                                   varietyJava  
##                                      0.007740  
##                             varietyMandheling  
##                                      0.001044  
##                             varietyMarigojipe  
##                                     -0.006565  
##                             varietyMundo Novo  
##                                     -0.047911  
##                                  varietyOther  
##                                     -0.019686  
##                               varietyPacamara  
##                                     -0.067598  
##                                  varietyPacas  
##                                     -0.001151  
##                            varietyPache Comun  
##                                     -0.134161  
##                               varietyPeaberry  
##                                     -0.022037  
##                               varietyRuiru 11  
##                                     -0.016437  
##                                   varietySL14  
##                                      0.017365  
##                                   varietySL28  
##                                     -0.256004  
##                                   varietySL34  
##                                     -0.055397  
##                               varietySulawesi  
##                                      0.040376  
##                                varietySumatra  
##                                      0.096590  
##                        varietySumatra Lintong  
##                                      0.080224  
##                                 varietyTypica  
##                                     -0.035801  
##                         varietyYellow Bourbon  
##                                      0.086077  
##                                         aroma  
##                                     -0.003595  
##                                        flavor  
##                                      0.241316  
##                                    aftertaste  
##                                      0.278652  
##                                       acidity  
##                                      0.040398  
##                                          body  
##                                      0.022329  
##                                       balance  
##                                      0.194880  
##                                    uniformity  
##                                     -0.022991  
##                                     clean_cup  
##                                      0.013229  
##                                     sweetness  
##                                     -0.043329  
##                                      moisture  
##                                     -0.505604
library(ranger)

rf_spec <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

rf_spec
## Random Forest Model Specification (regression)
## 
## Computational engine: ranger
## Random Forest Model Specification (regression)
##
## Computational engine: ranger
rf_fit <- rf_spec %>%
  fit(cupper_points ~ .,
    data = coffee_train
  )

rf_fit
## parsnip model object
## 
## Fit time:  468ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      834 
## Number of independent variables:  13 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.04135314 
## R squared (OOB):                  0.7374583

Both models produce a hefty R squared, > .7! This makes me nervous as it’s likely too good to be true (overfit). I know that I skipped some preprocessing steps just to see how these models worked, so let’s go back and see if some recipes can bring my models back down to earth.

Preprocessing With Recipes

  1. Let’s create a bland, vanilla recipe to start off
rec_obj <- recipe(cupper_points ~ ., data = coffee_train) %>% 
  update_role(id, new_role = "ID") %>% 
  step_normalize(all_predictors(), -all_nominal()) %>% 
  step_dummy(all_nominal())

summary(rec_obj)
## # A tibble: 14 x 4
##    variable          type    role      source  
##    <chr>             <chr>   <chr>     <chr>   
##  1 id                numeric ID        original
##  2 country_of_origin nominal predictor original
##  3 variety           nominal predictor original
##  4 aroma             numeric predictor original
##  5 flavor            numeric predictor original
##  6 aftertaste        numeric predictor original
##  7 acidity           numeric predictor original
##  8 body              numeric predictor original
##  9 balance           numeric predictor original
## 10 uniformity        numeric predictor original
## 11 clean_cup         numeric predictor original
## 12 sweetness         numeric predictor original
## 13 moisture          numeric predictor original
## 14 cupper_points     numeric outcome   original

I tell the recipe not to use the “id” column in the analysis and we can see that it is classified as a “ID” and not “predictor”. I run summary on the recipe object and can see the type of role the variables have, as well as the type. It is important to note that ‘nominal’ is yet another term to describe a factor/string/non-numeric.

Then I use the step_* functions to normalize the data (it centers and scales numerical features) and to create dummy variables for all of my nominal (non-numeric) features.

So we built our recipe. Next step is to prep and juice. I’m getting thirsty.

coff_train <- juice(prep(rec_obj))

dim(coff_train)
## [1] 834  50
names(coff_train)
##  [1] "id"                                            
##  [2] "aroma"                                         
##  [3] "flavor"                                        
##  [4] "aftertaste"                                    
##  [5] "acidity"                                       
##  [6] "body"                                          
##  [7] "balance"                                       
##  [8] "uniformity"                                    
##  [9] "clean_cup"                                     
## [10] "sweetness"                                     
## [11] "moisture"                                      
## [12] "cupper_points"                                 
## [13] "country_of_origin_Colombia"                    
## [14] "country_of_origin_Costa.Rica"                  
## [15] "country_of_origin_Ethiopia"                    
## [16] "country_of_origin_Guatemala"                   
## [17] "country_of_origin_Honduras"                    
## [18] "country_of_origin_Mexico"                      
## [19] "country_of_origin_Other"                       
## [20] "country_of_origin_Taiwan"                      
## [21] "country_of_origin_Tanzania..United.Republic.Of"
## [22] "country_of_origin_United.States..Hawaii."      
## [23] "variety_Blue.Mountain"                         
## [24] "variety_Bourbon"                               
## [25] "variety_Catimor"                               
## [26] "variety_Catuai"                                
## [27] "variety_Caturra"                               
## [28] "variety_Ethiopian.Heirlooms"                   
## [29] "variety_Ethiopian.Yirgacheffe"                 
## [30] "variety_Gesha"                                 
## [31] "variety_Hawaiian.Kona"                         
## [32] "variety_Java"                                  
## [33] "variety_Mandheling"                            
## [34] "variety_Marigojipe"                            
## [35] "variety_Moka.Peaberry"                         
## [36] "variety_Mundo.Novo"                            
## [37] "variety_Other"                                 
## [38] "variety_Pacamara"                              
## [39] "variety_Pacas"                                 
## [40] "variety_Pache.Comun"                           
## [41] "variety_Peaberry"                              
## [42] "variety_Ruiru.11"                              
## [43] "variety_SL14"                                  
## [44] "variety_SL28"                                  
## [45] "variety_SL34"                                  
## [46] "variety_Sulawesi"                              
## [47] "variety_Sumatra"                               
## [48] "variety_Sumatra.Lintong"                       
## [49] "variety_Typica"                                
## [50] "variety_Yellow.Bourbon"
coff_train
## # A tibble: 834 x 50
##       id aroma flavor aftertaste acidity  body balance uniformity clean_cup
##    <int> <dbl>  <dbl>      <dbl>   <dbl> <dbl>   <dbl>      <dbl>     <dbl>
##  1     2  3.84   3.55       3.27    3.35 3.35     2.68      0.312     0.225
##  2     3  2.77   3.03       3.04    2.84 3.02     2.68      0.312     0.225
##  3     5  2.22   3.03       2.54    3.10 3.35     2.41      0.312     0.225
##  4    10  1.66   3.27       3.27    3.10 0.575    2.68      0.312     0.225
##  5    12  2.22   2.78       2.30    2.56 2.09     1.94      0.312     0.225
##  6    13  1.66   3.55       2.77    2.84 1.80     1.68      0.312     0.225
##  7    16  1.40   3.03       3.51    2.05 2.42     1.44      0.312     0.225
##  8    19  2.77   2.26       2.03    2.05 1.50     1.44      0.312     0.225
##  9    21  1.40   2.26       2.03    3.10 2.72     1.44      0.312     0.225
## 10    23  1.96   2.26       2.30    1.51 1.17     1.94      0.312     0.225
## # … with 824 more rows, and 41 more variables: sweetness <dbl>, moisture <dbl>,
## #   cupper_points <dbl>, country_of_origin_Colombia <dbl>,
## #   country_of_origin_Costa.Rica <dbl>, country_of_origin_Ethiopia <dbl>,
## #   country_of_origin_Guatemala <dbl>, country_of_origin_Honduras <dbl>,
## #   country_of_origin_Mexico <dbl>, country_of_origin_Other <dbl>,
## #   country_of_origin_Taiwan <dbl>,
## #   country_of_origin_Tanzania..United.Republic.Of <dbl>,
## #   country_of_origin_United.States..Hawaii. <dbl>,
## #   variety_Blue.Mountain <dbl>, variety_Bourbon <dbl>, variety_Catimor <dbl>,
## #   variety_Catuai <dbl>, variety_Caturra <dbl>,
## #   variety_Ethiopian.Heirlooms <dbl>, variety_Ethiopian.Yirgacheffe <dbl>,
## #   variety_Gesha <dbl>, variety_Hawaiian.Kona <dbl>, variety_Java <dbl>,
## #   variety_Mandheling <dbl>, variety_Marigojipe <dbl>,
## #   variety_Moka.Peaberry <dbl>, variety_Mundo.Novo <dbl>, variety_Other <dbl>,
## #   variety_Pacamara <dbl>, variety_Pacas <dbl>, variety_Pache.Comun <dbl>,
## #   variety_Peaberry <dbl>, variety_Ruiru.11 <dbl>, variety_SL14 <dbl>,
## #   variety_SL28 <dbl>, variety_SL34 <dbl>, variety_Sulawesi <dbl>,
## #   variety_Sumatra <dbl>, variety_Sumatra.Lintong <dbl>, variety_Typica <dbl>,
## #   variety_Yellow.Bourbon <dbl>

Baking applies the pre-processing steps that were ‘juiced’ above to the test data. This way, the train and test data will have the same pre-processing steps (aka recipe) applied to them. Clever!

coff_test <- rec_obj %>%
  prep() %>% 
  bake(coffee_test) 

So now we see all our predictors including all the dummy variables that were generated. Neat.

Better, Faster, Stronger

Now we can go back to the linear and random forest regression models used earlier, but now substitute the data with better processing. I think I see why they went with the “recipe” motif! Like baking, I can swap in ingredients without having to start over from scratch.

lm_fit1 <- fit(lm_spec, cupper_points ~ ., coff_train)
glance(lm_fit1$fit)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl>  <dbl>
## 1     0.751         0.736 0.204      49.4 2.55e-203    49   168. -237. -0.335
## # … with 2 more variables: deviance <dbl>, df.residual <int>
tidy(lm_fit1) %>% 
  arrange(desc(abs(statistic)))
## # A tibble: 49 x 5
##    term                        estimate std.error statistic  p.value
##    <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                 7.71     0.109         70.7  0.      
##  2 id                         -0.000309 0.0000479     -6.45 1.99e-10
##  3 aftertaste                  0.0944   0.0163         5.79 1.01e- 8
##  4 balance                     0.0665   0.0132         5.02 6.39e- 7
##  5 flavor                      0.0789   0.0173         4.56 6.07e- 6
##  6 country_of_origin_Taiwan   -0.135    0.0462        -2.92 3.63e- 3
##  7 moisture                   -0.0230   0.00823       -2.79 5.36e- 3
##  8 country_of_origin_Colombia -0.0990   0.0434        -2.28 2.28e- 2
##  9 sweetness                  -0.0177   0.00819       -2.16 3.09e- 2
## 10 variety_SL28               -0.256    0.119         -2.16 3.14e- 2
## # … with 39 more rows
lm_predicted <- augment(lm_fit1$fit, data = coff_train) 
select(lm_predicted, id, cupper_points, .fitted:.std.resid)
## # A tibble: 834 x 9
##       id cupper_points .fitted .se.fit   .resid   .hat .sigma .cooksd .std.resid
##    <int>         <dbl>   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1     2          8.58    8.52  0.0749  0.0562  0.135   0.204 2.80e-4     0.297 
##  2     3          9.25    8.45  0.0431  0.795   0.0447  0.202 1.52e-2     3.99  
##  3     5          8.58    8.39  0.0707  0.187   0.120   0.204 2.67e-3     0.977 
##  4    10          8.5     8.49  0.0721  0.00789 0.125   0.204 5.00e-6     0.0414
##  5    12          8.5     8.34  0.0379  0.162   0.0345  0.204 4.75e-4     0.807 
##  6    13          8.33    8.43  0.0435 -0.0979  0.0456  0.204 2.36e-4    -0.492 
##  7    16          8.17    8.43  0.0443 -0.264   0.0472  0.204 1.78e-3    -1.33  
##  8    19          8.42    8.21  0.0570  0.206   0.0783  0.204 1.92e-3     1.05  
##  9    21          8.17    8.25  0.0407 -0.0779  0.0398  0.204 1.29e-4    -0.390 
## 10    23          8.58    8.24  0.0402  0.344   0.0388  0.204 2.44e-3     1.72  
## # … with 824 more rows
ggplot(lm_predicted, aes(.fitted, cupper_points)) +
  geom_point(alpha = .2) +
  ggrepel::geom_label_repel(aes(label = id),
                            data = filter(lm_predicted, abs(.resid) > 2)) +
  labs(title = "Linear Model: Actual vs. Predicted Cupper Points") +
  geom_smooth()

Whoa, what’s up with coffee sample 963? We predicted a score of 7.4 or so, and the real score was a 5.25. This appears just to be a major outlier, so we can likely ignore.

filter(coffee, id == 963)
## # A tibble: 1 x 14
##      id country_of_orig… variety aroma flavor aftertaste acidity  body balance
##   <int> <fct>            <fct>   <dbl>  <dbl>      <dbl>   <dbl> <dbl>   <dbl>
## 1   963 Taiwan           Typica   7.83   7.75       7.67    7.83  7.83    7.75
## # … with 5 more variables: uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>,
## #   cupper_points <dbl>, moisture <dbl>

Now let’s repeat for the random forest model. Note - glance does not work for ranger objects like rf_fit1$fit.

rf_fit1 <- fit(rf_spec, cupper_points ~ ., coff_train)
rf_fit1$fit
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      834 
## Number of independent variables:  49 
## Mtry:                             7 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.04222097 
## R squared (OOB):                  0.7319486
rf_predicted <- bind_cols(.fitted = rf_fit1$fit$predictions, data = coff_train)
select(rf_predicted, id, cupper_points, .fitted)
## # A tibble: 834 x 3
##       id cupper_points .fitted
##    <int>         <dbl>   <dbl>
##  1     2          8.58    8.60
##  2     3          9.25    8.31
##  3     5          8.58    8.54
##  4    10          8.5     8.41
##  5    12          8.5     8.30
##  6    13          8.33    8.36
##  7    16          8.17    8.33
##  8    19          8.42    8.24
##  9    21          8.17    8.29
## 10    23          8.58    8.14
## # … with 824 more rows
ggplot(rf_predicted, aes(.fitted, cupper_points)) +
  geom_point(alpha = .2) +
  # ggrepel::geom_label_repel(aes(label = id),
  #                           data = filter(rf_predicted, abs(.resid) > 2)) +
  labs(title = "Random Forest: Actual vs. Predicted Cupper Points") +
  geom_smooth()

results_train <- lm_fit1 %>%
  predict(new_data = coff_train) %>%
  mutate(
    truth = coff_train$cupper_points,
    model = "lm"
  ) %>% 
bind_rows(rf_fit1 %>%
    predict(new_data = coff_train) %>%
    mutate(
      truth = coff_train$cupper_points,
      model = "rf"
    ))
results_train %>%
  group_by(model) %>%
  rmse(truth = truth, estimate = .pred)
## # A tibble: 2 x 4
##   model .metric .estimator .estimate
##   <chr> <chr>   <chr>          <dbl>
## 1 lm    rmse    standard       0.198
## 2 rf    rmse    standard       0.114

Apply model to the test set

lm_fit1 %>%
  predict(coff_test) %>%
  bind_cols(coff_test) %>%
  metrics(truth = cupper_points,
          estimate = .pred) %>%
  bind_cols(model = "lm") %>% 
  bind_rows(
    rf_fit1 %>%
      predict(coff_test) %>%
      bind_cols(coff_test) %>%
      metrics(truth = cupper_points,
        estimate = .pred
      ) %>% 
  bind_cols(model = "rf")
  )
## # A tibble: 6 x 4
##   .metric .estimator .estimate model
##   <chr>   <chr>          <dbl> <chr>
## 1 rmse    standard       0.244 lm   
## 2 rsq     standard       0.629 lm   
## 3 mae     standard       0.144 lm   
## 4 rmse    standard       0.231 rf   
## 5 rsq     standard       0.664 rf   
## 6 mae     standard       0.130 rf

We see that the random forest and linear model are both very close together in performance. However, the random forest does have a small RMSE and higher R squared, which means it does a slightly better job explaining the variance and has less overall error in predicting the ratings.

Next Steps

I did all of this analysis without knowing anything about tidymodels. Additionally, I am fairly new to using R for predicting values as well. It would be interesting to see how these models perform with cross-validation, optimized parameters and better feature selection.

The concept of a “recipe” which you can modify and reuse is incredibly interesting in this context. I have a lot more to learn here, and would direct readers to the great YouTube videos made by Julia Silge.

Say Something

Comments

Nothing yet.

Recent Posts

categories

About

BI and Data Science Content