By Joseph Pope

Endangered Plants

A TidyTuesday project exploring endangered plant life.

Today we are analyzing extinct and endangered plants for TidyTuesday.

Every week I find myself utilizing the truly excellent site Data-to-Viz by Yan Holtz and Conor Healy for inspiration on what I want to try and learn with R. The site is interactive and helps you figure out the best visualization given numerical or categorical data. It’s well worth a bookmark in your favorite browser.

This week’s data set has three tables, each with several columns/features and many ways to cut up the data. I decided to try something I’ve never done before and build a circular barplot.

The code is long and complicated. I do not recommend this for beginners. However, the effect is pretty striking and it’s a nice skill to learn. I was tempted to build a stacked circular bar chart, which is REALLY impressive, AND looks cool, but for my first foray into polar coordinates, I decided to just do a grouped bar chart, which was honestly pretty tough on its own.

Setup

# Contents of Setup.R.
# Sourced from the Rmd to suppress warnings and messages in the output

# library(tidyverse)
# library(extrafont)
# library(ggthemes)
# library(knitr)
# library(kableExtra)

Load Data

# tuesdata <- tidytuesdayR::tt_load('2020-08-18')
# threats <- tuesdata$threats

threats <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv')

Explore Data

I am a fan of the skimr package. It gives you a great overview of the size of your data, your missing values, completion rates, and a basic histogram showing the distribution of each numeric column.

skimr::skim(threats)
Table 1: Data summary

Name                    threats
Number of rows          6000
Number of columns       8
_______________________
Column type frequency:
  character             7
  numeric               1
________________________
Group variables         None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
binomial_name              0           1.00   11   35      0       500           0
country                    0           1.00    4   44      0        72           0
continent                  0           1.00    4   13      0         6           0
group                      0           1.00    5   16      0         6           0
year_last_seen           180           0.97    9   11      0         7           0
red_list_category          0           1.00    7   19      0         2           0
threat_type                0           1.00    7   28      0        12           0

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd  p0  p25  p50  p75  p100  hist
threatened             0              1  0.15  0.36   0    0    0    0     1  ▇▁▁▁▂

Explore Data MORE

The table() function is an excellent EDA tool for comparing one or two features. It’s quick, it’s easy, and it’s effective. In this case, I want to check out threat types vs “threatened” plants. “Threatened” is a binary column (1 or 0); as I read the data, it flags whether a plant is threatened by the threat type on that row. You can see very quickly that Agriculture & Aquaculture is the most common threat to these plants. Note that there are 12 unique threat types and 500 unique plants, and each threat type has 500 rows (one per plant).

table(threats$threat_type, threats$threatened)
##                               
##                                  0   1
##   Agriculture & Aquaculture    273 227
##   Biological Resource Use      347 153
##   Climate Change               466  34
##   Commercial Development       414  86
##   Energy Production & Mining   455  45
##   Geological Events            482  18
##   Human Intrusions             480  20
##   Invasive Species             419  81
##   Natural System Modifications 397 103
##   Pollution                    486  14
##   Transportation Corridor      489  11
##   Unknown                      393 107
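
To rank the threat types instead of reading them off the table, the 0/1 flag can be summed per threat type and sorted. A minimal sketch in base R, assuming a small made-up data frame (`toy_threats` is hypothetical and only stands in for `threats`):

```r
# Hypothetical stand-in for `threats`: one row per plant-threat pair.
toy_threats <- data.frame(
  threat_type = c("Pollution", "Pollution", "Climate Change",
                  "Climate Change", "Unknown", "Unknown"),
  threatened  = c(1, 0, 1, 1, 0, 0)
)

# Sum the 0/1 flag per threat type, then sort from most to least common.
threat_counts <- tapply(toy_threats$threatened, toy_threats$threat_type, sum)
threat_counts <- sort(threat_counts, decreasing = TRUE)
threat_counts
```

Swapping `toy_threats` for `threats` should give the "1" column of the table above, ordered from the most to the least common threat.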

Data Processing

After some ideation and online research (again, let me plug Data-to-Viz) I decided to tackle a circular bar graph. Below I filter to look only at threatened plants, and count them by continent and country.

I decided to scale the count between 0 and 100, but this was mainly done for expediency, as I wanted to better fit my data into the existing code “template” I used/borrowed/lifted/nicked from Data-to-Viz. This decision means that I lose the true count of threatened plants, but still keep the visual effect. The countries that have the most threatened plants per continent will still be evident. If I were doing this chart for a professional project, I would not scale here. The Saint Helena code is just to truncate a long country name.

data <- threats %>% 
  filter(threatened == 1) %>% 
  group_by(continent, country) %>% 
  count(name = "threatened_plants", sort = TRUE) %>% 
  ungroup() %>% 
  mutate(continent = factor(continent),
         country = ifelse(str_detect(country, "Saint Helena"), "Saint Helena", country),
         threatened_plants = scales::rescale(threatened_plants, to = c(0, 100))) %>% 
  filter(threatened_plants != 0)
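
The rescaling step above is worth seeing on a few numbers. As I understand it, `scales::rescale()` with `to = c(0, 100)` applies a linear min-max mapping, so the sketch below reproduces that arithmetic by hand in base R. The `counts` vector is made up for illustration and is not taken from the data set.

```r
# Hypothetical counts of threatened plants for five countries.
counts <- c(2, 5, 10, 2, 42)

# Min-max rescaling to [0, 100]: the smallest count maps to 0, the largest to 100.
rescaled <- (counts - min(counts)) / (max(counts) - min(counts)) * 100
round(rescaled, 1)

# Countries tied for the minimum end up at exactly 0,
# so a subsequent `!= 0` filter removes them.
kept <- counts[rescaled != 0]
kept
```

One consequence of this design: any country tied for the lowest count is rescaled to 0 and then dropped by the final filter, so it will not appear in the chart.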

Plotting Setup

There are a number of steps necessary to set up the data before plotting a circular bar chart with the clever grouped formatting structure. I kept the comments from the Data-to-Viz website, and added a few when worthwhile.

The steps are essentially to:

1. Create empty bar rows in the data to allow grouping.
2. Build labels and add them to the data. In this case, we will label the country per bar and use continent for the grouping. Thankfully all the geometry was already calculated and I did not adjust it.
3. Base Data - this data frame handles the continent labelling and the line arc aesthetics.
4. Grid Data - this data frame handles the 20/40/60/80 grid lines. It basically acts as the y-axis label.
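
The first step is easier to follow on a tiny example. This is a minimal sketch in base R, assuming a made-up data frame `toy` with hypothetical columns `group`, `name`, and `value`; it is not the code used for the plot, only the same padding idea in isolation.

```r
# Hypothetical toy data: two groups with a couple of bars each.
toy <- data.frame(
  group = factor(c("A", "A", "B")),
  name  = c("a1", "a2", "b1"),
  value = c(10, 20, 30)
)

empty_bar <- 2  # number of blank bars appended to each group

# Build NA rows, `empty_bar` per group, with only the group column filled in.
padding <- data.frame(
  group = factor(rep(levels(toy$group), each = empty_bar),
                 levels = levels(toy$group)),
  name  = NA_character_,
  value = NA_real_
)

# Combine, sort by group so the blanks sit at the end of each group,
# then number the bars in plotting order.
padded <- rbind(toy, padding)
padded <- padded[order(padded$group), ]
padded$id <- seq_len(nrow(padded))
padded
```

The NA rows draw nothing, which is what leaves a visible gap between groups once the bars are wrapped into a circle.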

# Set a number of 'empty bar' to add at the end of each group
empty_bar=3
to_add = data.frame( matrix(NA, empty_bar*nlevels(data$continent), ncol(data)) )

colnames(to_add) = colnames(data)
to_add$continent = rep(levels(data$continent), each=empty_bar)

data=rbind(data, to_add)
data=data %>% arrange(continent)
data$id=seq(1, nrow(data))


# Get the name and the y position of each label
label_data=data
number_of_bar=nrow(label_data)
angle= 90 - 360 * (label_data$id-0.5) /number_of_bar     # I subtract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0)
label_data$hjust<-ifelse( angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)


# prepare a data frame for base lines
base_data=data %>% 
  group_by(continent) %>% 
  summarize(start=min(id), end=max(id) - empty_bar) %>% 
  rowwise() %>% 
  mutate(title=mean(c(start, end)))

# prepare a data frame for grid (scales)
grid_data = base_data
grid_data$end = grid_data$end[ c( nrow(grid_data), 1:(nrow(grid_data)-1))] + 1
grid_data$start = grid_data$start - 1
grid_data=grid_data[-1,]
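
The label geometry above is the least obvious part, so here is the same arithmetic worked through for a hypothetical circle of eight bars. The formula is the one used above; only the number of bars and the variable name `label_angle` are assumptions for illustration.

```r
number_of_bar <- 8          # hypothetical: eight bars around the circle
id <- seq_len(number_of_bar)

# Rotation for a label pointing radially outward from the centre of each bar.
# Subtracting 0.5 targets the middle of the bar rather than its edge.
angle <- 90 - 360 * (id - 0.5) / number_of_bar

# Labels on the left half (angle < -90) would be upside down, so flip them by
# 180 degrees and right-align them instead.
hjust       <- ifelse(angle < -90, 1, 0)
label_angle <- ifelse(angle < -90, angle + 180, angle)

data.frame(id, angle, hjust, label_angle)
```

With eight bars, the first four labels keep their angles (67.5, 22.5, -22.5, -67.5) and the last four are flipped to the same four values with `hjust = 1`, which is why the text stays readable on both sides of the circle.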

Build Plot

I did not alter this code very much, except for variables (obv), the plot.margin setting in theme(), which I commented out so the plot title is displayed, and the hjust argument, which I used to fiddle with the alignment of the continent labels in the inner circle.

I hope this was helpful and good luck on your next project.

p = ggplot(data, aes(x=as.factor(id), y=threatened_plants, fill=continent)) +       # Note that id is a factor. If x is numeric, there is some space between the first bar
  
  geom_bar(aes(x=as.factor(id), y=threatened_plants, fill=continent), stat="identity", alpha=0.5) +
  
  # Add grid lines at y = 80/60/40/20. I do it at the beginning to make sure the bars are drawn OVER them.
  geom_segment(data=grid_data, aes(x = end, y = 80, xend = start, yend = 80), colour = "white", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_data, aes(x = end, y = 60, xend = start, yend = 60), colour = "white", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_data, aes(x = end, y = 40, xend = start, yend = 40), colour = "white", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_data, aes(x = end, y = 20, xend = start, yend = 20), colour = "white", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  
  # Add text showing the value of each grid line (20/40/60/80)
  annotate("text", x = rep(max(data$id),4), y = c(20, 40, 60, 80), label = c("20", "40", "60", "80") , color="white", size=3 , angle=0, fontface="bold", hjust=1) +
  
  geom_bar(aes(x=as.factor(id), y=threatened_plants, fill=continent), stat="identity", alpha=0.5) +
  ylim(-100,120) +
  labs(
    title = "Number of Threatened Plants by Geo",
    subtitle = "Scaled from 0-100",
    caption = "Joseph Pope | @joepope44"
  ) + 
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#a3b18a"),
    plot.title = element_text(hjust = 0.5, size = 24, family  = "Garamond"),
    plot.subtitle = element_text(hjust = 0.5, family  = "Garamond"),
    plot.caption = element_text(hjust = 0.5, family = "Garamond")
    # plot.margin = unit(rep(-1,4), "cm") <- this hides the plot title if not commented out
  ) +
  coord_polar() + # This makes the circle! 
  geom_text(data=label_data, aes(x=id, y=threatened_plants+10, label=country, hjust=hjust), color="black", fontface="bold",alpha=0.7, size=3.5, angle= label_data$angle, inherit.aes = FALSE ) +
  
  # Add base line information
  geom_segment(data=base_data, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE )  +
  geom_text(data=base_data, aes(x = title, y = -18, label=continent),
            hjust=c(1,1,.5,0,0,0), colour = "black", alpha=0.8, size=4, 
            fontface="bold", inherit.aes = FALSE) 
 
p
