Day 3



1 ggplot: A grammar of graphics

How to create a beautiful masterpiece in 4 easy steps.

1. Load some data


We stumbled upon a torn page from a field notebook with a few penguin observations. The only problem is the species and island information is torn off. Let’s use graphicial analysis to fill in the gaps for these lost penguins.

library(tidyverse)

url <- "https://tidy-mn.github.io/R-camp-penguins/data/torn_penguins.csv"

lost_penguins <- read_csv(url)

2. Start a blank canvas

ggplot(lost_penguins)

3. Add aesthetics with aes()

Assign the X and Y axes inside the aesthetics.

ggplot(lost_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm))

4. Add geom’s

Let’s start with geom_point(). This will create a scatterplot comparing the bill length to flipper length of the lost penguins.

NOTE

Use a plus sign (+) to add more layers to your ggplot masterpiece. Similar to how the %>% chains functions together, the + in ggplot stacks on more layers to your plot. If we were to say the + out loud, we could read it as “take my plot and ADD X to it”.

ggplot(lost_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point()


Increase the size of the points with the argument size =. Let’s bump the size up to 10.

ggplot(lost_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(size = 10)


Now that we’ve taken a quick look at the lost penguins, let’s bring in the rest of our penguin observations to see if we can determine where they best fit it the overall population.

Load the untorn penguins observations

url <- "https://tidy-mn.github.io/R-camp-penguins/data/untorn_penguins.csv"

good_penguins <- read_csv(url)


bind_rows()

It helps to have all our data in a single table when creating plots. Let’s combine the good penguin data and the lost penguin data into one dataframe with the function bind_rows(). We pass this function the name of two dataframes, and then it returns a combined table with the first table’s rows appearing on top. Let’s call our combined table "all_penguins".

Combine / stack the two penguin tables with bind_rows()

all_penguins <- bind_rows(good_penguins, lost_penguins)


Update our scatterplot to use the all_penguins data.

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(size = 10)

Oof! That is a lot of penguins.


The alpha argument is used to control the transparency of the points. Let’s set it to 0.25 (25%) and see if that helps view the data’s distribution better. The darker areas will show where there are more overlayed points, which indicates that there is a cluster of penguin observations in that area with the same combination of bill and flipper length.

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(size = 10, 
             alpha = 0.25)


Let’s color the points by the species column in the data and see if the clusters are related to penguin species. We can bump up the alpha value to better see the individual “missing” species penguins.

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(aes(color = species),
             size = 10, 
             alpha = 0.45)

Looks like the clusters are species related! And even better news, our lost penguins appear to all be chilling up top with the Gentoo cluster.


As another visual check for the “missing” penguin species, let’s set the X-axis to the species column to more clearly see how the flippers length of our lost penguins stack up against each of the three species.

ggplot(all_penguins, 
       aes(x = species, 
           y = flipper_length_mm)) +
  geom_point(aes(color = species),
             size = 10, 
             alpha = 0.45)

Well then. It certainly looks like our lost penguins have flippers most similar to the Gentoo species.

A geom for every occasion

View the full list of geoms in the ggplot cheatsheet.

geom_text()

Let’s use geom_text() to add a layer that makes the species clusters in our plot easier to pick out. It’s good practice to use something other than color changes to convey important information in your plots.


The label = argument used insidegeom_text() tells ggplot which values to print on the plot. Remember the + sign when adding new layers!

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(aes(color = species),
             size = 10, 
             alpha = 0.45) +
  geom_text(aes(label = species)) 

Now, that’s a bit too busy to pick out where the missing species penguins are. I wonder if there is a way to zoom in on them in the chart?

lims()

The lims() or limits function is used to set the boundaries of the X and Y axes. This can be helpful to zoom in on an area, or to force multiple charts to have the same axes range regardless of their values. Let’s use lims() to zoom in on the upper right portion of our chart. We can set the X-axis to focus on the upper range of bill lengths, let’s say (40, 55), and set the Y-axis limit to focus on the upper range of flipper lengths, say (200, 225).


The X and Y limit range is placed inside our old friend the c(min_value, max_value) vector. The first value will be used for the minimum of the axis and the second value will be used for the maximum.

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(aes(color = species),
             size = 10, 
             alpha = 0.45) +
  geom_text(aes(label = species)) +
  lims(x = c(40, 55),
       y = c(200, 225))

Island plot

Now let’s make an entire new plot that compares bill length and flipper length for each of the islands. Shucks, do we have to start all over? Nope! It’s time for some code magic.

We can start with the code from our last plot, and simply swap in island column for species. Voila!


Island comparison

ggplot(all_penguins, 
       aes(x = bill_length_mm, 
           y = flipper_length_mm)) +
  geom_point(aes(color = island),
             size = 10, 
             alpha = 0.45) +
  geom_text(aes(label = island), size = 3) +
  lims(x = c(40, 55),
       y = c(200, 225))

Now that is a beautiful masterpiece. And it sure looks like our lost penguins came from Biscoe island.


Good work! The lost penguin saga has come to a close. The three Gentoo penguins from Biscoe island can now safely join the data with the rest of the penguin population.

Glossary for ggplot

Common aesthetics used in aes()

aes() Description
x = X-axis values.
y = Y-axis values.
size = The size of the point, column or line.
alpha = The transparency of the object (0.25 equates to 25%).
fill = The fill color of a column or area (points and lines do not have fill).
color = The color for points and lines, the outline color for columns and areas.
linetype = The type of line to use. Examples include solid, dotted, dashed and more.
shape = The shape for points. Includes the default solid circle, diamonds, squares and more.


The ggplot cheatsheet - View more examples and the aesthetics used in each geom

2 ggplot: Facets, colors, and labels

Load the raw R/V Topscy-scurvy data

library(tidyverse)

url <- "https://tidy-mn.github.io/R-camp-penguins/data/rv_scurvy_penguins.csv"

rv_penguins <- read_csv(url)

Something is fishy

Let’s dig into this data to see if something stands out and helps explain the unexpected size difference between the years. First we’ll need to break out of those column boxes and plot the the raw bill length values.

1. We begin with a ggplot( ) canvas

ggplot(rv_penguins)

2. Assign X/Y axes in aes()

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm))


3. Add a geom_point layer with a + sign

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_point(alpha = 0.40)


4. Add some color and jitter with geom_jitter

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_jitter(aes(color = year), 
              alpha = 0.4)


5. Increase point size

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_jitter(aes(color = year), 
              alpha = 0.40, 
              size = 4) 


6. Hide the legend

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_jitter(aes(color = year), 
              alpha = 0.40, 
              size = 4, 
              show.legend = FALSE) 


7. Since we’re data sleuthing, let’s plot in undercover night mode

We’ll install the hrbrthemes package.

install.packages("hrbrthemes")

Now we can add a custom theme.

We can safely ignore all the font WARNINGS. We’ll save changing fonts for another day.

library(hrbrthemes)

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_jitter(aes(color = year), 
              alpha = 0.40, 
              size = 4, 
              show.legend = FALSE) +
  theme_ft_rc()


Sooo…looking over at the ’09 the observations do look a bit sparse on the low end under 40 mm or so. What’s that about?


How about flipper length?

ggplot(rv_penguins,
       aes(x = year, y = flipper_length_mm)) +
  geom_jitter(aes(color = year), 
              alpha = 0.40, 
              size = 4, 
              show.legend = FALSE) +
  theme_ft_rc()


Now that is odd. Where did all the smallies on the low end go in 2008 and 2009?

Let’s check things out by island. Earlier we learned how to split our plot into groups by color, but now we want each island to be on its own separate chart.

I ❤️ facet_wrap()

Facet wrap will create a new version of your chart for every value in a column of your choosing. These charts are also referred to as multi-panel or small-multiples.

They are popular with maps as well. Here is one showing how much/little sunshine we get each month of the year.


Let’s split our plot so each island has its own panel

Facet by island

ggplot(rv_penguins,
       aes(x = year, y = bill_depth_mm)) +
  geom_point(aes(color = year), 
             alpha = 0.40, 
             size = 4, 
             show.legend = FALSE) +
  facet_wrap(vars(island)) +
  theme_ft_rc()


Oh my lands! Look at the later years on Torgersen. Now that is fishy. The observations on that island plummeted in 2008, and only 3 penguins in 2009! Somebody was really struggling to catch those little penguins!

But still, why the longer bills in later years? Let’s see if this sampling bias affected the species as well. We can make the same chart but swap in species for island in the facet wrap.

Facet by species

ggplot(rv_penguins,
       aes(x = year, y = bill_length_mm)) +
  geom_point(aes(color = year), 
             alpha = 0.40, 
             size = 4, 
             show.legend = FALSE) +
  facet_wrap(vars(species)) +
  theme_ft_rc()


It’s not as glaring as the island chart, but it does look like the Adelie penguins really take a hit in ’09.

Simmer time

To double check, let’s simmer things down and count the number of observations for each species by year.

species_count <- rv_penguins %>% 
                 group_by(species, year) %>%
                 summarize(n = n())

Now, we can use the simple column chart with facet_wrap.

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(color = species)) +
  facet_wrap(vars(species)) +
  theme_ft_rc()

Woops-a-daisy! I did color = instead of fill =. For columns, color controls the color of the column border. Oh, and I forgot to hide the legend.

Here we go:

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(fill = species), show.legend = FALSE) +
  facet_wrap(vars(species)) +
  theme_ft_rc()


Add some labels for the people in back.

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(fill = species), show.legend = FALSE) +
  geom_label(aes(label = n), color = "black") +
  facet_wrap(vars(species)) +
  theme_ft_rc()


Colors

A variety of color palettes are available right in ggplot. A quick way to see some options is to begin by typing scale_ and then pressing TAB .

A menu will appear with a long list of options. The key is to pick the “scale_” that matches the aesthetic we’re using. In this case we’re using fill so we want to use one of the scale_fill_* functions.

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(fill = species), show.legend = FALSE) +
  geom_label(aes(label = n), color = "black") +
  #scale_fill_fivethirtyeight() +
  scale_fill_viridis_d(option = "viridis") +
  facet_wrap(vars(species)) +
  theme_ft_rc()

Titles and labels

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(fill = species), show.legend = FALSE) +
  geom_label(aes(label = n), color = "black", size = 6) +
  scale_fill_viridis_d(option = "viridis") +
  facet_wrap(vars(species)) +
  labs(title = "Sampling bias in Adelie species",
       subtitle = "Number of penguin observations by species",
       x = "Year",
       y = "Count",
       caption = "Source: Analytical bot Bitsy-42, 2023") +
  theme_ft_rc()


Pump up the text size

We can add base_size = as an option inside our theme_minimal() layer to adjust the overall size of the plot’s text.

ggplot(species_count, aes(x = year, y = n)) +
  geom_col(aes(fill = species), show.legend = FALSE) +
  geom_label(aes(label = n), color = "black", size = 6) +
  scale_fill_viridis_d(option = "viridis") +
  facet_wrap(vars(species)) +
  labs(title = "Sampling bias in Adelie species",
       subtitle = "Number of penguin observations by species",
       x = "Year",
       y = "Count",
       caption = "Source: Analytical bot Bitsy-42, 2023") +
  theme_ft_rc(base_size = 22)


Save our masterpiece

We’ve made some plots we can be proud of, so let’s save them so we can cherish them forever. There’s a function called ggsave to do just that.

So how do we ggsave our plots?

Let’s try help(ggsave) or ?ggsave.

# Get help
help(ggsave)
?ggsave

# Run the R code for your favorite plot first
ggplot(data, aes()) +
   .... +
   ....

# Then save your plot to a png file of your choosing
ggsave("my_masterpiece.png")


Okay, let’s give it a shot.

ggsave("2023_giant_penguins_debunked.png", 
       width = 10, 
       height = 7)



Seems the real question here is where did all the Adelie penguins go on Torgersen? Did they get slippery and harder to catch or did they swim off to a new home?

I dare say the sea is full of serious mystery. But unfortunately I don’t expect to see penguinZILLA anytime soon.

Over and out.
-Bitsy-42