Before you start…
Jump back into the project you created for the previous sessions exercises and then create a new R script. Save it as soon as you make it and give it a good name like ex_1_day_5.R
or jamie.R
and you’ll be ready to go!
The story so far
After careful study and pain-staking note collection we finally have the data that we have been waiting for…THE PENGUINS’ FAVORITE FOOD!
Now we can make delicious seafood puff pastries that penguins will love because it will have all their favorites!
The only problem is that the island data and the food data are in different data files…so in order for us to figure out where we should take the completed pastries we have to make the data play nice with each other.
Let’s start first with getting the data, that seems like a pretty good place to start!
Load the tidyverse
and palmerpenguins
packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We need to read in the new data and it even includes the penguin names! We can customize their orders! This is excellent news, maybe we can even mark each pie with their initial so they’ll know who’s is who’s…
## Rows: 344 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): studyName, Species, Sex, Favorite food, Name
## dbl (2): Sample Number, Swim speed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now that we’ve read the data in, we need to know what columns it contains so that we know how to join the data frames together! We’re going to just take a quick peek at the data and then pull the names from our new data and see which ones match to our old data.
## Rows: 344
## Columns: 7
## $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708",…
## $ `Sample Number` <dbl> 1, 3, 6, 7, 9, 11, 13, 15, 17, 19, 222, 23, 25, 29, 32…
## $ Species <chr> "Gentoo penguin (Pygoscelis papua)", "Gentoo penguin (…
## $ Sex <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEM…
## $ `Favorite food` <chr> "shrimp", "shrimp", "krill", "krill", "krill", "krill"…
## $ `Swim speed` <dbl> 19, 20, 18, 18, 18, 19, 19, 17, 19, 20, 20, 19, 18, 17…
## $ Name <chr> "Qua", "Zoe", "Cam", "Rhi", "Zee", "Nag", "Lia", "Jam"…
## [1] "studyName" "Sample Number" "Species" "Sex"
## [5] "Favorite food" "Swim speed" "Name"
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
Fantastic! It looks like the spelling of the first four columns used to ID each penguin in penguin_names match the spelling in penguins_raw: "studyName"
, "Sample Number"
, "Species"
and "Sex"
. To be really sure we can use the function intersect
to see where the names line up!
## [1] "studyName" "Sample Number" "Species" "Sex"
Now it’s time to join up! Join functions are smart, so if your columns have the same names it will automatically figure out which ones to join on so you don’t have to specify. If they do have different column names,all you have to do is pass the function an argument called by
where you give it the names of the columns to join by.
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`
Annnnd just like that, it figured out what to join! Here’s how it specified the join_by
statement: by = join_by(studyName, 'Sample Number', Species, Sex)
So we did a left join, meaning that we expect to keep all the records from the first dataframe we pass it and only those in the second that match.
Let’s filter our new dataset to see if we have all the data that we need!
## # A tibble: 4 × 20
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 22 Adelie Penguin … Anvers Biscoe Adul… N11A2
## 2 PAL0910 133 Adelie Penguin … Anvers Dream Adul… N76A1
## 3 PAL0708 22 Gentoo penguin … Anvers Biscoe Adul… N41A2
## 4 PAL0708 22 Chinstrap pengu… Anvers Dream Adul… N73A2
## # ℹ 13 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>,
## # `Favorite food` <chr>, `Swim speed` <dbl>, Name <chr>
Uh-oh. We have some penguins that don’t have matches…which seems odd because both datasets have the same number of rows…that must mean that something happened to the data!
Let’s use anti-join
to figure out which rows from penguin_names
didn’t join to penguins_raw
!
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`
## # A tibble: 4 × 7
## studyName `Sample Number` Species Sex `Favorite food` `Swim speed` Name
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 PAL0708 222 Gentoo pen… FEMA… krill 20 Gam
## 2 PAL0910 153 Adelie Pen… FEMA… fish 5 Bet
## 3 PAL0708 222 Adelie Pen… MALE fish 4 Abe
## 4 PAL0708 222 Chinstrap … MALE krill 15 Ace
Looks like someone made a typo…do you think we can fix it?
First things first, let’s check and make sure that those values don’t exist in our penguins_raw
data - it would be bad if we fixed data that wasn’t actually wrong!
## # A tibble: 0 × 17
## # ℹ 17 variables: studyName <chr>, Sample Number <dbl>, Species <chr>,
## # Region <chr>, Island <chr>, Stage <chr>, Individual ID <chr>,
## # Clutch Completion <chr>, Date Egg <date>, Culmen Length (mm) <dbl>,
## # Culmen Depth (mm) <dbl>, Flipper Length (mm) <dbl>, Body Mass (g) <dbl>,
## # Sex <chr>, Delta 15 N (o/oo) <dbl>, Delta 13 C (o/oo) <dbl>, Comments <chr>
Whew, okay, so those numbers don’t exist in the original data, so looks like we just need to update the values to the correct numbers! We should also check to make sure it worked!
Fixing typos
Use the values from the filtered penguins_raw
to correct the sample numbers in penguin_names
so they match and then check to make sure all values join.
penguin_names <- penguin_names %>% mutate(`Sample Number` = case_when(`Sample Number` == ______, ________ == _______, TRUE ~ ________)) ______(penguins_raw, penguin_names) # should have 0 rows if this works!
penguin_names <- penguin_names %>% mutate(`Sample Number` = case_when(`Sample Number` == 222 ~ 22, `Sample Number` == 153 ~ 133, TRUE ~ `Sample Number`)) anti_join(penguin_names, penguins_raw)
Yay! Let’s try this again…also let’s add clean_names
after we join because those names are hard to keep spelling correctly.
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`
Now, let’s make a plot to see what the penguins favorite food by island is! Check the solution for a way to bring everything we’ve learned together into one beautiful code chunk and for exciting new color developments!
Foodie plot
Feel free to make whatever plot you think is best: we want to know how many penguins like each food by island. Use labels, summaries, or other methods to convey this information and have some fun!
# We want the columns in decending order by count! ggplot(named_penguins %>% group_by(______, ______) %>% summarise(count = _____), aes(x=_____, y = ____, fill = reorder(favorite_food, -count))) + geom_col(position = ___________) + labs(title = "Number of penguins by their favorite food and island", x = "Island home", y = "Number of penguins we need to serve!")+ geom_text(aes(label = count, y = _____), position = position_dodge2(width = 0.9, preserve = "single")) + scale_fill_xxxx(name = _______) + theme_xxxx()
ggplot(named_penguins %>% group_by(island, favorite_food) %>% summarise(count = n()), aes(x=island, y = count, fill = reorder(favorite_food, -count))) + geom_col(position = position_dodge2(preserve = "single")) + labs(title = "Number of penguins by their favorite food and island", x = "Island home", y = "Number of penguins we need to serve!")+ geom_text(aes(label = count, y = count + 5), position = position_dodge2(width = 0.9, preserve = "single")) + scale_fill_viridis_d(name = "Favorite Food") + theme_minimal()
We might just be showing off now…but look how pretty it looks!
Anyway, krill seem to be the winner by a wide margin, so let’s get baking!