Before you start…
Jump back into the project you created for the previous exercise and then create a new R script. Save it as soon as you make it and give it a good name like ex_2_day_1.R
or gloria.R
and you’re be ready to go!
Auditor-in-training
The lead baking-bot is very impressed with our recent work. And now they’ve asked us perform a complete audit. This is our absolute dream come true. No one has been so excited to do an audit as we are RIGHT NOW!
In the audit we need to:
- Ensure good labels: Check that each ingredient has a category.
- Check for overstocking: No ingredient is more than 10 above its order threshold.
- Calculate summary statistics for both dry and liquid measures.
Now that we have our marching orders, it’s time to read in the data. Time to break out our handy-dandy read_csv
and wait a second… this data isn’t comma-delimited! What on earth are we going to do now?
Audit data
: https://tidy-mn.github.io/R-camp-penguins/data/inventory_data_expanded.txt
Here are a few general steps to follow when you receive a file in an unknown format:
- Open up the file and see if you can determine what the delimiter is and then pick the correct function to read it in (like
read_tsv
for tab-delimited orread_csv
for comma-delimited). - Use RStudio’s “Import Dataset” feature, same idea but gives you instant visual feedback on whether it is loading correctly.
- Use
read_delim
which has a nice “guess” feature that looks at the data and makes its best guess for what the delimiter is. - Ask someone - Try to find someone closer to the source of the data and ask them about the file type.
Side-bar discussion: What is a delimiter?
A delimiter is a sequence of one or more characters for specifying the boundary between independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the
:
used to separate parts of the time, such as in 03:30.Ref: Wikipedia
Getting to know the data
Depending on how you read in the data, you may feel that you already know it quite well, but it’s still a good idea to take a general look at the data to get a feel for it.
## Rows: 36
## Columns: 8
## $ INGREDIENT <chr> "PLAIN FLOUR", "SELF-RAISING FLOUR", "WHOLE …
## $ CATEGORY <chr> "FLOUR", "FLOUR", "FLOUR", "FLOUR", "SUGAR",…
## $ DRY_OR_LIQUID_MEASUREMENT <chr> "DRY", "DRY", "DRY", "DRY", "DRY", "DRY", "D…
## $ PRICE_PER_UNIT <dbl> 1.49, 3.00, 3.00, 5.00, 2.00, 2.00, 1.89, 1.…
## $ SIZE_UNITS <chr> "5LBS BAG", "5LBS BAG", "5LBS BAG", "3LBS BA…
## $ OZ <dbl> 80, 80, 80, 48, 32, 80, 32, 32, 32, 32, 16, …
## $ NUMBER_OF_UNITS_IN_STOCK <dbl> 19, 23, 12, 3, 15, 14, 24, 21, 37, 2, 11, 8,…
## $ ORDER_THRESHOLD <dbl> 20, 15, 10, 5, 10, 20, 15, 15, 30, 5, 5, 2, …
Excellent, let’s start working on our audit!
Categories for all
Our first task it to make sure that each ingredient has a category. Let’s start by determining if anything doesn’t have a category.
Category: Missing
It can often be easier to look for the absence of something than the presence of something, in this case we want to identify all rows where Category is missing.
# What goes inside is.na() and what does is.na() do? filter(audit_df, is.na( _____ ) )
filter(audit_df, is.na(CATEGORY))
AH-HA! There is indeed an ingredient that does not have a CATEGORY
. We should fix that… but how would we go about it?
Two new functions to save the day
So the first question we have is what are the possible categories for an ingredient to have? We can use the function unique
to figure out what the unique values are!
## [1] "FLOUR" "SUGAR" "FLAVORING"
## [4] "RAISING AGENT" NA "DAIRY"
## [7] "DAIRY-ALTERNATIVE" "ADDITITIVES"
Pick the category that you think is the best fit.
Now, we just need to determine where in the dataframe the missing value is. We know that CATEGORY is the second column so that’s good. So we just need to find which row SALT is in.
Enter the function which
! We can give this function a condition or test and it will tell us the row numbers where it evaluates to true or the test is passed. In other words, if we ask the function whether each of the values in the INGREDIENT row is equal to SALT, it will tell us the location where that is TRUE.
## [1] 18
ALRIGHT!
Now for the finishing touch! Similar to Excel, we can now specify the location to update using matrix-style indices: [Row number
, Column position
]
Writing wrongs
Now we have the row and column location of the missing value and the types of acceptable values that it should be, overwrite the NA with the new value you picked.
audit_df[ROW, COLUMN] <- INSERT VALUE HERE
audit_df[18, 2] <- "ADDITITIVES" # OR audit_df[18, "CATEGORY"] <- "ADDITITIVES"
Overstock: Too many goods
- Check for overstocking: “No ingredient is more than 10 above the order threshold.”
We don’t want too much food laying around or it might spoil. Let’s use filter to check each of the ingredient stock values and see if any are more than 10 above its order threshold. We’ve got this!
Not on my watch!
Filter the audit dataset to find rows where the number of units in stock are more than 10 above the order threshold.
# Try doing some math in the filter function filter(audit_df, SOMETHING_IS > ORDER_THRESHOLD + 10)
filter(audit_df, NUMBER_OF_UNITS_IN_STOCK > ORDER_THRESHOLD + 10) # OR filter(audit_df, NUMBER_OF_UNITS_IN_STOCK - ORDER_THRESHOLD > 10)
Looks like we might need to make some emergency Oatmeal Raisin cookies to get our stock down a bit, but I don’t think people will mind that much. Nice work!
Summary statistics by group
Last item on our audit check list and then we can take a break (and maybe eat some of those Oatmeal Raisin cookies… yum.)
- Calculate summary statistics for both dry and liquid measures.
To make things easier we can split the data into two dataframes, one for all the dry ingredients and one of all the liquid.
liquid <- filter(audit_df, DRY_OR_LIQUID_MEASUREMENT == "LIQUID")
dry <- filter(audit_df, DRY_OR_LIQUID_MEASUREMENT == "DRY")
Hmmm… something feels funny. Our original dataset had 36 observation, now we have one with 28 and another with 7. That only adds up to 35… seems we lost an observation somewhere.
Maybe there’s a typo in the column?
Let’s use unique
to see if there are more than two values in the column DRY_OR_LIQUID_MEASUREMENT.
## [1] "DRY" "LIQUID" NA
Uh-oh, we have an NA value. Filter will only keep a row if it knows for sure that it meets the condition. So if there is a missing or NA value, we literally don’t know the value, and R is unable to determine if it equals “DRY” or not.
Let’s looks at the row that has the NA, maybe we can fix it like we did for CATEGORY.
## # A tibble: 1 × 8
## INGREDIENT CATEGORY DRY_OR_LIQUID_MEASUREMENT PRICE_PER_UNIT SIZE_UNITS OZ
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 EGGS DAIRY <NA> 1.25 12 NA
## # ℹ 2 more variables: NUMBER_OF_UNITS_IN_STOCK <dbl>, ORDER_THRESHOLD <dbl>
Oh. It’s eggs. Huh…should they be dry or liquid? What’s the number of ounces in a dozen eggs? Honestly maybe this is best left as NA and we can ask the lead baking-bot about it later. Also, we’ve never thought so much about measuring eggs before.
Enough about eggs! Let’s make our summary stats and wrap up this audit.
Summary Stats
Use the datasets that you made using filter to calculate summary statistics for each one and store them in another object (give it a good, descriptive name).
liquid_summary <- summary(_______) dry_summary <- summary(_____)
liquid_summary <- summary(liquid) dry_summary <- summary(dry)
We’ve got to stop doing so much good work…now the lead baking-bot wants us to start organizing orders.
No rest for the weary it seems, but there are warm Oatmeal Raisin cookies.