0.0.1 Course Information

Offered by: Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group and the intellectual and imaginative powers contained therein.

Course materials developed by: Eric Kvale

Prerequisites: Basic familiarity with R (see our primer: https://www.train.org/mn/course/1122534/live-event)

1 Getting the Kitchen Ready: Environment Setup

Before we start, let's get all of the required tools set up. Everything we need, taken together, is referred to as our environment.

1.0.1 Important: Follow Along

You should have R and RStudio open, ready to run the code in each section. Don't just read - code along with us! Experiment with each section, make some tweaks, and fiddle with the data and the function arguments and parameters. Press Ctrl+Z if you break something, or just copy and paste from this document if you get lost.

1.1 Step 1: Install Required Packages

First, let’s install all the packages we’ll need for this course:

# Install required packages for text analysis
install.packages(c(
  "tidyverse",    # Data manipulation and visualization
  "tidytext",     # Text mining tools
  "stringr",      # String manipulation
  "wordcloud",    # Word cloud visualizations
  "stopwords",    # Stop word datasets
  "knitr",        # Document generation
  "DT",           # Interactive tables
  "kableExtra",   # Enhanced table formatting
  "renv",         # Environment management
  "tm"            # Framework for text mining applications 
))

1.1.1 Package Installation Tips

If you encounter installation errors, try updating R to the latest version first; some packages require recent R versions. You can check your R version with R.version.string. Also, don't be afraid to ask for help: these errors are common, and we are here to tackle them.
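For example, run these two lines in the console to see your version and refresh anything that is out of date:

# Check which R version you are running (4.0 or later is recommended)
R.version.string

# Update already-installed packages that may be out of date
update.packages(ask = FALSE)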

1.3 Step 3: Load Libraries and Test Setup

Now let’s load our libraries and test that everything is working:

# Load libraries
library(tidyverse)
library(tidytext)
library(stringr)
library(wordcloud)
library(stopwords)
library(knitr)
library(DT)
library(kableExtra)
library(tm)

# Test that a couple of packages loaded correctly.
cat("Setup successful! Here's a quick test:\n")
cat("tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
cat("tidytext version:", as.character(packageVersion("tidytext")), "\n")

# Test tokenization
test_text <- "Hello world! Can you tokenize this?"
test_tokens <- tibble(text = test_text) %>%
  unnest_tokens(word, text)

cat("Tokenization test successful! Found", nrow(test_tokens), "tokens.\n")

1.4 Troubleshooting Common Setup Issues

1.4.1 Problems? Don’t Panic! Troubleshoot.

Setup issues are normal. Here are solutions to the most common problems R users encounter.

1.4.2 Issue 1: Package Installation Fails

# If standard installation fails, try:
install.packages("tidyverse", dependencies = TRUE)

# Check your library path
.libPaths()

1.4.3 Issue 2: Cannot Load Libraries

# Check if package is installed
if (!"tidyverse" %in% installed.packages()) {
  install.packages("tidyverse")
}

# Load with error handling
tryCatch({
  library(tidyverse)
  cat("tidyverse loaded successfully!")
}, error = function(e) {
  cat("Error loading tidyverse:", e$message)
})

1.4.4 Quick Fix Checklist

  1. Restart R Session: Session > Restart R in RStudio
  2. Update R: Make sure you have R 4.0 or later
  3. Ask for Help: Request help in the chat section.

1.4.5 Issue 3: renv Problems

# If renv gives errors, you can skip it for now:
# Just load packages directly without renv

# Or reset renv if needed:
renv::restore()  # Restore from lockfile
renv::repair()   # Fix renv issues

When to Skip renv: If renv is giving you trouble, you can skip it for this workshop. It's good to know that R has environments and that renv exists, but it isn't required for the rest of the course.
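If you do want to try it, a minimal renv workflow looks roughly like this (run from your project directory; entirely optional for this workshop):

# Optional sketch of a basic renv workflow
renv::init()                   # Create a project-local library and a lockfile
install.packages("tidytext")   # Install packages as usual
renv::snapshot()               # Record the exact package versions in renv.lock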

1.5 Ready to get Cooking!

Once your setup is complete, you should be able to run this test successfully:

# Final, final, final test.
library(tidyverse)
library(tidytext)

# Create and analyze some sample text
sample_data <- tibble(
  id = 1,
  text = "Welcome to text analysis in R! This course will teach you amazing skills."
)

sample_tokens <- sample_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

cat("🎉 Setup complete! Found", nrow(sample_tokens), "meaningful words in our test text.\n")
cat("You're ready to start learning text analysis!\n")

1.5.1 Learning Philosophy

Text analysis is a skill that improves with practice, not perfection on the first try. Familiarize yourself with the data and concepts, and come back for another whack at it.

2 Introduction to Text Analysis in R

Text analysis, which draws on techniques from natural language processing (NLP), is a powerful approach for extracting meaningful insights from unstructured text data. This course will take you from raw text data to fully analyzed, visualized results using R.

Learning Objectives

By the end of this course, you will be able to:

  • Import and clean raw text data in R
  • Transform text into analyzable formats
  • Apply fundamental NLP techniques (TF-IDF, sentiment analysis)
  • Handle stop words and text preprocessing
  • Create visualizations to reveal patterns and themes
  • Uncover hidden connections in recipe data

2.0.1 Getting Help in R

Remember to use the help() function or ?function_name to learn more about any function you’re unfamiliar with. For example, try ?str_detect or help(unnest_tokens) to explore these functions in detail.
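A few ways to pull up documentation from the console:

# Different ways to open help pages
?str_detect                  # Help for a single function
help(unnest_tokens)          # Same thing, written out
help(package = "tidytext")   # Overview of everything in a package
??"sentiment"                # Search all installed help pages for a keyword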

2.1 What is Text Analysis?

Text analysis involves using computational methods to extract information, patterns, and insights from written text. In this course, we’ll focus on recipe data to discover hidden connections between cooking techniques and ingredients.

2.1.1 Key Packages We’ll Use

packages_info <- data.frame(
  Package = c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords"),
  Purpose = c("Data manipulation and visualization", 
             "Text mining and analysis", 
             "String manipulation and regex", 
             "Creating word cloud visualizations",
             "Removing common words from analysis")
)

kable(packages_info, caption = "Essential R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Essential R Packages for Text Analysis
Package Purpose
tidyverse Data manipulation and visualization
tidytext Text mining and analysis
stringr String manipulation and regex
wordcloud Creating word cloud visualizations
stopwords Removing common words from analysis

3 Hands-On Practice: Recipe Data Analysis

Let’s dive into analyzing recipe data to uncover patterns in cooking techniques and ingredients. We’ll start with raw recipe text and transform it into meaningful insights.

3.0.1 Before We Begin

Make sure you have all required packages installed. If you encounter errors, try running install.packages(c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords")) in your console.

3.1 Load Required Libraries

First, let’s load all the libraries we’ll need for our text analysis:

# Load required libraries for text analysis
library(tidyverse)    # Data manipulation and visualization
library(tidytext)     # Text mining and analysis
library(stringr)      # String manipulation
library(wordcloud)    # Word cloud visualizations  
library(stopwords)    # Stop word removal
library(knitr)        # Table formatting
library(DT)           # Interactive tables
library(kableExtra)   # Enhanced table styling

# Verify libraries loaded successfully
cat("All libraries loaded successfully! Ready for text analysis.\n")
## All libraries loaded successfully! Ready for text analysis.

3.1.1 Library Loading Tips

Run this code chunk first before proceeding with the analysis. If you get error messages about packages not being installed, go back to the setup section and install the missing packages. You can also use library(help = "package_name") to learn more about any package.

3.2 Practice Dataset: Recipe Collection

We’ll work with a collection of recipe descriptions and instructions to discover cooking patterns and ingredient relationships.

Data Source Note: In real-world projects, you might import text data using readr::read_csv(), readLines(), or specialized packages for different file formats. The rio package is particularly useful for reading various data formats!
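As a sketch, importing your own text might look like the lines below (the file names are hypothetical placeholders, not files provided with this course):

# Hypothetical examples of importing text data from your own files
my_csv   <- readr::read_csv("my_recipes.csv")    # Spreadsheet-style data
my_lines <- readLines("my_recipes.txt")          # One element per line of plain text
my_other <- rio::import("my_recipes.xlsx")       # rio guesses the format (requires the rio package)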

# Create sample recipe data
recipe_data <- data.frame(
  recipe_id = 1:8,
  recipe_name = c("Classic Chocolate Chip Cookies", "Spicy Thai Basil Chicken", 
                  "Homemade Pizza Margherita", "Creamy Mushroom Risotto",
                  "Grilled Salmon with Herbs", "Vegetarian Black Bean Tacos",
                  "Fresh Garden Salad", "Slow Cooker Beef Stew"),
  cuisine = c("American", "Thai", "Italian", "Italian", 
              "Mediterranean", "Mexican", "American", "American"),
  instructions = c(
    "Cream butter and sugar until fluffy. Mix in eggs and vanilla. Combine flour, baking soda, and salt. Gradually blend into creamed mixture. Stir in chocolate chips. Drop by spoonfuls onto ungreased cookie sheets. Bake at 375°F for 9-11 minutes.",
    "Heat oil in wok over high heat. Stir-fry chicken until cooked through. Add garlic, chilies, and basil leaves. Season with fish sauce and soy sauce. Serve immediately over steamed rice.",
    "Roll out pizza dough on floured surface. Spread tomato sauce evenly. Add fresh mozzarella and basil leaves. Drizzle with olive oil. Bake in preheated oven at 450°F for 12-15 minutes until crust is golden.",
    "Heat broth in saucepan and keep warm. Sauté onions in olive oil until translucent. Add arborio rice and stir for 2 minutes. Gradually add warm broth, stirring constantly. Add mushrooms and parmesan cheese. Season with salt and pepper.",
    "Season salmon fillets with salt, pepper, and fresh herbs. Preheat grill to medium-high heat. Grill salmon for 4-5 minutes per side until fish flakes easily. Serve with lemon wedges and grilled vegetables.",
    "Drain and rinse black beans. Sauté onions and bell peppers until soft. Add beans, cumin, chili powder, and lime juice. Warm tortillas and fill with bean mixture. Top with avocado, cilantro, and cheese.",
    "Wash and chop fresh lettuce, tomatoes, and cucumbers. Slice red onions thinly. Combine all vegetables in large bowl. Toss with olive oil and vinegar dressing. Season with salt and pepper to taste.",
    "Brown beef cubes in oil over high heat. Add chopped onions, carrots, and celery. Pour in beef broth and diced tomatoes. Add herbs and seasonings. Cook on low heat for 6-8 hours until meat is tender."
  )
)

DT::datatable(recipe_data, 
              caption = "Recipe Dataset for Text Analysis",
              options = list(pageLength = 8, scrollX = TRUE))

3.3 Step 1: Text Preprocessing and Tokenization

The first step in text analysis is cleaning and breaking down our text into individual words (tokens).

3.3.1 Pro Tip: Understanding the Pipeline

The %>% pipe operator chains functions together, making code more readable. Think of it as “and then…” - we take the data AND THEN select columns AND THEN tokenize AND THEN remove stop words.
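Here is a quick side-by-side using the built-in mtcars data, just to show the difference in reading order:

# Without the pipe: functions nest inside one another (read inside-out)
head(filter(mtcars, mpg > 25), 3)

# With the pipe: the same steps read top to bottom, "and then..."
mtcars %>%
  filter(mpg > 25) %>%
  head(3)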

# Tokenize the recipe instructions
recipe_tokens <- recipe_data %>%
  select(recipe_id, recipe_name, cuisine, instructions) %>%
  unnest_tokens(word, instructions) %>%
  # Remove stop words
  anti_join(stop_words, by = "word") %>%
  # Remove numbers and single letters
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1)

# Display sample of tokenized data
kable(head(recipe_tokens, 10), 
      caption = "Sample of Tokenized Recipe Instructions") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Sample of Tokenized Recipe Instructions
recipe_id recipe_name cuisine word
1 Classic Chocolate Chip Cookies American cream
1 Classic Chocolate Chip Cookies American butter
1 Classic Chocolate Chip Cookies American sugar
1 Classic Chocolate Chip Cookies American fluffy
1 Classic Chocolate Chip Cookies American mix
1 Classic Chocolate Chip Cookies American eggs
1 Classic Chocolate Chip Cookies American vanilla
1 Classic Chocolate Chip Cookies American combine
1 Classic Chocolate Chip Cookies American flour
1 Classic Chocolate Chip Cookies American baking

3.3.2 Understanding Stop Words

Stop words are common words that typically don’t contribute much meaning to text analysis. Let’s examine what we’re removing:

Why Remove Stop Words? Words like “the”, “and”, “is” appear frequently in all texts but don’t tell us much about the specific content. Removing them helps us focus on meaningful terms that distinguish one document from another.

# Show examples of stop words
sample_stop_words <- stop_words %>% 
  filter(lexicon == "snowball") %>%
  head(20) %>%
  select(word)

kable(sample_stop_words, 
      caption = "Examples of Stop Words Removed from Analysis",
      col.names = "Stop Words") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Examples of Stop Words Removed from Analysis
Stop Words
i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers

3.4 Step 2: Word Frequency Analysis

Let’s discover the most common cooking terms and ingredients across our recipe collection.

3.4.1 Debugging Tip

If your code isn’t working as expected, try running each line of the pipeline separately. Add a print() or View() statement after each step to see what’s happening to your data.
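For example, you can stop the pipeline partway through, save the intermediate result, and inspect it:

# Run only the first step of the pipeline and look at what it produces
partial_result <- recipe_data %>%
  unnest_tokens(word, instructions)

glimpse(partial_result)    # Column types and a preview of the values
head(partial_result, 10)   # Just the first ten rows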

# Calculate word frequencies
word_frequencies <- recipe_tokens %>%
  count(word, sort = TRUE) %>%
  top_n(15, n)

# Create bar plot of most common words
ggplot(word_frequencies, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  coord_flip() +
  labs(title = "Most Common Words in Recipe Instructions",
       subtitle = "Top 15 terms after removing stop words",
       x = "Words",
       y = "Frequency") +
  theme_minimal()

Data Interpretation

Notice how cooking verbs like “add,” “heat,” and “season” dominate our frequency analysis. This makes sense - recipes are instruction-heavy! The ingredients that appear frequently (like “oil” and “salt”) are staples across many cuisine types.

3.4.2 Word Cloud Visualization

Let’s create a visual representation of word frequencies using a word cloud:

# Create word cloud
set.seed(123)  # For reproducible results
wordcloud(words = word_frequencies$word, 
          freq = word_frequencies$n,
          min.freq = 1,
          max.words = 50,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word Cloud of Recipe Terms

3.4.3 Visualization Best Practices

Word clouds are great for initial exploration, but consider bar charts or other structured visualizations for formal presentations. The set.seed() function ensures your word cloud looks the same each time you run it - useful for reproducible analysis!

3.5 Step 3: TF-IDF Analysis

TF-IDF (Term Frequency-Inverse Document Frequency) helps us identify words that are particularly important to specific recipes or cuisines.

What is TF-IDF? This metric balances how frequently a term appears in a document (TF) against how rare it is across all documents (IDF). A word that appears often in one document but rarely in others gets a high TF-IDF score.
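As a rough sketch of the arithmetic: a term's TF is its share of the words in one document, its IDF is ln(number of documents / number of documents containing the term), and TF-IDF is their product. A toy two-document example:

# Toy illustration of the TF-IDF arithmetic on two tiny "documents"
toy <- tibble(
  doc  = c("a", "a", "a", "b", "b", "b"),
  word = c("salt", "salt", "wok", "salt", "oven", "dough")
)

toy %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)
# "salt" appears in both documents, so its idf is ln(2/2) = 0 and its tf_idf is 0;
# "wok", "oven", and "dough" appear in only one document each, so idf = ln(2/1)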

# Calculate TF-IDF by cuisine
cuisine_tfidf <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1) %>%
  count(cuisine, word, sort = TRUE) %>%
  bind_tf_idf(word, cuisine, n) %>%
  arrange(desc(tf_idf))

# Show top TF-IDF terms by cuisine
top_tfidf <- cuisine_tfidf %>%
  group_by(cuisine) %>%
  top_n(3, tf_idf) %>%   # Note: top_n() keeps ties, so tied scores return more than 3 rows per cuisine
  ungroup()

kable(top_tfidf[, c("cuisine", "word", "tf_idf")], 
      caption = "Top TF-IDF Terms by Cuisine",
      col.names = c("Cuisine", "Word", "TF-IDF Score"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top TF-IDF Terms by Cuisine
Cuisine Word TF-IDF Score
Mediterranean grill 0.1463
Mediterranean salmon 0.1463
Mexican beans 0.1288
Thai sauce 0.0833
Mediterranean easily 0.0732
Mediterranean fillets 0.0732
Mediterranean flakes 0.0732
Mediterranean grilled 0.0732
Mediterranean lemon 0.0732
Mediterranean medium 0.0732
Mediterranean preheat 0.0732
Mediterranean wedges 0.0732
Thai chicken 0.0732
Thai chilies 0.0732
Thai cooked 0.0732
Thai fry 0.0732
Thai garlic 0.0732
Thai immediately 0.0732
Thai soy 0.0732
Thai steamed 0.0732
Thai wok 0.0732
Mexican avocado 0.0644
Mexican bean 0.0644
Mexican bell 0.0644
Mexican black 0.0644
Mexican chili 0.0644
Mexican cilantro 0.0644
Mexican cumin 0.0644
Mexican drain 0.0644
Mexican fill 0.0644
Mexican juice 0.0644
Mexican lime 0.0644
Mexican peppers 0.0644
Mexican powder 0.0644
Mexican rinse 0.0644
Mexican soft 0.0644
Mexican top 0.0644
Mexican tortillas 0.0644
American beef 0.0447
American combine 0.0447
American tomatoes 0.0447
Italian broth 0.0374
Italian olive 0.0374
Italian warm 0.0374

3.5.1 Visualizing TF-IDF Results

top_tfidf %>%
  mutate(word = reorder_within(word, tf_idf, cuisine)) %>%
  ggplot(aes(word, tf_idf, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "TF-IDF Score",
       title = "Highest TF-IDF Words by Cuisine",
       subtitle = "Words most characteristic of each cuisine type") +
  facet_wrap(~cuisine, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
TF-IDF Scores by Cuisine

3.6 Step 4: Sentiment Analysis

Let’s analyze the emotional tone of our recipe instructions using sentiment analysis.

3.6.1 Package Exploration Tip

Want to see all available sentiment lexicons? Try get_sentiments("nrc"), get_sentiments("bing"), or explore with ?get_sentiments to understand the differences between emotion classification systems.
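For example (the nrc and afinn lexicons are downloaded the first time you request them, so these lines may prompt you to accept a download via the textdata package):

# Compare the sentiment lexicons available through tidytext
get_sentiments("bing") %>% count(sentiment)    # positive / negative labels
get_sentiments("nrc") %>% count(sentiment)     # emotions such as joy, anger, trust
get_sentiments("afinn") %>% head()             # numeric scores from -5 to +5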

# Get sentiments using the bing lexicon (positive/negative)
recipe_sentiments <- recipe_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(recipe_name, cuisine) %>%
  summarise(
    positive_words = sum(sentiment == "positive"),
    negative_words = sum(sentiment == "negative"),
    sentiment_score = positive_words - negative_words,
    .groups = "drop"
  ) %>%
  arrange(desc(sentiment_score))

kable(recipe_sentiments, 
      caption = "Sentiment Analysis of Recipe Instructions",
      col.names = c("Recipe", "Cuisine", "Positive Words", "Negative Words", "Sentiment Score")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Sentiment Analysis of Recipe Instructions
Recipe Cuisine Positive Words Negative Words Sentiment Score
Creamy Mushroom Risotto Italian 2 0 2
Homemade Pizza Margherita Italian 2 0 2
Vegetarian Black Bean Tacos Mexican 3 1 2
Fresh Garden Salad American 1 0 1
Slow Cooker Beef Stew American 1 0 1
Grilled Salmon with Herbs Mediterranean 1 1 0

Interpreting Sentiment Results

Recipe instructions tend to be neutral or slightly positive in language. The sentiment analysis here identifies words like “fresh,” “golden,” and “warm” as positive, while words like “drain” or “cut” might be classified as negative, even though they’re just cooking instructions.

3.6.2 Sentiment Visualization

ggplot(recipe_sentiments, aes(x = reorder(recipe_name, sentiment_score), 
                              y = sentiment_score, fill = cuisine)) +
  geom_col() +
  coord_flip() +
  labs(title = "Sentiment Scores of Recipe Instructions",
       subtitle = "Higher scores indicate more positive language",
       x = "Recipe",
       y = "Sentiment Score",
       fill = "Cuisine") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
Recipe Sentiment Scores

3.7 Step 5: Cooking Technique Analysis

Let’s identify and analyze cooking techniques mentioned in our recipes.

3.7.1 Research Mindset

When analyzing text, think like a detective. What patterns can you spot? What words cluster together? Text analysis often reveals insights that aren’t immediately obvious when just reading through documents manually.

# Define cooking techniques to search for
cooking_techniques <- c("bake", "baking", "fry", "frying", "grill", "grilling", 
                       "sauté", "boil", "boiling", "steam", "steaming",
                       "roast", "roasting", "stir", "mix", "mixing", "chop", 
                       "chopping", "season", "seasoning")

# Find cooking techniques in recipes
technique_analysis <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% cooking_techniques) %>%
  count(cuisine, word, sort = TRUE) %>%
  group_by(cuisine) %>%
  top_n(3, n) %>%
  ungroup()

kable(technique_analysis, 
      caption = "Most Common Cooking Techniques by Cuisine",
      col.names = c("Cuisine", "Technique", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Most Common Cooking Techniques by Cuisine
Cuisine Technique Frequency
Mediterranean grill 2
American bake 1
American baking 1
American chop 1
American mix 1
American season 1
American stir 1
Italian bake 1
Italian sauté 1
Italian season 1
Italian stir 1
Mediterranean season 1
Mexican sauté 1
Thai fry 1
Thai season 1
Thai stir 1

Expanding Your Analysis: Try creating your own custom dictionaries for different domains. You could create lists of spices, cooking equipment, or dietary restrictions to analyze different aspects of the text data.
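For example, a small spice dictionary could be applied exactly the same way as the cooking_techniques vector above (the word list here is just a starting point):

# Sketch: a custom spice dictionary
spices <- c("basil", "cumin", "oregano", "cilantro", "chili", "pepper", "vanilla")

recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% spices) %>%
  count(cuisine, word, sort = TRUE)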

3.7.2 Technique Distribution Visualization

ggplot(technique_analysis, aes(x = reorder_within(word, n, cuisine), 
                               y = n, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Cooking Technique", y = "Frequency",
       title = "Most Common Cooking Techniques by Cuisine Type") +
  facet_wrap(~cuisine, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
Cooking Techniques by Cuisine

3.8 Step 6: Ingredient Network Analysis

Let’s explore connections between ingredients by finding which ones commonly appear together.

# Define common ingredients to search for
ingredients <- c("chicken", "beef", "salmon", "cheese", "tomato", "onion", 
                "garlic", "oil", "salt", "pepper", "herbs", "basil", "rice",
                "beans", "avocado", "mushroom", "butter", "flour", "sugar")

# Find ingredient co-occurrences
ingredient_pairs <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% ingredients) %>%
  select(recipe_id, word) %>%
  inner_join(., ., by = "recipe_id") %>%
  filter(word.x < word.y) %>%  # Avoid duplicate pairs
  count(word.x, word.y, sort = TRUE) %>%
  top_n(10, n)

kable(ingredient_pairs, 
      caption = "Most Common Ingredient Combinations",
      col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Most Common Ingredient Combinations
Ingredient 1 Ingredient 2 Co-occurrence
pepper salt 3
avocado beans 2
basil oil 2
beans cheese 2
beef herbs 2
beef oil 2
herbs salmon 2
oil pepper 2
oil rice 2
oil salt 2
pepper salmon 2
salmon salt 2

4 Advanced Text Analysis Techniques

Now that we’ve mastered the basics, let’s explore more sophisticated methods for extracting insights from text data.

4.0.1 Performance Considerations

As your datasets get larger, consider using the quanteda package for faster processing, or data.table for memory-efficient operations. For very large texts, you might need to process data in chunks.

4.1 N-gram Analysis

Beyond individual words, let’s look at common two-word phrases (bigrams) in our recipes.

4.1.1 Function Parameters Tip

The token = "ngrams", n = 2 parameters in unnest_tokens() create bigrams. Try changing n = 3 for trigrams or n = 4 for 4-word phrases. Use ?unnest_tokens to explore all tokenization options!
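For instance, a trigram version of the same call looks like this (a sketch; in a dataset this small most three-word phrases appear only once):

# Trigrams: three-word phrases from the recipe instructions
recipe_data %>%
  unnest_tokens(trigram, instructions, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  head(5)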

# Create bigrams
recipe_bigrams <- recipe_data %>%
  unnest_tokens(bigram, instructions, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+$"),
         !str_detect(word2, "^\\d+$")) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  top_n(10, n)   # top_n() keeps ties, so every bigram tied at n = 1 is also returned

kable(recipe_bigrams, 
      caption = "Most Common Bigrams (Two-word phrases)",
      col.names = c("Bigram", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Most Common Bigrams (Two-word phrases)
Bigram Frequency
olive oil 3
basil leaves 2
sauté onions 2
add arborio 1
add beans 1
add chopped 1
add fresh 1
add garlic 1
add herbs 1
add mushrooms 1
add warm 1
arborio rice 1
avocado cilantro 1
baking soda 1
bean mixture 1
beans cumin 1
beans sauté 1
beef broth 1
beef cubes 1
bell peppers 1
black beans 1
bowl toss 1
broth stirring 1
brown beef 1
celery pour 1
cheese season 1
chili powder 1
chips drop 1
chocolate chips 1
chop fresh 1
chopped onions 1
combine flour 1
constantly add 1
cookie sheets 1
cream butter 1
creamed mixture 1
cucumbers slice 1
cumin chili 1
diced tomatoes 1
dressing season 1
easily serve 1
fish flakes 1
fish sauce 1
flakes easily 1
flour baking 1
floured surface 1
fluffy mix 1
fresh herbs 1
fresh lettuce 1
fresh mozzarella 1
fry chicken 1
garlic chilies 1
gradually add 1
gradually blend 1
grill salmon 1
grilled vegetables 1
heat add 1
heat broth 1
heat grill 1
heat oil 1
heat stir 1
herbs preheat 1
juice warm 1
leaves drizzle 1
leaves season 1
lemon wedges 1
lettuce tomatoes 1
lime juice 1
low heat 1
minutes gradually 1
mixture stir 1
mixture top 1
oil bake 1
onions carrots 1
onions thinly 1
parmesan cheese 1
pizza dough 1
preheat grill 1
preheated oven 1
red onions 1
rinse black 1
salmon fillets 1
salt gradually 1
salt pepper 1
sauce serve 1
season salmon 1
seasonings cook 1
serve immediately 1
sheets bake 1
slice red 1
soft add 1
soy sauce 1
spread tomato 1
steamed rice 1
stir fry 1
stirring constantly 1
surface spread 1
thinly combine 1
tomato sauce 1
tomatoes add 1
translucent add 1
ungreased cookie 1
vanilla combine 1
vinegar dressing 1
warm broth 1
warm sauté 1
warm tortillas 1

4.2 String Pattern Detection

Let’s use regular expressions to find specific patterns in our recipe text.

4.2.1 Regular Expression Wisdom

Regex (regular expressions) might seem intimidating at first, but they’re incredibly powerful for text analysis. Start simple: \\d+ finds any digits, [A-Z] finds capital letters. The stringr package makes regex much friendlier with functions like str_extract()!
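A couple of tiny examples of stringr's pattern functions:

# Small stringr pattern examples
str_extract("Bake at 375F for 10 minutes", "\\d+")       # "375" - the first run of digits
str_extract_all("Bake at 375F for 10 minutes", "\\d+")   # both "375" and "10"
str_detect("Preheat the Oven", "[A-Z]")                  # TRUE - contains a capital letter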

# Find temperature and time patterns
temp_time_patterns <- recipe_data %>%
  mutate(
    temperatures = str_extract_all(instructions, "\\d+°?F"),
    cooking_times = str_extract_all(instructions, "\\d+-?\\d* minutes?")
  ) %>%
  select(recipe_name, temperatures, cooking_times)

# Show temperature patterns
temp_summary <- temp_time_patterns %>%
  mutate(temp_found = map_lgl(temperatures, ~ length(.) > 0)) %>%
  filter(temp_found) %>%
  select(recipe_name, temperatures)

kable(head(temp_summary), 
      caption = "Temperature Patterns Found in Recipes",
      col.names = c("Recipe", "Temperatures")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Temperature Patterns Found in Recipes
Recipe Temperatures
Classic Chocolate Chip Cookies 375°F
Homemade Pizza Margherita 450°F

Pattern Recognition Applications: This same regex approach can extract phone numbers from customer service logs, dates from historical documents, or product codes from inventory descriptions. The possibilities are endless!
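As a sketch, the same str_extract() approach works with different patterns (the example strings below are made up):

# Hypothetical strings illustrating other pattern-extraction tasks
str_extract("Call 651-555-0123 before noon", "\\d{3}-\\d{3}-\\d{4}")        # a phone number
str_extract("Report filed on 2023-08-14 by staff", "\\d{4}-\\d{2}-\\d{2}")  # an ISO date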

4.3 Putting It All Together: Recipe Similarity

Let’s create a similarity analysis to find recipes that use similar language patterns.

# Create document-term matrix for similarity analysis
recipe_dtm <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$")) %>%
  count(recipe_name, word) %>%
  cast_dtm(recipe_name, word, n)
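The summary table below is written by hand for demonstration. If you want to compute similarity for real, one option is cosine similarity on the document-term matrix; here is a minimal sketch, assuming recipe_dtm from the chunk above:

# Optional sketch: cosine similarity between recipes, computed from recipe_dtm
dtm_mat <- as.matrix(recipe_dtm)                 # rows = recipes, columns = words
row_norms <- sqrt(rowSums(dtm_mat^2))            # length of each recipe's word-count vector
cosine_sim <- (dtm_mat %*% t(dtm_mat)) / (row_norms %o% row_norms)
round(cosine_sim, 2)                             # values near 1 = very similar wording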

# Calculate similarity (simplified version for demonstration)
similarity_summary <- data.frame(
  Analysis_Type = c("Most Similar Recipes", "Most Unique Recipe", "Common Ingredients"),
  Finding = c("Italian recipes (Pizza & Risotto)", "Thai Basil Chicken", "Salt, Oil, and Heat verbs"),
  Insight = c("Share Mediterranean cooking style", "Unique Asian flavor profile", "Universal cooking fundamentals")
)

kable(similarity_summary, 
      caption = "Key Insights from Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Key Insights from Text Analysis
Analysis_Type Finding Insight
Most Similar Recipes Italian recipes (Pizza & Risotto) Share Mediterranean cooking style
Most Unique Recipe Thai Basil Chicken Unique Asian flavor profile
Common Ingredients Salt, Oil, and Heat verbs Universal cooking fundamentals

5 Practical Applications

5.1 Real-World Text Analysis Applications

Text analysis techniques like those we’ve practiced have numerous applications:

applications <- data.frame(
  Domain = c("Healthcare", "Marketing", "Government", "Research", "Social Media"),
  Application = c("Patient feedback analysis", "Customer sentiment tracking", 
                 "Public policy document analysis", "Literature review automation",
                 "Trend identification"),
  Techniques_Used = c("Sentiment analysis, Topic modeling", "TF-IDF, Word clouds",
                     "Named entity recognition, Classification", "Text similarity, Clustering",
                     "N-gram analysis, Network analysis")
)

kable(applications, 
      caption = "Real-World Text Analysis Applications") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Real-World Text Analysis Applications
Domain Application Techniques_Used
Healthcare Patient feedback analysis Sentiment analysis, Topic modeling
Marketing Customer sentiment tracking TF-IDF, Word clouds
Government Public policy document analysis Named entity recognition, Classification
Research Literature review automation Text similarity, Clustering
Social Media Trend identification N-gram analysis, Network analysis

5.2 Your Turn: Practice Exercise

Now it’s time to apply what you’ve learned! Try analyzing this new recipe text:

5.2.1 Coding Challenge

Practice makes perfect! Take the Mediterranean recipe below and try all the techniques we’ve covered. Can you identify the cooking techniques, extract sentiment, and find interesting patterns?

practice_recipe <- data.frame(
  recipe = "Mediterranean Herb-Crusted Cod",
  instructions = "Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables."
)

kable(practice_recipe, 
      caption = "Practice Recipe for Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Practice Recipe for Analysis
recipe instructions
Mediterranean Herb-Crusted Cod Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables.

Your Analysis Tasks
  1. Tokenize the recipe instructions
  2. Identify cooking techniques used
  3. Calculate the sentiment score
  4. Extract temperature and time information using regex
  5. Compare ingredient profile to our existing recipes

5.2.2 Learning by Doing

Try copying the code chunks from earlier sections and modifying them for this new recipe. Change the dataset name and see what happens. This is how you build coding confidence!
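As a starting point, here is a minimal sketch for task 1, assuming the practice_recipe data frame from above:

# Starter sketch for the practice exercise: tokenize the new recipe
practice_tokens <- practice_recipe %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word")

head(practice_tokens, 10)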

5.3 Key Concepts Summary

5.3.1 Essential Text Analysis Concepts

  • Tokenization: Breaking text into individual words or phrases
  • Stop Words: Common words removed to focus on meaningful content
  • TF-IDF: Identifies words that are uniquely important to specific documents
  • Sentiment Analysis: Measures emotional tone or attitude in text
  • N-grams: Analysis of word sequences (bigrams, trigrams, etc.)
  • Regular Expressions: Pattern matching for extracting specific information

6 Conclusion

Text analysis in R provides powerful tools for extracting insights from unstructured data. Through our recipe analysis, we’ve demonstrated how to:

What You’ve Accomplished
  • Clean and preprocess raw text data
  • Apply fundamental NLP techniques
  • Visualize patterns and relationships
  • Uncover hidden connections in data

These skills are directly applicable to analyzing any type of text data, from customer feedback to research documents to social media content.

6.0.1 Next Steps in Your Text Analysis Journey

To continue developing your text analysis skills:

  1. Practice with different types of text data
  2. Explore additional sentiment lexicons and methods
  3. Learn topic modeling techniques for larger datasets
  4. Investigate advanced NLP packages like quanteda and spacyr
  5. Apply these techniques to your own work projects

Remember: The best way to learn text analysis is by doing it. Start with small projects and gradually tackle more complex challenges!


6.1 Additional Resources

6.1.1 Helpful R Packages for Text Analysis

advanced_packages <- data.frame(
  Package = c("quanteda", "spacyr", "tm", "topicmodels", "textdata"),
  Purpose = c("Comprehensive text analysis framework", 
             "spaCy integration for advanced NLP", 
             "Text mining framework",
             "Topic modeling algorithms",
             "Access to text analysis datasets"),
  Difficulty = c("Advanced", "Advanced", "Intermediate", "Advanced", "Beginner")
)

kable(advanced_packages, 
      caption = "Additional R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Additional R Packages for Text Analysis
Package Purpose Difficulty
quanteda Comprehensive text analysis framework Advanced
spacyr spaCy integration for advanced NLP Advanced
tm Text mining framework Intermediate
topicmodels Topic modeling algorithms Advanced
textdata Access to text analysis datasets Beginner

6.1.2 Further Reading

  • Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.
  • Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245-265.
  • Hvitfeldt, E., & Silge, J. (2021). Supervised Machine Learning for Text Analysis in R. CRC Press.

Course materials developed by Eric Kvale with the Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group.