
The Secret Sauce of Five-Star Recipes

Authors: Becca Fu (beccafuu@umich.edu), Yuchen (Wayne) Yang (siriusyc@umich.edu)

Introduction

General Introduction

The central question we are interested in is: how is a recipe’s average rating determined by its key attributes?

Dataset Details Introduction

Recipes

Interactions

Data Cleaning and Exploratory Data Analysis

Part I: Data Cleaning

  1. We start with two individual datasets, one for recipes and the other for ratings. To proceed, we left-merge the recipes and interactions datasets (a code sketch of the full cleaning pipeline appears after this list).
    • We choose a left merge because we want to ensure that all recipes are retained in the merged dataset, even if some recipes have no ratings or interactions.
  2. Once we have the combined dataset, we need to deal with 0-star ratings.
    • We choose to treat all ratings of 0 as missing values first.
    • A rating of 0 likely indicates that no rating was given, so it does not provide meaningful information about user preferences.
  3. Before proceeding to any column operations, we first deal with missing values:
    • We start by examining which columns contain missing values.
    • Then we check for missing values in the columns relevant to our analysis, including
      • nutrition
      • rating
      • minutes
      • n_steps
  4. Among all columns relevant to our analysis, only rating contains missing values.
    • We decided to use listwise deletion and probabilistic imputation; see the Imputation section for justification.
      • Listwise deletion: we drop every recipe that did not receive any rating.
      • Probabilistic imputation: we fill each missing rating by drawing a random sample from all ratings received by that recipe.
  5. We will measure a recipe’s popularity by its average rating, so we create a new column that records the average rating per recipe.

  6. Previously, we noticed that calories are stored in the nutrition column.
    • The nutrition column contains values that look like lists, but they are actually strings.
      • [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]
    • So we need to extract this information from the list-like strings.
    • After that, we create a column for each of these values per recipe.
  7. We define a new categorical column named ‘Time Involved’.
    • We categorize recipes into three groups based on their preparation time:
      • less than 30 minutes
      • 0.5 hour to 4 hours
      • 1-day-or-more
  8. From now on, we will only focus on relevant columns, including
    • average_rating
    • minutes
    • time involved
    • calories
    • n_steps
    • total fat(PDV)
    • sugar(PDV)
    • sodium(PDV)
    • protein(PDV)
    • saturated fat(PDV)
    • carbohydrates(PDV)
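
To make the pipeline concrete, here is a minimal sketch of the cleaning steps above. The raw file names (RAW_recipes.csv, RAW_interactions.csv), the merge keys (id, recipe_id), and the exact time cutoffs are assumptions; only the derived column names come from the cleaned dataframe shown below.

```python
import ast
import numpy as np
import pandas as pd

# Assumed file and key names for the raw food.com data; adjust as needed.
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Step 1: left-merge so that every recipe is kept, even with no interactions.
merged = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')

# Step 2: treat 0-star ratings as missing values.
merged['rating'] = merged['rating'].replace(0, np.nan)

# Step 5: measure popularity by each recipe's average rating.
merged['average_rating'] = merged.groupby('id')['rating'].transform('mean')

# Step 6: the nutrition column holds list-like strings; parse and split them.
nutrition_cols = ['calories', 'total fat(PDV)', 'sugar(PDV)', 'sodium(PDV)',
                  'protein(PDV)', 'saturated fat(PDV)', 'carbohydrates(PDV)']
parsed = merged['nutrition'].apply(ast.literal_eval)
merged[nutrition_cols] = pd.DataFrame(parsed.tolist(), index=merged.index)

# Step 7: bucket preparation time into three categories.
# The cutoffs below are assumptions; the labels follow the tables in this report.
def time_involved(minutes):
    if minutes < 30:
        return 'less than 30 minutes'
    elif minutes <= 240:
        return '0.5 hour to 4 hours'
    return '1-day-or-more'

merged['time involved'] = merged['minutes'].apply(time_involved)

# Step 8: keep one row per recipe with only the relevant columns.
relevant = ['id', 'minutes', 'calories', 'n_steps'] + nutrition_cols[1:] + \
           ['average_rating', 'time involved']
league_clean = merged[relevant].drop_duplicates(subset='id').reset_index(drop=True)
```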

Below is the head of our league_clean dataframe.

|   | id | minutes | calories | n_steps | total fat(PDV) | sugar(PDV) | sodium(PDV) | protein(PDV) | saturated fat(PDV) | carbohydrates(PDV) | average_rating | time involved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 275022 | 50 | 386.1 | 11 | 34 | 7 | 24 | 41 | 62 | 8 | 3 | 0.5 hour to 4 hours |
| 1 | 275024 | 55 | 377.1 | 6 | 18 | 208 | 13 | 13 | 30 | 20 | 3 | 0.5 hour to 4 hours |
| 2 | 275026 | 45 | 326.6 | 7 | 30 | 12 | 27 | 37 | 51 | 5 | 3 | 0.5 hour to 4 hours |
| 3 | 275030 | 45 | 577.7 | 11 | 53 | 149 | 19 | 14 | 67 | 21 | 5 | 0.5 hour to 4 hours |
| 4 | 275032 | 25 | 386.9 | 8 | 0 | 347 | 0 | 1 | 0 | 33 | 5 | less than 30 minutes |

Part II: Univariate Analysis

1. Difficulty Level

  1. We are interested in the difficulty level of recipes. Specifically, we wonder whether the posted recipes are complex in general.
  2. We think two features can reflect the difficulty level: preparation time and n_steps.
    • Intuitively, this makes sense because
      • the more complex a recipe is, the longer it takes to prepare.
      • the more complex a recipe is, the more steps it involves.
  3. To get some insight, we draw two histograms:
    • Histogram of preparation time
    • Histogram of n_steps
Preparation Time

We start with summary statistics, which offer a good way to distinguish recipes by their preparation time.
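
As a rough sketch of how these figures might be produced (the plotting library here, plotly express, is an assumption; only the column names come from the cleaned dataframe):

```python
import plotly.express as px

# Summary statistics of preparation time (in minutes).
print(league_clean['minutes'].describe())

# Histogram of preparation time; log_y helps if the distribution has a long tail.
fig = px.histogram(league_clean, x='minutes', nbins=100, log_y=True,
                   title='Distribution of Preparation Time (minutes)')
fig.show()

# The same call with x='n_steps' (or x='calories') reproduces the other
# univariate plots discussed below.
```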

Number of Steps

We continue by analyzing the distribution of the number of steps taken.

2. Energy Provided: Calories

  1. We are also interested in the calories of recipes. Specifically, we wonder whether
    • there are more recipes with high calories, or
    • most recipes tend to have low calories.

Part III: Bivariate Analysis

Complexity

Energy Level

Recipe Ingredients

Part IV: Interesting Aggregates

| Time Involved | Average Rating | Calories | Total Fat (PDV) | Sugar (PDV) | Sodium (PDV) | Carbohydrates (PDV) | Protein (PDV) |
|---|---|---|---|---|---|---|---|
| Less than 30 minutes | 4.64466 | 345.862 | 26.4346 | 57.0829 | 24.8203 | 11.0133 | 25.2582 |
| 0.5 hour to 4 hours | 4.61538 | 491.321 | 37.2619 | 76.4203 | 30.8117 | 15.9859 | 37.9524 |
| 1-day-or-more | 4.55493 | 513.034 | 37.1742 | 77.761 | 46.7199 | 14.4785 | 53.9995 |
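
The table above is the kind of result a grouped mean produces; a minimal sketch, assuming the cleaned dataframe from Part I:

```python
# Average rating and nutrition values within each time-involved category.
aggregates = (
    league_clean
    .groupby('time involved')[['average_rating', 'calories', 'total fat(PDV)',
                               'sugar(PDV)', 'sodium(PDV)', 'carbohydrates(PDV)',
                               'protein(PDV)']]
    .mean()
)
print(aggregates)
```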

Findings:

  1. Recipes that take “less than 30 minutes”
    • have the highest average rating of 4.64.
    • have the lowest
      • average calories
      • average protein (PDV)
      • average sugar (PDV)
  2. Recipes that take “1 day or more” have the highest average calories.

Conclusion:

  1. Preference for Convenience:
    • The high rating for short-preparation recipes suggests that many people value simplicity and speed.
    • This is particularly relevant for recipe developers and food bloggers focusing on gaining popularity; offering easy and quick recipes could attract more users.
  2. Trade-off Between Nutrition and Time:
    • Recipes that require more time tend to have richer nutritional content. This suggests a trade-off where users must balance convenience with the desire for more nutrient-dense meals.
    • Consumers who have more time to cook tend to opt for more complex and nutritionally dense dishes.

Part V: Imputation Strategy
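
As described in the data-cleaning steps, missing ratings are handled in two ways: recipes that never received any rating are dropped, and otherwise each missing rating is filled by sampling from the ratings the same recipe did receive. A minimal sketch of that probabilistic step, assuming a merged dataframe with id and rating columns (the random seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen for reproducibility of the sketch

def impute_ratings(merged):
    """Probabilistic imputation: fill each missing rating by sampling from
    the observed ratings of the same recipe; recipes with no observed
    ratings at all are dropped (the listwise-deletion step)."""
    merged = merged.copy()

    def fill_group(ratings):
        observed = ratings.dropna()
        if observed.empty:
            return ratings  # nothing to sample from; these rows are dropped below
        filled = ratings.copy()
        missing = filled.isna()
        filled[missing] = rng.choice(observed.to_numpy(), size=missing.sum())
        return filled

    merged['rating'] = merged.groupby('id')['rating'].transform(fill_group)
    # Listwise deletion: drop recipes that never received any rating.
    return merged[merged['rating'].notna()]
```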

Framing a Prediction Problem

Baseline Model

Model Description and Evaluation

1. Model Description

2. Features Summary

3. Data Preparation

4. Model Performance

5. Evaluation of Model Quality

The current model does not perform well, for the following reasons:

Final Model

1. Predictor Re-selection

2. Modeling Algorithm and Hyperparameter Selection

3. Performance Comparison

Improvement in Performance

  1. Reduction in MSE: Improvement = baseline MSE − final MSE = 0.41092464835782266 − 0.4103397379674629 ≈ 0.000585
    • The final model reduces the mean squared error by approximately 0.0006.
  2. Percentage Improvement:
    • Percentage Improvement = 0.000585 / 0.41092 × 100 ≈ 0.14%
    • This indicates that the final model improves on the baseline performance by around 0.14%.
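
The arithmetic can be checked directly, using the two MSE values reported above:

```python
baseline_mse = 0.41092464835782266
final_mse = 0.4103397379674629

improvement = baseline_mse - final_mse
pct_improvement = improvement / baseline_mse * 100

print(f'Reduction in MSE:       {improvement:.6f}')       # ≈ 0.000585
print(f'Percentage improvement: {pct_improvement:.2f}%')  # ≈ 0.14%
```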