FoodHub Data Analysis¶

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. As a Data Scientist in this company there are some key questions that need to be answered. I have to perform a data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost_of_the_order: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [ ]:
# Installing the libraries with the specified version.
# !pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [3]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Understanding the structure of the data¶

In [1]:
# uncomment and run the following lines for Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
# Write your code here to read the data
# Reading FoodHub Master Data
data = pd.read_csv('/content/drive/MyDrive/content/foodhub_order.csv')
In [ ]:
# Write your code here to view the first 5 rows
# Reading 5 top rows
data.head()
Out[ ]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

Question 1: How many rows and columns are present in the data?¶

In [ ]:
# Write your code here
print('Number of rows in the data:', data.shape[0])
print('Number of columns in the data:', data.shape[1])
Number of rows in the data: 1898
Number of columns in the data: 9

Observations:¶

  • There are 1898 rows in the dataset and 9 columns

Question 2: What are the datatypes of the different columns in the dataset?¶

In [ ]:
# Write your code here
data.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

  • There are 5 numerical columns in the data and 4 object type columns

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method.¶

Checking for missing values:

In [ ]:
# Write your code here
data.isnull().sum()
Out[ ]:
0
order_id 0
customer_id 0
restaurant_name 0
cuisine_type 0
cost_of_the_order 0
day_of_the_week 0
rating 0
food_preparation_time 0
delivery_time 0

Using this space to convert columns to more related type for them

In [ ]:
data[['customer_id','order_id']]=data[['customer_id','order_id']].astype(object)

Observations:¶

  • There are no missing values in the dataset

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

In [ ]:
# Write your code here
data.describe()
Out[ ]:
order_id customer_id cost_of_the_order food_preparation_time delivery_time
count 1.898000e+03 1898.000000 1898.000000 1898.000000 1898.000000
mean 1.477496e+06 171168.478398 16.498851 27.371970 24.161749
std 5.480497e+02 113698.139743 7.483812 4.632481 4.972637
min 1.476547e+06 1311.000000 4.470000 20.000000 15.000000
25% 1.477021e+06 77787.750000 12.080000 23.000000 20.000000
50% 1.477496e+06 128600.000000 14.140000 27.000000 25.000000
75% 1.477970e+06 270525.000000 22.297500 31.000000 28.000000
max 1.478444e+06 405334.000000 35.410000 35.000000 33.000000

Observations:¶

  • Preparation time: On average, it takes 27 minutes and 37 seconds to prepare an order with the time ranging between 20 minutes as minimum and goes up to 35 minutes as maximum time.

Question 5: How many orders are not rated?¶

In [ ]:
# Write the code here
data.groupby('rating')['rating'].count()
Out[ ]:
rating
rating
3 188
4 386
5 588
Not given 736

Observations:¶

  • There are 736 orders where clients did not provide any rate
  • There are 1162 orders with rate given between 3 and 5
  • There are no orders with rates between 1 and 2

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

  1. Cuisine Type
In [ ]:
chr = sns.countplot(data = data, x = 'cuisine_type', stat='percent',
                    legend=True, order=data['cuisine_type'].value_counts().index);
plt.xlabel('Type of food');
plt.ylabel('Total orders(%)');
plt.xticks(rotation=90);
plt.title('Prefered food by Customers');
chr.bar_label(chr.containers[0], fmt='%.0f%%');
plt.show()
  1. Cost of the order
In [ ]:
# Write the code here
sns.histplot(data = data, x = 'cost_of_the_order', kde=True);
plt.xlabel('Cost of the order');
plt.ylabel('Count');
plt.title('Distribution of Cost per order');
plt.show()
In [ ]:
sns.boxplot(data = data, x = 'cost_of_the_order');
plt.xlabel('Cost of the order');
plt.ylabel('Count');
plt.title('Distribution of Cost per order');
plt.show()
  1. Day of the week
In [ ]:
chr = sns.histplot(data = data, x = 'day_of_the_week', stat='percent', legend=True);
plt.xlabel('Day of the week');
plt.ylabel('Orders by day');
plt.title('Distribution of Order per day');
chr.bar_label(chr.containers[0], fmt='%.1f%%');
plt.show()
  1. Rating
In [ ]:
chr = sns.histplot(data = data, x = 'rating', stat='percent', legend=True);
plt.xlabel('Rating');
plt.ylabel('Value distribution');
plt.title('Rating by orders');
chr.bar_label(chr.containers[0], fmt='%.1f%%');
plt.show()

Changing Rating to float and "Not Given" values to NaN:

In [ ]:
#Replace Not given values' with nan
data['rating'] = data['rating'].replace(['Not given'],np.nan)
#Convert the data type to a float
data['rating'] = data['rating'].astype(float)
  1. Food preparation time
In [ ]:
chr = sns.histplot(data = data, x = 'food_preparation_time', legend=True, kde=True);
plt.xlabel('Preparation time');
plt.ylabel('Total Orders');
plt.show()
  1. Delivery time
In [ ]:
chr = sns.histplot(data = data, x = 'delivery_time', legend=True, kde=True);
plt.xlabel('Delivery time');
plt.ylabel('Total Orders');
plt.show()

Observations

  • American and Japanese food consolidate 56% of the total orders
  • Distribution of cost is skewed to the right
  • There are no outliers in the dataset
  • The average cost of each order is around 14 USD. This indicates that customers prefer food in the lower range of the cost
  • The bast majority of orders are requested on weekends representing 71.2% of the total
  • 38.8% of the total orders did not receive any rating. This is a high number so the column might be eliminated
  • Preparation time is distributed uniformly with some randomly orders taken more time.
  • Delivery time is partially skewed to the left but still shows a mostly uniform distribution with orders ranging between 25 and 30 minutes for delivery

Question 7: Which are the top 5 restaurants in terms of the number of orders received?¶

In [ ]:
# Write the code here
data['restaurant_name'].value_counts().nlargest(5)
Out[ ]:
restaurant_name
Shake Shack                  219
The Meatball Shop            132
Blue Ribbon Sushi            119
Blue Ribbon Fried Chicken     96
Parm                          68
Name: count, dtype: int64

Observations:¶

Top 5 restaurantes by orders received are:

  1. Shake Shack (219)
  2. The Meatball Shop (132)
  3. Blue Ribbon Sushi (119)
  4. Blue Ribbon Fried Chicken (96)
  5. Parm (68)

Question 8: Which is the most popular cuisine on weekends?¶

In [ ]:
# Write the code here
data.groupby('day_of_the_week')['cuisine_type'].value_counts().nlargest(1)
Out[ ]:
day_of_the_week  cuisine_type
Weekend          American        415
Name: count, dtype: int64

Observations:¶

The most popular food delivered on weekends is American Food

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶

In [ ]:
# Write the code here
df_v = [data[data['cost_of_the_order']>20].shape[0]/data.shape[0]*100,
      data[data['cost_of_the_order']<21].shape[0]/data.shape[0]*100]
df_l = ['More than 20 dollars', 'Less than 20 dollars']
# Creating chart
plt.figure(figsize=(6,6))
sns.set_style("whitegrid")
sns.set_palette("pastel")
plt.pie(df_v, labels=df_l, autopct='%.0f%%')
# Add a title
plt.title("Percentage of orders that cost more than $20")

# Display the plot
plt.show()

Observations:¶

  • Percentage or orders that cost more than 20 dollars is 29%

Question 10: What is the mean order delivery time?¶

In [ ]:
# Write the code here
round(data['delivery_time'].mean(), 1)
Out[ ]:
24.2

Observations:¶

  • The average delivery time for orders is 24.2 minutes

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed.¶

In [ ]:
# Write the code here
data['customer_id'].value_counts().nlargest(3)
Out[ ]:
customer_id
52832    13
47440    10
83287     9
Name: count, dtype: int64

Observations:¶

Top 3 customers by orders submitted are:

  1. 52832 (13)
  2. 47440 (10)
  3. 83287 (9)

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

Change data types for the following questions in required columns

In [ ]:
#Convert the data type to a float
data['order_id'] = data['order_id'].astype(str)
data['customer_id'] = data['customer_id'].astype(str)
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   object 
 1   customer_id            1898 non-null   object 
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1162 non-null   float64
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
 9   total_delivery_time    1898 non-null   int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 148.4+ KB
In [ ]:
# Write the code here
plt.figure(figsize=(7,3))
sns.heatmap(data=data[['cost_of_the_order', 'rating', 'food_preparation_time', 'delivery_time']].corr(), annot=True, cmap='Spectral', vmin=-1, vmax=1)
plt.show()
In [ ]:
sns.boxplot(data = data, x = 'rating', y = 'cost_of_the_order');
plt.xlabel('Rating');
plt.ylabel('Cost of the order');
plt.show()
In [5]:
sns.boxplot(data = data, x = 'cuisine_type', y = 'cost_of_the_order', hue='cuisine_type');
plt.xlabel('Cuisine Type');
plt.ylabel('Cost of the order');
plt.xticks(rotation=90);
plt.show()
In [7]:
sns.boxplot(data = data, x = 'day_of_the_week', y = 'food_preparation_time', hue='day_of_the_week');
plt.xlabel('Day of the week');
plt.ylabel('Preparation time');
plt.show()

Observations:¶

  • There is no a strong correlation between variables to determine why users are selecting some specific restaurants.
  • Notice how the rating is not affected by the cost of the order in the boxplot
  • The analysis of cuisine type vs cost also reflects that cost is not the driver for people to select American food over Koren being the last one more affordable on average.
  • Preparation time remains the same during weekdays and weekends

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶

Creating a separate Dataframe to get restaurantes with rating mean and count

In [ ]:
# Write the code here
df = data.groupby('restaurant_name')['rating'].agg(['mean','count'])
Out[ ]:
mean count
restaurant_name
'wichcraft 5.000000 1
12 Chairs 4.500000 2
5 Napkin Burger 4.000000 2
67 Burger 5.000000 1
Alidoro NaN 0
... ... ...
Zero Otto Nove 4.000000 1
brgr 3.000000 1
da Umberto 5.000000 1
ilili Restaurant 4.153846 13
indikitch 4.500000 2

178 rows × 2 columns

Filtering now restaurantes according to the conditions defined in the question

In [ ]:
df = df[(df['count']>50) & (df['mean']>4)]
df
Out[ ]:
mean count
restaurant_name
Blue Ribbon Fried Chicken 4.328125 64
Blue Ribbon Sushi 4.219178 73
Shake Shack 4.278195 133
The Meatball Shop 4.511905 84

Observations:¶

The restaurants that will get the promotional advertisement are:

  1. Blue Ribbon Fried Chicken
  2. Blue Ribbon Sushi
  3. Shake Shack
  4. The Meatball Shop

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

Creating 2 separete datasets to get the revenue by category

In [ ]:
fd_20 = data[data['cost_of_the_order']>20]
fd_5 = data[(data['cost_of_the_order']>5) & (data['cost_of_the_order']<=20)]

Now let's compute the revenue by category

In [ ]:
fd_20_rv = fd_20['cost_of_the_order'].sum()*0.25
fd_5_rv = fd_5['cost_of_the_order'].sum()*0.15

Finally combining the 2 entries and pinting

In [ ]:
print(fd_20_rv + fd_5_rv)
6166.303

Observations:¶

  • The total revenue generated by the company following the criteria defined is 6166.303 USD

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

We first need to add a new column adding up both preparation time and delivery time. New column name is: total_delivery_time

In [ ]:
data['total_delivery_time'] = data['food_preparation_time'] + data['delivery_time']

Now we can work with the new column to filter values based on the criteria defined in the question and calculate the percentage of orders taken more than 1 hour

In [ ]:
round(data[data['total_delivery_time'] > 60].shape[0]/data.shape[0]*100, 2)
Out[ ]:
10.54

Observations:¶

  • Around 10.54% of the total orders took more than 1 hour to be delivered to the client from the moment they placed the order.

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

In [ ]:
# Write the code here
data.groupby('day_of_the_week')['delivery_time'].mean().round(2)
Out[ ]:
day_of_the_week
Weekday    28.34
Weekend    22.47
Name: delivery_time, dtype: float64
In [ ]:
plt.figure(figsize=(6,6))
sns.boxplot(data = data, x = 'day_of_the_week', y = 'delivery_time');
plt.xlabel('Day of the week');
plt.ylabel('Delivery time');
plt.title('Delivery time by day of the week');
plt.show()

Observations:¶

  • During weekends the delivery time decreases and shows less max time. This might suggest that on weekends there are more people delivering food to customers making the waiting time shorter
  • On Averange, clients need to wait 6 more minutes to get their orders delivered.

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶

Conclusions:¶

  • We can see a tendency for customers to get more American and Japanese and Italian orders.
  • Orders during weekends represent 71% of the total received by restaurants.
  • 10% of all orders are taking longer than expected for users going above 60 minutes of waiting time.
  • There is no clear relationship between price and total orders received by restaurants.
  • 38% of all orders delivered, do not get any feedback.
  • The vast majority of orders (71%) cost below $20 which shows a tendency from the users to find food that is more budget friendly.

Recommendations:¶

  • The company needs to find ways to get more people to provide feedback about the orders by giving away more coupons or free deliveries for those that rate orders. Then more accurate data will be handled to provide to the restaurants.
  • Improving total delivery time during weekdays might come as an opportunity for less popular restaurants to increase sales compared to the dominant preference of American, Japanese and Italian restaurants.
  • Business should put in place promotions and other type of compensation for orders taking more than 1 hour and find the reasons behind such a dramatic time of waiting for consumers.
  • The company should add more measures to the evaluation so correlation could be determined to find what moves the customers to select a particular type of food/restaurant.
  • There's seems to be low fidelity from customers since the top 3 represent a total of 32 orders which opens space for engaging campaigns to keep clients ordering more often.
  • Since 71% of the total orders are less than 20 USD. I would suggest the company to review the revenue scheme to increase earnings. Getting a little bit more thatn $1600 for managing more than 1800 orders dot not seem to be generating enough value.