Introduction

Problem Statement

This is a regression problem statement that aims to predict the time it takes (in minutes) for a food delivery given input feature about the rider, vehicle they own, weather conditions, traffic conditions, location of the restaurant and delivery, etc.

Data

The food delivery data is available on Kaggle. Table 1 shows the entire description of the data.

Column	Description
`ID`	Unique identifier for each delivery record.
`Delivery_person_ID`	Unique identifier for the delivery person.
`Delivery_person_Age`	Age of the delivery person.
`Delivery_person_Ratings`	Average customer ratings received by the delivery person.
`Restaurant_latitude`	Latitude coordinate of the restaurant.
`Restaurant_longitude`	Longitude coordinate of the restaurant.
`Delivery_location_latitude`	Latitude coordinate of the delivery location.
`Delivery_location_longitude`	Longitude coordinate of the delivery location.
`Order_Date`	Date on which the order was placed.
`Time_Orderd`	Time at which the order was placed.
`Time_Order_picked`	Time at which the order was picked up by the delivery person.
`Weatherconditions`	Weather conditions during the delivery. `conditions Fog`, `conditions Stormy`, `conditions Cloudy`, `conditions Sandstorms`, `conditions Windy`, `conditions Sunny`
`Road_traffic_density`	Traffic condition on the delivery route. `Low`, `Medium`, `High`, `Jam`
`Vehicle_condition`	Condition of the delivery person's vehicle. `0` = Best `1` = Good `2` = Fair `3` = Worst
`Type_of_order`	Type of food item ordered. `Snack`, `Meal`, `Drinks`, `Buffet`
`Type_of_vehicle`	Type of vehicle used by the delivery person. `bicycle`, `scooter`, `motorcycle`, `electric_scooter`
`multiple_deliveries`	Number of extra deliveries handled in the same trip. `0`, `1`, `2`, `3`
`Festival`	Whether the order was placed during a festival. `Yes` = Festival day `No` = Non-festival day
`City`	Classification of the city where the order was placed. `Urban`, `Semi-Urban`, `Metropolitian`
`Time_taken(min)`	Total time taken for the delivery in minutes.

Table 1: Data description.

As we can see, the data has roughly the following types of input features:

Related to the ride and rider:
- Delivery_person_Age
- Delivery_person_Ratings
- Vehicle_condition
- multiple_deliveries
- City
Related to events:
- Weatherconditions
- Road_traffic_density
- Festival
Related to location:
- Restaurant_latitude
- Restaurant_longitude
- Delivery_location_latitude
- Delivery_location_longitude
Related to the order:
- Order_Date
- Time_Order_picked
- Type_of_order

Business Use Case

Today, many companies, like Swiggy, DoorDash, Uber Eats, Zomato, etc., provide online delivery of food items from various restaurants and kitchens. In this competitive market, on-time delivery is crucial for customer satisfaction, retention, and operational efficiency. All of these directly affect the top line of the company. Hence, companies want to optimize delivery time predictions to improve customer experience by providing accurate estimated time of arrival (ETA) of the deliveries and to manage resources efficiently. Accurate prediction of ETA can also allow businesses to:

Improve Delivery Efficiency: Identifying factors that slow down deliveries enables better resource allocation, such as more reliable scheduling for delivery personnel.
Enhance Customer Satisfaction: Reliable ETAs can improve the experience of the customer by reducing uncertainty of the waiting time. This will also reduce the chances of delivery cancellation.
Optimize Operational Costs: If the model can predict scenarios with longer delays, additional resources like more riders or prioritizing some orders can be allocated.

How Machine Learning Helps?

Enhanced Customer Experience

Customer Satisfaction and Retention: Delivering on promised ETAs is critical to customer satisfaction. When customers know exactly when their order will arrive, they can plan accordingly, reducing their anxiety over wait times. Reliable ETAs mean customers aren’t left guessing or feeling frustrated, which increases trust and brand loyalty. A happy customer is more likely to become a repeat customer, directly impacting revenue growth through increased retention.
Improved Transparency: Clear, accurate ETAs improve transparency, a factor increasingly valued by customers. If a delay is predicted due to conditions like traffic or weather, proactive updates to ETAs reassure customers that they are informed. This builds trust and creates a positive brand image, especially important in a crowded food delivery market.
Reduced Customer Support Load: Accurate predictions minimize delays and the subsequent customer service calls and complaints that typically follow. Fewer service interactions reduce operational costs associated with customer support, allowing the business to reinvest these resources into growth initiatives or new customer service features, like a rewards program or app improvements.

Operational Efficiency for Delivery Management

Better Resource Utilization: With predictive insights on delivery times, the dispatch team can plan routes and schedules more effectively, avoiding high-traffic times or preparing for weather disruptions. For instance, if peak traffic density is predicted, drivers could be assigned specific zones, ensuring they can make more deliveries in a shorter timeframe. This reduces idle times, maximizes the time each driver spends on productive tasks, and makes the fleet more efficient overall.
Dynamic Allocation: When anticipated delays are flagged in advance, dispatch teams can make real-time decisions, such as reassigning drivers, adjusting routes, or rescheduling specific deliveries to ensure on-time arrivals for high-priority orders. This adaptability in resource management ensures smoother operations and helps the business consistently meet or exceed delivery expectations, even during busy or challenging periods.
Operational Scalability: As the business grows, knowing the expected delivery times allows for planning more delivery zones or micro-fulfillment centers in strategic locations. This can help scale operations smoothly, ensuring the business meets increased demand without overburdening resources.

Financial Optimization

Cost Control through Efficiency: Accurate delivery time predictions enable better route planning, which reduces fuel costs, overtime pay, and wear on vehicles. This kind of operational efficiency keeps delivery costs manageable and reduces the cost-per-delivery, which has a direct impact on profitability.
Minimized Compensation Costs: Predicting delays allows the business to manage customer expectations in advance, avoiding the need for compensations (e.g., refunds, discounts on next orders) due to late deliveries. Lowering such costs contributes to higher profit margins.
Increased Customer Lifetime Value (CLV): A happy customer who receives reliable ETAs is more likely to order again and again. Over time, this leads to increased customer lifetime value, as each repeat order adds to revenue without the acquisition cost associated with new customers.

Strategic Decision-Making

Peak and Off-Peak Planning: The model can identify patterns related to demand peaks, like specific times, holidays, or weather conditions that impact delivery time. This information allows the business to prepare, such as increasing driver availability during expected peaks and adjusting promotions or delivery charges in response to anticipated conditions. Properly managing high-demand periods ensures smooth service and maintains customer satisfaction during peak times.
Operational Flexibility: When data highlights challenging delivery periods, such as during festivals or high-traffic hours, businesses can implement strategies like surge pricing or limit orders to high-capacity areas only. This flexibility helps avoid bottlenecks, ensuring that resources are stretched only to manageable levels.
Data-Driven Expansion Plans: Insights from delivery time predictions guide strategic planning for geographic expansion or new fulfillment centers. If certain locations consistently show high delivery times, it might signal a need for a local hub to streamline deliveries, improving service without straining existing resources.

Summary

Overall

Increase customer satisfaction as customers can plan using the ETA of their orders.
Increases customer trust in company. Clear, accurate ETAs improve transparency, a factor increasingly valued by customers. If a delay is predicted due to conditions like traffic or weather, proactive updates to ETAs reassure customers that they are informed.
Accurate time predictions reduces chances of cancelled orders.
Increased transparency can help in lower customer service calls which eases up traffic of complaints those are time related.
The dispatch team for riders can plan routes and manpower accordingly to serve customers on time.
They can focus on hotspots in the city which have increased orders at certain time of day, month, year.
Can help company implement surge pricing in extreme weather or congestion events.

Riders

Riders can plan pickups and deliveries accordingly.
They have a foresight of time taken for delivery so can manage multiple orders along the same route.
Can help in route planning in case of traffic congestions.
Can do faster deliveries and limit wait times to increase number of deliveries per day which increases their earning potential.
Drivers do not have to rush or do risky driving during high rush hours as their delivery times are in synchrony with the on ground situation which gives them peace of mind and reduces the chances of unnecessary cancellations and do not impact their ratings.
Can opt for other providers when demand is less to increase their earnings.
Can tackle multiple deliveries.

Restaurants

They can prioritize their orders if delivery times are available.
They can manage staff to balance out between in house orders vs home deliveries.
They can prioritize their orders if delivery times are available.
They can manage staff to balance out between in house orders vs. home deliveries.
They can scale up staff and resources during events of increased demands.
Company can also leverage discounts and coupons to increase demand during off peak hours which results in continuous revenue generation.

What Metrics to Use?

There are two metrics well suited for this problem: root mean squared error (RMSE) and mean absolute error (MAE). These are the most appropriate metrics for most regression problems. An added advantage of using these for our problem is that since our stakeholders are the customers, riders, and the restaurants, we want the unit of the metrics to be the same as the output (which is time in minutes). These two metrics satisfy this property.

Say there are a total of $n$ observations in the training data. Further, say the ground truths of these observations are denoted by $y_1$, $y_2$, …, $y_n$; whereas their predicted counterparts are denoted by $\hat{y}_1$, $\hat{y}_2$, …, $\hat{y}_n$. Then, RMSE is given by

\[\begin{equation}\label{eq:rmse} \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \end{equation}\]

and MAE is given by

\[\begin{equation}\label{eq:mae} \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left\lvert y_i - \hat{y}_i \right\rvert. \end{equation}\]

Now the issue with our data is that due to certain events like extreme weather, holidays, traffic conditions, etc., the delivery time increases significantly and the points become outliers. Hence we want to use a metric that is robust to outliers. As RMSE squares the difference between the ground truth and the corresponding predictions, it penalizes the outliers, making the model give incorrect predictions even for non-outlier points. Hence, we will stick to MAE, which is more robust to outliers as it does not square the difference between the ground truth and the corresponding predictions.

Improvements in Business

Let us now see how creating this model will improve business metrics.

Customer Satisfaction Score (CSAT)

Impact: With accurate ETAs, customers feel more informed and are less likely to experience frustration from delays or inaccurate delivery windows.
Measurement: Customer satisfaction surveys or ratings post-delivery. A rise in CSAT often indicates positive customer experiences.

Customer Retention Rate

Impact: Satisfied customers are more likely to reorder, especially if the delivery experience meets or exceeds expectations consistently.
Measurement: Percentage of customers who continue ordering after their initial experiences. Improved ETAs help build customer trust and encourage repeat business.

Average Order Value (AOV)

Impact: When customers trust the delivery process, they may place larger or more frequent orders, especially during promotions or peak times.
Measurement: Monitoring trends in order value over time can indicate that enhanced delivery service is encouraging higher spending.

Delivery Success Rate

Impact: Reduced late deliveries and minimized cancellation rates can increase the proportion of successful deliveries.
Measurement: Percentage of orders successfully delivered within the promised time frame. Lowering cancellation rates or delays directly impacts delivery success.

Operational Efficiency Metrics

Driver Utilization Rate: Predictive ETAs allow for better routing and delivery clustering, optimizing each driver’s workload.
Delivery per Hour: Accurate predictions reduce idle time and allow drivers to complete more orders per shift.
Cost per Delivery: Fewer delays and optimized routing reduce fuel costs, time on the road, and labour costs per delivery.

Customer Support Cost

Impact: Fewer customers will call support for delivery updates if they have a reliable ETA. This reduces operational costs tied to customer service.
Measurement: Reduced volume of time-related inquiries and complaints means lower customer support costs, indicating a positive impact from reliable delivery predictions.

Order Cancellation Rate

Impact: With accurate ETAs, customers are less likely to cancel orders due to long or uncertain wait times.
Measurement: The percentage of orders cancelled by customers can drop with improved delivery time accuracy.

Flow of the Project

The rough flow of this project will be the following:

Detailed EDA:
- Check for missing values and duplicates,
- Check for multicollinearity,
- Check for correlation,
- Check the distribution of variables,
- Check the distribution of the target,
- Check for the presence of outliers in the target,
- Feature selection.
Experimentation:
- Finding the best model using Optuna,
- Tuning the hyperparameters of the best model using Optuna.
MLOps:
- Experiment tracking using MLFLow,
- Creating a DVC pipeline,
- Containerization using Docker.

Our goal will be to make an API service that serves our machine learning model. We will create an app that will hit this API service, which will predict the delivery time and give it back to the app. The API service will get input data from multiple sources. For instance, it will get get some data from the rider, like rider’s location, rider’s vehicle type, rider’s rating; some data from the customer like delivery location, etc. We can also get other data, like weather, traffic, etc., from some other APIs.

We will use FastAPI to build this API service and test it through Swagger. We will containerize this app using Docker, use continuous integration (CI) for testing, and use continuous delivery (CD) for deployment on AWS. We will use an auto-scaling group of EC2 instances along with a load balancer for deployment. Once it is deployed, we will hit this service using Postman and stress test the load balancer and check if it is able to handle all the traffic properly and if the auto-scaling groups are performing correctly.

Data Cleaning

The data is already messy. So the first this messy data was cleaned to perform EDA, and based on this EDA, further data cleaning was performed. So data cleaning and EDA was carried out recursively and iteratively. Further, we also developed some new features. Let’s go through it one by one.

Basic Data Exploration

Basic data exploration refers to loading the data and just observing a few rows from it. The columns present in the data are mentioned in Table 1. Some notable observations are the following:

The data has 45,593 rows and 20 columns.
The entries in the Delivery_person_ID are of the form "INDORES13DEL02", "BANGRES18DEL02", "COIMBRES13DEL02", "CHENRES12DEL01", etc. We can see that the first 4 characters stand for the city in which the rider is delivering the food. So, we can build a new feature for city using this column.
Next, we have 4 location columns corresponding to the latitude and longitude coordinates of the restaurant and the delivery location. Using them we can find the distance between the restaurant and the delivery location which may help predict the delivery time.
Next, we have a column called Order_Date. Using this we can build multiple features, e.g, day of week, whether it is a weekend or not, etc., which can help predict the delivery time.
Next, we have two time columns, namely, Time_Orderd and Time_Order_picked. By finding their difference, we can find the time it took for the rider to pick the order up. This may again be a useful feature.
Next, we have a column called Weatherconditions that has values like "conditions Sunny", "conditions Stormy", etc. We can clearly see the redundant substring "conditions " in the entries, which can be cleaned.
Finally, the target column Time_taken(min) has entries like "(min) 24", "(min) 33", etc. The two problems here are the fact that it is a string, and the redundant substring of "(min) " attached to it as prefix. Both these problems need to be fixed.

At first glance, the following features may have some predictive power to predict the delivery time:

Delivery_person_Ratings: A rider with higher rating may deliver the food sooner as compared to the a rider with lower rating.
Latitude and longitude coordinates for the restaurant and delivery location: A delivery involving food from a restaurant that is farther away from the delivery location may take longer as compared to a delivery involving food from a restaurant that is closer.
Order_Date: A delivery ordered on weekends may take lesser time due to lower traffic as compared to a delivery ordered on weekdays.
Time_Ordered and Time_Order_picked: A delivery in which the rider takes longer time to pick up from the restaurant may take longer to deliver.
Weatherconditions: Extreme weather events, e.g., stormy, sandstorms, etc., may result in longer delivery times.
Road_traffic_density: Higher road traffic may imply longer delivery times.
Vehicle_condition: A better vehicle may take lesser time for delivery.
Type_of_vehicle: Vehicles like bicycle may result in longer deliveries.
multiple_deliveries: Orders in which riders deliver multiple deliveries in a single trip may take longer to get delivered.
Festival: Orders on a festival may take longer to get delivered as more people order outside food during festivals.
City: Food ordered in more developed cities with extensive road network may take lesser time to get delivered.

All these are just speculations right now. We will verify them during EDA. There are also issues with the data types of some columns. To be precise, the columns Delivery_person_Age, Delivery_person_Ratings, Vehicle_condition, multiple_deliveries, and Time_taken(min) are of the type object. We will have to change them to numerical type. We also need to change the data type of all the datetime columns to datetime as they are currently of type object.

Missing Values and Stripping Whitespaces

At first glance, we simply run the following code:

df.isna().sum()

The result suggested that there are no missing values in the data. However, after inspecting the data closely by having a closer look at a random sample of 50 rows of the data using the code

df.sample(50)

we realised that there are missing values, but they are included in some other format. For instance, missing values in the column Delivery_person_Age (which is currently of the type object) has missing values denoted using the string "NaN ". The same is the situation with the columns Delivery_person_Ratings, datetime columns, Weatherconditions, etc. Further, having a look at these individual values also showed that many of the entries in the columns of object datatype have leading and trailing whitespaces. So, we first removed these whitespaces using the following code:

object_type_columns = df.select_dtypes(include=["object"]).columns
df[object_type_columns] = df[object_type_columns].map(lambda x: x.strip() if isinstance(x, str) else x)

Next, we replaced all the strings "NaN" with np.nan.

df[object_type_columns] = df[object_type_columns].map(lambda x: np.nan if isinstance(x, str) and "NaN" in x else x)

Next, running the code

df.isna().sum()

gave us the output shown in Table 2.

Column	# missing values
`ID`	0
`Delivery_person_ID`	0
`Delivery_person_Age`	1854
`Delivery_person_Ratings`	1908
`Restaurant_latitude`	0
`Restaurant_longitude`	0
`Delivery_location_latitude`	0
`Delivery_location_longitude`	0
`Order_Date`	0
`Time_Orderd`	1731
`Time_Order_picked`	0
`Weatherconditions`	616
`Road_traffic_density`	601
`Vehicle_condition`	0
`Type_of_order`	0
`Type_of_vehicle`	0
`multiple_deliveries`	993
`Festival`	228
`City`	1200
`Time_taken(min)`	0

Table 2: Number of missing values in each column.

The total number of missing values in the data is 9,131. Note that this is not the number of rows with missing values, but the number of missing values in the data.

Visualizing Missing Values

We now visualize the missing values using the missingno library.

Matrix of Missing Values

The matrix of missing values is shown in Figure 1.

Count of Missing Values

The bar plot of the count of missing values is shown in Figure 2.

Correlation of Missingness

The correlation heat map of missing values that indicates whether the missingness is correlated or not is shown in Figure 3.

Dendrogram of Missing Values

The dendrogram of missing values is shown in Figure 4.

Observations

The observations from visualizing the missing values are the following:

The missingness in the delivery person related columns, i.e., Delivery_person_Age and Delivery_person_Ratings, is highly correlated.
The missingness in the Time_Ordered column is also highly correlated with the missingness of the delivery person columns. This might be due to some network error where the system was unable to log the delivery person details and the time of order.
The missingness of the Road_traffic_density column is also highly correlated with the missingness of the Weatherconditions column.
The missingness of the Road_traffic_density column is also highly correlated with the missingness of the delivery person columns. This might be because the road traffic density may be recorded using the delivery person’s phone. We need to investigate more on this.
Clearly, the values are missing not at random.

After running the following code

pct = (df.isna().any(axis=1).sum() / df.shape[0]) * 100
print(f"{round(pct, 2)}% rows in the data have missing values.")

we get that there are about 9.27% of the rows in the data have missing values.

Basic Data Cleaning

Stripping Leading and Trailing Whitespaces

We have already mentioned about this. But just for completeness, I will repeat it here. Most values in the columns of the object datatype have a leading or a trailing whitespace. We removed them using the following code:

object_type_columns = df.select_dtypes(include=["object"]).columns
df[object_type_columns] = df[object_type_columns].map(lambda x: x.strip() if isinstance(x, str) else x)

Converting String Encoded NaN values to Actual NaN values

As mentioned earlier, the missing values in the columns of the object datatype are encoded in the string representation of "NaN". We convert it to np.nan using the following code:

df[object_type_columns] = df[object_type_columns].map(lambda x: np.nan if isinstance(x, str) and "NaN" in x else x)

Renaming the Columns

The column names are too verbose. We rename them to have more succinct and meaningful names using the following function:

def change_column_names(df: pd.DataFrame):
    rename_mapping = {
        "delivery_person_id": "rider_id",
        "delivery_person_age": "age",
        "delivery_person_ratings": "ratings",
        "delivery_location_latitude": "delivery_latitude",
        "delivery_location_longitude": "delivery_longitude",
        "time_orderd": "order_time",
        "time_order_picked": "order_picked_time",
        "weatherconditions": "weather",
        "road_traffic_density": "traffic",
        "city": "city_type",
        "time_taken(min)": "time_taken"
    }
    df = df.rename(str.lower, axis=1).rename(rename_mapping, axis=1)
    return df

df = change_column_names(df)

Removing Duplicates

We check if the data has any duplicate rows using the following code:

df.duplicated().sum()

Fortunately, there are no duplicate rows in the data. If there were, we would have just kept their first occurrence.

Column-wise Data Cleaning

We will now discuss what data cleaning steps should be applied on each of the columns. At the end, we will create a function that applies all these steps on the data one by one.

`id`

This column is of the object datatype, and is actually unique for each row. We found this by running the following code:

print(f"The number of unique IDs are {df['id'].nunique()}, and the number of rows are {rows}.")

and we got the following output:

The number of unique IDs are 45593, and the number of rows are 45593.

So, we can simply drop this column as it will have no predictive power.

`rider_id`

This column is of the object datatype. We ran the following code to check the number of unique values in this column:

print(f"The number of unique rider IDs are {df['rider_id'].nunique()}, and the number of rows are {rows}.")

The output we got is

The number of unique rider IDs are 1320, and the number of rows are 45593.

This means that there are 1,320 unique riders in the data.

The first few rider_id values are the following: "INDORES13DEL02", "BANGRES18DEL02", "COIMBRES13DEL02", "CHENRES12DEL01", etc. Clearly, the substring that comes before "RES..." stands for the city name of the riders. So, we can create a new column called city_name by splitting the rider_id column on "RES" in the following way:

df["rider_id"].str.split("RES").str.get(0).rename("city_name")

`age`

This column right now is of the object datatype. We first convert it to float datatype using the following code:

df["age"] = df["age"].astype("float")

Next, running the describe function on it gives us the following output:

count    43739.000000
mean        29.567137
std          5.815155
min         15.000000
25%         25.000000
50%         30.000000
75%         35.000000
max         50.000000
Name: age, dtype: float64

We can see that the youngest rider is of age 15, which is concerning. We can investigate further the type of transport they are using.

Boxplot

Figure 5 shows the boxplot of this column age.

We can see that the distribution looks slightly right-skewed with no outliers.

Minors

We will now investigate all the riders that are minors, i.e., riders younger than 18 years of age. We create a subset (that contains minors) of our original data using the following code:

minors = df.loc[df["age"] < 18]

There are a total of 38 observations in the data corresponding to minors. Viewing these observations showed us some concerning points, which are the following:

All the data corresponding to minors have missing values in the columns order_time, order_picked_time, weather, and traffic.
Some values in the column corresponding to latitude and longitude had negative values, which does not make sense as India is above the equator and to the east of the prime meridian.
The rating of all minor riders is the worst, i.e., 1.
The condition of vehicles they use is the worst, i.e., 3.
All the minor riders are aged 15, which is below the permissible age to drive a vehicle.

Right now, it seems like removing this data makes more sense and is also easier than fixing it. This is because they have many missing values and faulty data. As these are just 38 rows in the original data, removing them won’t affect the performance of the model significantly.

`ratings`

We first change the datatype of this column from object to float. Next, running the describe function on this column gives the following output:

count    43685.000000
mean         4.633780
std          0.334716
min          1.000000
25%          4.500000
50%          4.700000
75%          4.900000
max          6.000000

Further, Figure 6 shows the boxplot of this column.

Observations

Minors have 1-star ratings, which, looking at the box plot, seem like anomalies.
The 6-star ratings are not possible. Hence, these are also anomalies.
Both these ratings need to be investigated. If the data is problematic, then fixing or removing these rows are our options.

Six-Star Ratings

We created a subset of the data with six-star rating using the following code:

six_star_ratings = df.loc[df["ratings"] == 6]

There are a total of 53 such rows. Observing these rows, we saw that they also suffer from the same problems that the rows corresponding to the minors did, i.e., some of the latitude and longitude coordinates being negative, and all values missing in the columns order_time, weather, and traffic. We need to remove such rows.

Location Columns

The location columns are restaurant_latitude, restaurant_longitude, delivery_latitude, and delivery_longitude. The datatype of all these columns is already correct, i.e., float. We now run the describe function on only these location columns. The output we get is shown in Table 3.

Statistic	Restaurant Latitude	Restaurant Longitude	Delivery Latitude	Delivery Longitude
count	45593.000000	45593.000000	45593.000000	45593.000000
mean	17.017729	70.231332	17.465186	70.845702
std	8.185109	22.883647	7.335122	21.118812
min	-30.905562	-88.366217	0.010000	0.010000
25%	12.933284	73.170000	12.988453	73.280000
50%	18.546947	75.898497	18.633934	76.002574
75%	22.728163	78.044095	22.785049	78.107044
max	30.914057	88.433452	31.054057	88.563452

Table 3: Summary statistics for the location columns.

To visualize this distribution better, we plot the box plot of the location columns which is shown in Figure 7.

**Figure 7:** Boxplot of location columns.

As India lies to the north of the equator between 6° 44’ N to 35° 30’ N, and between 68° 7’ E to 97° 25’ E, the valid values of latitudes and longitudes are only the ones that fall in this range. So, there are some faulty values of latitudes and longitudes in the data. We can see that only the values near the minimum are faulty. The maximum values already lie in the required range. So, we form a subset of the data with these incorrect coordinates using the following code:

lower_bound_lat = 6.44
lower_bound_long = 68.70

incorrect_locations = df.loc[
    (df['restaurant_latitude'] < lower_bound_lat) |
    (df['restaurant_longitude'] < lower_bound_long) |
    (df['delivery_latitude'] < lower_bound_lat) |
    (df['delivery_longitude'] < lower_bound_long)
]

We now plot the boxplot of these incorrect location coordinates, which is shown in Figure 8.

**Figure 8:** Boxplot of incorrect location columns.

Looking at this box plot, it seems that the negative values may be incorrect only because of their signs. If they were positive, they would lie in the required range. There are also zeros in these columns. One way of handling the zeros is to replace them with np.nan, and then impute them with advanced imputation techniques that use the remaining columns.

Fixing Incorrect Location Columns

We will fix the incorrect location columns by just taking their absolute values. This will convert all the negative coordinates to positive. This will make sure that they lie in the proper latitude and longitude bounds of India. The boxplot of the location columns after taking the absolute value is shown in Figure 9.

**Figure 9:** Boxplot of location columns after taking the absolute value.

This looks good. Next, we also replace the zeros with np.nan. These zeros also lie outside the geographical bound of India.

We create the following common function that fixes all the latitude and longitude columns:

def clean_lat_lon(df: pd.DataFrame, threshold=1):
    df = df.copy()
    location_columns = df.columns[4:8]
    df[location_columns] = df[location_columns].abs()
    df[location_columns] = df[location_columns].map(lambda x: x if x >= threshold else np.nan)
    return df

As we replaced the zeros with np.nan’s, this introduces 3,640 new missing values in the data.

Creating `distance` using Location Columns

We create a new column called distance using these location columns. This stands for the (haversine) distance between the restaurant and the delivery location. We use the following function to create this column:

def calculate_haversine_distance(df):
    location_columns = location_subset.columns.tolist()
    lat1 = df[location_columns[0]]
    lon1 = df[location_columns[1]]
    lat2 = df[location_columns[2]]
    lon2 = df[location_columns[3]]

    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))

    distance = 6371 * c

    return (
      df.assign(
        distance = distance
      )
    )

Once we create this distance column, we will drop the latitude and longitude columns as they have missing values and there is no way of imputing them since we also have information about cities available. We can, however, impute the distance column using some advanced imputation techniques that uses the other columns.

`order_date`

This column is of the datatype object. We first convert it to the datetime format using the following code:

df["order_date"] = pd.to_datetime(df["order_date"], dayfirst=True)

As seen earlier, this column fortunately does not have any missing values. Next, we use the following function to create new datetime features using this column:

def extract_datetime_features(series: pd.Series):
    date_col = pd.to_datetime(series, dayfirst=True)
    data = {
        "day": date_col.dt.day,
        "month": date_col.dt.month,
        "year": date_col.dt.year,
        "day_of_week": date_col.dt.day_name(),
        "is_weekend": date_col.dt.day_name().isin(["Saturday", "Sunday"]).astype(int)
    }
    df = pd.DataFrame(data)
    return df

This gives us five new features, which are the following:

day: This is the day of the month.
year: This denotes the year.
month: This is the month (numerical).
day_of_week: This is a string, e.g., "Monday".
is_weekend: This is a Boolean column which is 1 if the day is a weekend, and 0 if it isn’t.

We notice that the starting date in the data is 2022-02-11 and the end date is 2022-04-06. So the year is the same for all the rows. Hence, we will drop the year column later.

`order_time` and `order_picked_time`

We first drop the string "NaN", then convert both these columns to datetime, and finally extract the feature called time_of_day that contains the strings "morning", "afternoon", "evening", and "night", using the following function:

def extract_time_of_day(series: pd.Series):
    hr = pd.to_datetime(series, format="%H:%M:%S", errors="coerce").dt.hour
    condlist = [
        (hr.between(6, 12, inclusive="left")),
        (hr.between(12, 17, inclusive="left")),
        (hr.between(17, 20, inclusive="left")),
        (hr.between(20, 24, inclusive="left")),
    ]
    choicelist = ["morning", "afternoon", "evening", "night"]
    default = "after_midnight"
    time_of_day_info = pd.Series(np.select(condlist, choicelist, default=default), index=series.index)
    return time_of_day_info.where(series.notna(), np.nan)

Further, we also extract the features pickup_time_minutes by subtracting the values in order_time from the values in order_picked_time in minutes, and order_time_hour.

`weather`

The datatype of this column is object, which is fine. To find the type of weathers available, we run the following code:

df["weather"].value_counts()

The output we get is the following:

weather
conditions Fog           7654
conditions Stormy        7586
conditions Cloudy        7536
conditions Sandstorms    7495
conditions Windy         7422
conditions Sunny         7284
conditions NaN            616
Name: count, dtype: int64

We remove the redundant substring "conditions " from these values and replace the string "NaN" with np.nan using the following code:

df["weather"] = df["weather"].str.replace("conditions ", "")
df["weather"].replace("NaN", np.nan).isna().sum()

`traffic`

The datatype of this column is object, which is fine. Running the following code:

df["traffic"].value_counts()

gives the following output:

traffic
Low       15477
Jam       14143
Medium    10947
High       4425
NaN         601
Name: count, dtype: int64

Again, we replace the string "NaN" with np.nan.

`vehicle_condition`

The datatype of this column is integer, which is fine. Doing value_counts on it gives

vehicle_condition
2    15034
1    15030
0    15009
3      520
Name: count, dtype: int64

This column already seems clean.

`type_of_order`

This is an object, which is fine. Doing value_counts on it gives us

type_of_order
Snack     11533
Meal      11458
Drinks    11322
Buffet    11280
Name: count, dtype: int64

This column also already seems clean.

`type_of_vehicle`

This is an object, which is fine. Doing value_counts on it gives

type_of_vehicle
motorcycle          26435
scooter             15276
electric_scooter     3814
bicycle                68
Name: count, dtype: int64

We can see that there are very few instances of bicycles.

`multiple_deliveries`

This is an object, which is incorrect. It should have been an integer. Before converting it to integer, we do value_counts on it to get

multiple_deliveries
1      28159
0      14095
2       1985
NaN      993
3        361
Name: count, dtype: int64

Next, we convert the string "NaN" to np.nan and convert the datatype to float.

`festival`

This is an object, which is fine. Doing value_counts on it gives:

festival
No     44469
Yes      896
NaN      228
Name: count, dtype: int64

Again, we replace "NaN" with np.nan.

`city_type`

This is an object, which is fine. Doing value_counts on it gives:

city_type
Metropolitian    34093
Urban            10136
NaN               1200
Semi-Urban         164
Name: count, dtype: int64

Again, we convert "NaN" to np.nan.

`time_taken` (The Target)

This is an object, which is incorrect. It consists of values like "(min) 24", "(min) 33", etc. We remove the redundant substring of "(min) " from it and convert it to integer using the following code:

df["time_taken"] = df["time_taken"].str.replace("(min) ", "").astype(int)

Function for Data Cleaning

Now, we will create a single function which performs this entire data cleaning on the raw data. We need to carry out the following steps:

Remove the trailing spaces from all the object-type columns.
Make all the values in the object-type columns lower case.
Change the string "NaN" and "conditions NaN" to actual np.nan.
Shorten all the column names and change them also to lower case.
As you go ahead with cleaning each column, change its data type to the appropriate one if applicable.
Drop the column id.
Drop the rows corresponding to minors.
Drop the rows corresponding to 6-star ratings.
Create a new column called city_name.
Take absolute value of all the location columns.
Replace the values of the locations below the threshold (1) with np.nan.
Calculate haversine distance using the location columns and save it in a new column named distance.
Calculate a binned version of this distance and call it distance_type.
Convert all the appropriate object-type columns to datetime.
Create the columns order_day, order_month, order_day_of_week, is_weekend, pickup_time_minutes, order_time_hour, and order_time_of_day.
Remove "(min) " from all the entries of the target time_taken, and then convert it to integer.

The function is the following:

def clean_data(df: pd.DataFrame):
    df = df.copy()

    # Removing trailing spaces from all the object-type columns
    object_type_cols = df.select_dtypes(include=['object']).columns
    for col in object_type_cols:
        df[col] = df[col].str.strip()

    # Changing all the values to lower case
    for col in object_type_cols:
        if col not in ["Delivery_person_ID"]:
            df[col] = df[col].str.lower()

    # Changing the string `"nan"` and `"conditions nan"` to `np.nan`
    df[object_type_cols] = df[object_type_cols].map(lambda x: np.nan if isinstance(x, str) and "nan" in x else x)

    # Shortening the column names and changing them to lower case
    rename_mapping = {
        "delivery_person_id": "rider_id",
        "delivery_person_age": "age",
        "delivery_person_ratings": "ratings",
        "delivery_location_latitude": "delivery_latitude",
        "delivery_location_longitude": "delivery_longitude",
        "time_orderd": "order_time",
        "time_order_picked": "order_picked_time",
        "weatherconditions": "weather",
        "road_traffic_density": "traffic",
        "city": "city_type",
        "time_taken(min)": "time_taken"
    }
    df = df.rename(str.lower, axis=1).rename(rename_mapping, axis=1)

    # Dropping the column `id`
    df.drop(columns=["id"], inplace=True)

    # Dropping minors

    ## First converting the `age` column to numeric
    df["age"] = df["age"].astype("float")
    minors = df.loc[df["age"] < 18]
    minors_idxs = minors.index
    df.drop(minors_idxs, inplace=True)

    # Dropping 6-star ratings

    ## First converting the `ratings` column to numeric
    df["ratings"] = df["ratings"].astype(float)
    six_star_ratings = df.loc[df["ratings"] == 6]
    six_star_ratings_idxs = six_star_ratings.index
    df.drop(six_star_ratings_idxs, inplace=True)

    # Creating a new column `city_name` using the column `rider_id`
    df["city_name"] = df["rider_id"].str.split("RES").str.get(0)

    # Cleaning location columns
    location_columns = df.columns[3:7]
    df[location_columns] = df[location_columns].abs()
    df[location_columns] = df[location_columns].map(lambda x: x if x >= 1 else np.nan)

    # Creating a new column called `distance` (haversine distance)
    lat1 = df[location_columns[0]]
    lon1 = df[location_columns[1]]
    lat2 = df[location_columns[2]]
    lon2 = df[location_columns[3]]

    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))

    distance = 6371 * c

    df["distance"] = distance

    bins = [0, 5, 10, 15, np.inf]
    labels = ["short", "medium", "long", "very_long"]
    df["distance_type"] = pd.cut(df["distance"], bins=bins, right=False, labels=labels)

    # Cleaning the datetime related columns
    df["order_date"] = pd.to_datetime(df["order_date"], dayfirst=True)
    df["order_day"] = df["order_date"].dt.day
    df["order_month"] = df["order_date"].dt.month
    df["order_day_of_week"] = df["order_date"].dt.day_name().str.lower()
    df["order_day_is_weekend"] = df["order_day_of_week"].isin(["saturday", "sunday"]).astype(int)

    df["order_time"] = pd.to_datetime(df["order_time"], format="%H:%M:%S", errors="coerce")
    df["order_picked_time"] = pd.to_datetime(df["order_picked_time"], format="%H:%M:%S")
    df["pickup_time_minutes"] = ((df["order_picked_time"] - df["order_time"]).dt.seconds) / 60
    
    df["order_time_hour"] = df["order_time"].dt.hour
    condlist = [
        (df["order_time_hour"].between(6, 12, inclusive="left")),
        (df["order_time_hour"].between(12, 17, inclusive="left")),
        (df["order_time_hour"].between(17, 20, inclusive="left")),
        (df["order_time_hour"].between(20, 24, inclusive="left")),
    ]
    choicelist = ["morning", "afternoon", "evening", "night"]
    default = "after_midnight"
    time_of_day_info = pd.Series(np.select(condlist, choicelist, default=default), index=df["order_time"].index)
    # order_time_of_day = np.select(condlist=condlist, choicelist=choicelist, default=default)
    df["order_time_of_day"] = time_of_day_info.where(df["order_time"].notna(), np.nan)
    df.drop(columns=["order_time", "order_picked_time"], inplace=True)

    # Cleaning the `weather` column
    df["weather"] = df["weather"].str.replace("conditions ", "")

    # Changing dtype of `multiple_deliveries`
    df["multiple_deliveries"] = df["multiple_deliveries"].astype(float)

    # Cleaning the target column `time_taken`
    df["time_taken"] = df["time_taken"].str.replace("(min) ", "").astype(int)
    return df

This function takes in the raw data as input and returns the cleaned data.

Missing Values after Data Cleaning

Figure 10, Figure 11, and Figure 12 show the missing values matrix, missing value count, and the missing value correlation.

**Figure 10:** Matrix of missing values for visualization after data cleaning.

**Figure 11:** Bar plot of the count of missing values after data cleaning.

**Figure 12:** Correlation heat map of missing values after data cleaning.

Again, the missing values clearly are missing not at random. There is extremely high correlation in their missingness.

Jupyter Notebook

The entire data cleaning was performed in this Jupyter notebook.

EDA

The main problems that we aim to solve while doing EDA are the following:

Feature selection,
Missing data imputation.

We will first clean the raw data using the function clean_data that we built in the data cleaning stage, and then carry out EDA on this cleaned data.

Functions to Perform Analysis

I have created some functions that will perform analysis of a numerical variable, analysis of a categorical and a numerical variable, analysis of a categorical variables, multivariate analysis, $\chi^2$ test (using chi2_contingency from scipy.stats), ANOVA test (using f_oneway from scipy.stats), and normality test (using jarque_bera from scipy.stats). These functions are the following:

def numerical_analysis(data: pd.DataFrame, col_name: str, cat_col=None, bins="auto"):
    # create the figure
    fig = plt.figure(figsize=(15, 10))
    # generate the layout
    grid = GridSpec(nrows=2, ncols=2, figure=fig)
    # set subplots
    ax1 = fig.add_subplot(grid[0, 0])
    ax2 = fig.add_subplot(grid[0, 1])
    ax3 = fig.add_subplot(grid[1, :])
    # kdeplot
    sns.kdeplot(data=data, x=col_name, hue=cat_col, ax=ax1)
    # boxplot
    sns.boxplot(data=data, x=col_name, hue=cat_col, ax=ax2)
    # histogram
    sns.histplot(data=data, x=col_name, bins=bins, hue=cat_col, kde=True, ax=ax3)
    plt.tight_layout()
    plt.show();


def categorical_numerical_analysis(data: pd.DataFrame, cat_col, num_col):
    fig, (ax1, ax2) = plt.subplots(2, 2, figsize=(15, 7.5))
    # barplot
    sns.barplot(data=data, x=cat_col, y=num_col, ax=ax1[0])
    # boxplot
    sns.boxplot(data=data, x=cat_col, y=num_col, ax=ax1[1])
    # violinplot
    sns.violinplot(data=data, x=cat_col, y=num_col, ax=ax2[0])
    # stripplot
    sns.stripplot(data=data, x=cat_col, y=num_col, ax=ax2[1])
    plt.tight_layout()
    plt.show();


def categorical_analysis(data: pd.DataFrame, col_name):
    dic = {
        "Count": data[col_name].value_counts(),
        "Percentage": data[col_name].value_counts(normalize=True).mul(100).round(2).astype("str").add("%")
    }
    df = pd.DataFrame(dic)
    display(df)
    print("#" * 150)
    unique_categories = data[col_name].unique().tolist()
    no_of_categories = data[col_name].nunique()
    print(f"\nThe unique categories in the column '{col_name}' are {unique_categories}.", end="\n\n")
    print("#" * 150)
    print(f"\nThe number of categories in the column '{col_name}' are {no_of_categories}.", end="\n\n")
    sns.countplot(data=data, x=col_name)
    plt.xticks(rotation=45)
    plt.show();


def multivariate_analysis(data: pd.DataFrame, num_col, cat_col_1, cat_col_2):
    fig, (ax1, ax2) = plt.subplots(2, 2, figsize=(15, 7.5))
    # barplot
    sns.barplot(data=data, x=cat_col_1, y=num_col, hue=cat_col_2, ax=ax1[0])
    # boxplot
    sns.boxplot(data=data, x=cat_col_1, y=num_col, hue=cat_col_2, gap=0.1, ax=ax1[1])
    # violin plot
    sns.violinplot(data=data, x=cat_col_1, y=num_col, hue=cat_col_2, gap=0.1, ax=ax2[0])
    # strip plot
    sns.stripplot(data=data, x=cat_col_1, y=num_col, hue=cat_col_2, dodge=True, ax=ax2[1])
    plt.tight_layout()
    plt.show();


def chi2_test(data: pd.DataFrame, col1, col2, alpha=0.05):
    data = data.loc[:, [col1, col2]].dropna()
    # contingency table
    contingency_table = pd.crosstab(data[col1], data[col2])
    # chi2 test
    _, p_val, _, _ = chi2_contingency(contingency_table)
    print(f"p-value: {p_val}.")
    if p_val <= alpha:
        print(f"Reject H0. The data indicates significant association between '{col1}' and '{col2}'.")
    else:
        print(f"Not enough evidence to reject H0. The data indicates no significant association between '{col1}' and '{col2}'.")


def anova_test(data: pd.DataFrame, num_col, cat_col, alpha=0.05):
    data = data.loc[:, [num_col, cat_col]].dropna()
    cat_group = data.groupby(cat_col)
    groups = [group[num_col].values for _, group in cat_group]
    f_statistic, p_val = f_oneway(*groups)
    print(f"p-value: {p_val}.")
    if p_val <= alpha:
        print(f"Reject H0. The data indicates significant relationship between '{num_col}' and '{cat_col}'.")
    else:
        print(f"Not enough evidence to reject H0. The data indicates no significant relationship between '{num_col}' and '{cat_col}'.")


def normality_test(data: pd.DataFrame, col_name, alpha=0.05):
    data = data[col_name]
    print("Jarque Bera test for normality:")
    _, p_val = jarque_bera(data)
    print(f"p-value: {p_val}.")
    if p_val <= alpha:
        print(f"Reject H0. The column '{col_name}' IS NOT normally distributed.")
    else:
        print(f"Not enough evidence to reject H0. The column '{col_name}' IS normally distributed.")

These functions do the following:

numerical_analysis plots the kernel density estimation plot, the box plot, and the box plot of a given numerical variable,
categorical_numerical_analysis plots a bar plot, the box plot, the violin plot, and the strip plot for a particular given pair of categorical and numerical variable
categorical_analysis returns a pandas.DataFrame with the value counts and the percentage of each category, prints the name of unique categories, and plots a count plot of the categories
multivariate_analysis pots a bar plot, box plot, violin plot, and a strip plot with all of them having the first categorical column on the horizontal axis, the numerical column in the vertical axis, and the color encoding as the second categorical column
chi2_test performs the $\chi^2$ test between two categorical features to find if there is an association between them; anova_test performs the ANOVA test between a numerical column and a categorical column to check if they are associated
normality_test performs the Jarque–Bera test to test whether a given numerical column is normally distributed.

Column-wise EDA

`time_taken` (Target)

Running numerical_analysis on the column time_taken (the target) gives us the plots shown in Figure 13.

Result of running numerical_analysis on the target column. — **Figure 13:** Result of running `numerical_analysis` on the target column.

We can observe the following:

The target column is not fully continuous in nature.
It shows dual modality with two peaks: one around the 17-18 mark, and the other around the 26-27 mark.
It has very few extreme points. They cannot be thought of outliers as they are just extreme and rare. A delivery time of 50 mins. is possible in certain rare circumstances.

Figure 14 shows the Q-Q plot of the target variable.

The tails differ significantly from the normal distribution. Even running the normality_test function gave the output that this column, i.e., the target, is not normally distributed. We also found that all the observations for which the target column is high, the traffic condition is either high or jam, which is expected. However, we found that weather didn’t affect this column much as the distribution of weather types were mostly uniformly distributed even for higher values of time_taken, i.e., the target. Further, we also found that deliveries that lake longer time also have larger distance associated with them, which is again expected.

Transforming `time_taken` using Yeo-Johnson Transformation

We tried using the Yeo-Johnson transformation on the column time_taken with the intention to make it more normal looking. Figure 15 shows the plot given by the numerical_analysis function on this column after the transformation.

Result of running numerical_analysis on the target column after taking the Yeo-Johnson transformation. — **Figure 15:** Result of running `numerical_analysis` on the target column after taking the Yeo-Johnson transformation.

As we can see, the transformation did make it look more normal, however the normality_test function still categorized it as being not normal. But it is closer to a normal distribution than before taking the transformation. So, we will do this transformation on the target.

`rider_id`

Our main reasoning behind keeping this column was to check whether it can be used to fill in the missing values in the columns age and ratings. However, after grouping the data by rider_id and then checking the age and the ratings column, we observed that there are many inconsistencies in this column. For instance, many rows have the same rider_id but different values of age and ratings. Hence, this column cannot be used to fill in the missing values in age and ratings. We will drop this column.

`age`

Figure 16 shows the plots generated after running the numerical_analysis function on the column age.

Result of running numerical_analysis on the column age. — **Figure 16:** Result of running `numerical_analysis` on the column `age`.

We can see that age seems to be nearly uniformly distributed between 20 to 39.

Relationship between `time_taken` (target) and `age`

Figure 17 shows a scatter plot between time_taken and age.

We can see that age does not seem to affect the target time_taken. So the delivery time is not dependent on age of the delivery person. We can also see that age is a discrete numerical column.

Relationship between `time_taken`, `age`, and `type_of_vehicle`

Figure 18 shows a scatter plot between time_taken and age with hue as vehicle_condition.

We can see that for higher values of time_taken, i.e., for longer deliveries, generally the vehicles with the best condition (0) are used.

Relationship between `age` and `type_of_vehicle`

Figure 19 shows a strip plot between age and type_of_vehicle.

We can see that there is no preference of type_of_vehicle with age. The reason we do not seem to have any ages for bicycles may be because there are missing values in the age column whenever the type_of_vehicle is bicycle.

`ratings`

Figure 20 shows the plots generated after running the numerical_analysis function on the column ratings.

We can see that this column is left-skewed with most of the ratings as greater than 4.

Relationship between `time_taken` (target) and `ratings`

Figure 21 shows a scatter plot between time_taken and ratings.

We can see that the riders with high ratings are preferred for longer deliveries. Further, riders with high ratings also get more orders. So, higher ratings will give riders more work, and hence more income.

Relationship between `vehicle_condition` and `ratings`

Figure 22 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as vehicle_condition and the numerical column as ratings.

We observe the following:

We can see that the average rating seems to be very similar for the first three vehicle conditions.
However, riders with the best vehicle_condition (0) have ratings of more than 3, whereas riders with the subsequent vehicle_condition also have ratings lower than 3.
The worst vehicle_condition (3) has no data for ratings. This may be because customers may not want to rate riders with such vehicles.

Relationship between `type_of_vehicle` and `ratings`

Figure 23 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as type_of_vehicle and the numerical column as ratings.

We observe that:

The mean ratings for all the type_of_vehicle (except bicycle) seem very similar, however the strip plot indicates that electric scooters mostly have higher ratings as compared to the other type_of_vehicle.
Bicycles do not have ratings because there are NaN values in ratings if type_of_vehicle is bicycle.

Relationship between `festivals` and `ratings`

Figure 24 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as festival and the numerical column as ratings.

We can see that during festivals, the ratings generally go down.

Location Columns

We use the location columns for delivery, i.e., delivery_latitude and delivery_longitude, to plot the locations on an interactive map to visualize where the deliveries are being done. We use plotly.express to do this. Figure 25 shows this plot.

The coordinates delivery_latitude and delivery_longitude plotted on an interactive map. — **Figure 25:** The coordinates `delivery_latitude` and `delivery_longitude` plotted on an interactive map.

`order_date` and Related Columns

The related columns are order_date, order_day, order_month, order_day_of_week, order_day_is_weekend, and festival

Relationship between `time_taken` (target) and `order_day_of_week`

Figure 26 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as order_day_of_week and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as order_day_of_week and the numerical column as time_taken. — **Figure 26:** Result of running `categorical_numerical_analysis` with the categorical column as `order_day_of_week` and the numerical column as `time_taken`.

We can see that on an average, delivery times are slightly lesser on Thursdays.

Relationship between `time_taken` (target) and `order_day_is_weekend`

Figure 27 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as order_day_is_weekend and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as order_day_is_weekend and the numerical column as time_taken. — **Figure 27:** Result of running `categorical_numerical_analysis` with the categorical column as `order_day_is_weekend` and the numerical column as `time_taken`.

The distribution looks very similar. So it doesn’t look like order_day_is_weekend affects the target time_taken.

Relationship between `traffic` and `order_day_is_weekend`

After running the chi2_test function on these two columns, we found that there is not enough evidence in the data to suggest that there is any significant association between them.

Relationship between `time_taken` (target) and `festival`

Figure 28 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as festival and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as festival and the numerical column as time_taken. — **Figure 28:** Result of running `categorical_numerical_analysis` with the categorical column as `festival` and the numerical column as `time_taken`.

We observe the following:

Clearly, if there is a festival, then the delivery time is longer on average.
Also, the range of delivery times is shorter with lesser variation when there is a festival.

Relationship between `festival` and `traffic`

After running the chi2_test function on these two columns, we found that there is enough evidence in the data to suggest that there is a significant association between them. To view this difference, we created a pivot table which is shown in Table 4.

Traffic Density	Festival: No	Festival: Yes
`high`	27.010373	45.826087
`jam`	30.538039	46.093651
`low`	21.284332	42.020000
`medium`	26.550288	43.715385

Table 4: Average delivery time (in minutes) across traffic conditions and festival days.

We can clearly see how time_taken increases with increase in traffic congestion. Also, it seems like during festivals the delivery always takes longer.

Relationship between `festival`, `order_day_is_weekend`, and `time_taken`

Figure 29 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as order_day_is_weekend and festival.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as order_day_is_weekend and festival. — **Figure 29:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `order_day_is_weekend` and `festival`.

We can see that it doesn’t matter whether a festival falls on a weekend. The time_taken is similarly distributed in either case.

`order_time` and Related Columns

The columns related to order_time are order_time_hour, order_time_of_day, and pickup_time_minutes.

Relationship between `order_time_of_day` and `time_taken` (target)

Figure 30 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as order_time_of_day and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as order_time_of_day and the numerical column as time_taken. — **Figure 30:** Result of running `categorical_numerical_analysis` with the categorical column as `order_time_of_day` and the numerical column as `time_taken`.

We can see that deliveries in the morning are the quickest whereas the deliveries in the evening are the longest. Running the function anova_test on these two columns also confirmed that the data indicates significant relationship between them. After using the value_counts function on the order_time_of_day column, we found that most orders are placed at 9 p.m. Figure 31 shows a bar plot of this count.

Bar plot of the count of values in the column order_time_hour. — **Figure 31:** Bar plot of the count of values in the column `order_time_hour`.

Further, Figure 32 shows the same bar plot for the column order_time_of_day.

Bar plot of the count of values in the column order_time_of_day. — **Figure 32:** Bar plot of the count of values in the column `order_time_of_day`.

`pickup_time_minutes`

Relationship between `time_taken` (target) and `pickup_time_minutes`

Figure 33 shows a scatter plot between the target, i.e., time_taken, and pickup_time_minutes.

Scatter plot between time_taken and pickup_time_minutes. — **Figure 33:** Scatter plot between `time_taken` and `pickup_time_minutes`.

We observe the following:

So there is no relationship between pickup_time_minutes and time_taken.
We can also see that pickup_time_minutes can be treated like an ordinal categorical column.

So, running the categorical_analysis funcdtion on this column gives us the bar plot of the count of values in this column, which is shown in Figure 34.

Bar plot of the count of values in the column pickup_time_minutes. — **Figure 34:** Bar plot of the count of values in the column `pickup_time_minutes`.

We can see that the three categories in pickup_time_minutes are almost equally distributed. Next, Figure 35 shows the plots generated after running the categorical_numerical_analysis function with categorical column as pickup_time_minutes and numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as pickup_time_minutes and the numerical column as time_taken. — **Figure 35:** Result of running `categorical_numerical_analysis` with the categorical column as `pickup_time_minutes` and the numerical column as `time_taken`.

Here also we can see that the different categories in pickup_time_minutes have no relation with the target time_taken. This was also confirmed by running the anova_test function on these two columns.

`traffic`

This is a categorical column. Running the function categorical_analysis on this column gives the bar plot of the count of its categories, which is shown in Figure 36.

Most of the times the traffic is either low or is jam.

Relationship between `traffic` and `city_type`, and `traffic` and `city_name`

Running the chi2_test function on traffic and city_type indicated a significant association between them. However, the same test on traffic and city_name indicated no significant association.

Relationship between `time_taken` (target) and `traffic`

Figure 37 shows the plots generated after running the categorical_numerical_analysis function with categorical column as traffic and numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as traffic and the numerical column as time_taken. — **Figure 37:** Result of running `categorical_numerical_analysis` with the categorical column as `traffic` and the numerical column as `time_taken`.

Clearly, time_taken is lower when traffic has the least congestion, and vice versa. The association between these two variables was also confirmed after running the anova_test function on them.

Relationship between `time_taken` (target), `traffic`, and `type_of_vehicle`

Figure 38 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as traffic and type_of_vehicle.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as traffic and type_of_vehicle. — **Figure 38:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `traffic` and `type_of_vehicle`.

We can see that motorcycles take longer time as compared to scooters and electric scooters. Bicycle has no points due to it having missing values in the corresponding traffic column, maybe due to the fact that bicycle people do not use smart phones.

Relationship between `time_taken` (target), `traffic`, and `vehicle_condition`

Figure 39 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as traffic and vehicle_condition.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as traffic and vehicle_condition. — **Figure 39:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `traffic` and `vehicle_condition`.

We can see that vehicles with the best condition have longer time_taken. This may be because the vehicles with the best vehicle_condition are preferred for longer deliveries.

Relationship between `time_taken`, `festival`, and `vehicle_condition`

Figure 40 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as festival and vehicle_condition.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as festival and vehicle_condition. — **Figure 40:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `festival` and `vehicle_condition`.

We can see that during festivals the vehicles with the worst vehicle_condition take longer, maybe due to breakdowns. They are also getting less orders during festivals, as evident from the strip plot (bottom right).

`multiple_deliveries`

This is like a categorical column.

Relationship between `multiple_deliveries` and `time_taken` (target)

Figure 41 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as multiple_deliveries and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as multiple_deliveries and the numerical column as time_taken. — **Figure 41:** Result of running `categorical_numerical_analysis` with the categorical column as `multiple_deliveries` and the numerical column as `time_taken`.

Clearly, more deliveries take longer time. This is also confirmed by running the anova_test function on time_taken and multiple_deliveries.

Relationship between `multiple_deliveries` and `distance`

Figure 42 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as multiple_deliveries and the numerical column as distance.

Result of running categorical_numerical_analysis with the categorical column as multiple_deliveries and the numerical column as distance. — **Figure 42:** Result of running `categorical_numerical_analysis` with the categorical column as `multiple_deliveries` and the numerical column as `distance`.

We can see that with more multiple_deliveries, the distance also increases.

`weather`

This is a categorical column. Running the categorical_analysis function on this gives a bar plot of the count of categories that are shown in Figure 43.

We can see that the categories in weather are uniformly distributed. There is no imbalance.

Relationship between `time_taken` (target) and `weather`

Figure 44 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as weather and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as weather and the numerical column as time_taken. — **Figure 44:** Result of running `categorical_numerical_analysis` with the categorical column as `weather` and the numerical column as `time_taken`.

We can see that time_taken is higher for bad weather. The fact that they are associated was also confirmed after running the anova_test function on them.

Relationship between `weather` and `traffic`

Running the chi2_test function on them revealed no significant association between the two.

Relationship between `weather`, `traffic`, and `time_taken`

Figure 45 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as weather and traffic.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as weather and traffic. — **Figure 45:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `weather` and `traffic`.

We can see that traffic has a similar distribution with respect to time_taken in all weather conditions. So, traffic is a more important feature as compared to weather for predicting time_taken (target). Table 5 shows a pivot table of average value of these three columns.

Weather Condition	High Traffic	Jam Traffic	Low Traffic	Medium Traffic
`cloudy`	28.940860	36.689655	22.208445	28.483134
`fog`	28.426546	36.806916	22.303427	28.044816
`sandstorms`	27.711840	30.018758	20.297049	27.738522
`stormy`	27.845839	29.850194	20.681734	27.680502
`sunny`	23.448980	23.082132	21.449293	20.195518
`windy`	26.972789	30.219056	20.665862	27.888769

Table 5: Average delivery time (in minutes) across traffic conditions and weather types.

`vehicle_condition`

This is a categorical column. Running the categorical_analysis function on it gives the bar plot of count of the categories in this column which is shown in Figure 46.

We can see that the worst vehicle_condition (3) is rare. We will need to handle it.

Relationship between `time_taken` (target) and `vehicle_condition`

Figure 47 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as vehicle_condition and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as vehicle_condition and the numerical column as time_taken. — **Figure 47:** Result of running `categorical_numerical_analysis` with the categorical column as `vehicle_condition` and the numerical column as `time_taken`.

We can see that the vehicles with the best vehicle_condition take longer. This is because such vehicles are preferred for longer deliveries. Their association was also confirmed by running the anova_test function on the two.

`type_of_vehicle`

This is a categorical column. Running the categorical_analysis function on it gives the bar plot of count of the categories in this column which is shown in Figure 48.

We can see that bicycles are extremely rare. We will have to handle it.

Relationship between `time_taken` (target) and `type_of_vehicle`

Figure 49 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as type_of_vehicle and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as type_of_vehicle and the numerical column as time_taken. — **Figure 49:** Result of running `categorical_numerical_analysis` with the categorical column as `type_of_vehicle` and the numerical column as `time_taken`.

We can see that motorcycles are preferred for longer deliveries.

Relationship between `vehicle_condition`, `type_of_vehicle`, and `time_taken` (target)

Figure 50 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as vehicle_condition and type_of_vehicle.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as vehicle_condition and type_of_vehicle. — **Figure 50:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `vehicle_condition` and `type_of_vehicle`.

We can observe the following:

The best vehicle_condition only has motorcycles. The next best vehicle_condition has motorcycles and scooters; the next best has motorcycles, scooters, and electric scooters; and the worst vehicle_condition has all, i.e., motorcycles, scooters, electric scooters, and bicycles.
The delivery times for motorcycles is the highest mostly because they are getting long distance deliveries.

Relationship between `type_of_vehicle` and `vehicle_condition`

Running the chi2_test function on these two columns indicated a significant association between them.

`type_of_order`

This is a categorical column. Running the categorical_analysis function on it gives the bar plot of count of the categories in this column which is shown in Figure 51.

All the type_of_orders are mostly uniformly distributed.

Relationship between `time_taken` (target) and `type_of_order`

Figure 52 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as type_of_order and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as type_of_order and the numerical column as time_taken. — **Figure 52:** Result of running `categorical_numerical_analysis` with the categorical column as `type_of_order` and the numerical column as `time_taken`.

It looks like type_of_order does not affect time_taken. This was also confirmed after running the anova_test function on these two columns.

Relationship between `type_of_order` and `order_day_is_weekend`

Table 6 shows the pivot table between these two columns.

Type of Order	Weekday (0)	Weekend (1)
`buffet`	8238	3023
`drinks`	8130	3164
`meal`	8290	3145
`snack`	8337	3175

Table 6: Distribution of order types across weekdays and weekends.

The proportions are more or less the same. So type_of_order doesn’t seem to depend on order_day_is_weekend.

Relationship between `type_of_order` and `pickup_time_minutes`

Running the chi2_test function on these indicated no significant association.

Relationship between `type_of_order` and `ratings`

Figure 53 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as type_of_order and the numerical column as ratings.

It looks like type_of_order has no effect on ratings.

Relationship between `type_of_order` and `festival`

Running the chi2_test function on these two indicated no significant association.

`city_name`

This is a categorical column. Running the function categorical_analysis on this column gives the bar plot of the count of its categories, which is shown in Figure 54.

Relationship between `time_taken` (target) and `city_name`

Figure 55 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as city_name and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as city_name and the numerical column as time_taken. — **Figure 55:** Result of running `categorical_numerical_analysis` with the categorical column as `city_name` and the numerical column as `time_taken`.

It looks like there is no significant effect of city_name on time_taken.

`city_type`

This is a categorical column. Running the function categorical_analysis on this column gives the bar plot of the count of its categories, which is shown in Figure 56.

We can see that semi-urban city_type is extremely rare. We will need to handle it.

Relationship between `time_take` (target) and `city_type`

Figure 57 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as city_type and the numerical column as time_taken.

Result of running categorical_numerical_analysis with the categorical column as city_type and the numerical column as time_taken. — **Figure 57:** Result of running `categorical_numerical_analysis` with the categorical column as `city_type` and the numerical column as `time_taken`.

Clearly, urban cities have the lowest time_taken, followed by metropolitian, and then semi-urban. The fact that they are associated was also confirmed by running the anova_test function on them.

Relationship between `ratings` and `city_type`

Figure 58 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as city_type and the numerical column as ratings.

We can see that there are no ratings below 3.5 in cities of semi-urban city_type.

Relationship between `time_taken` (target), `city_type`, and `type_of_vehicle`

Figure 59 shows the plots generated after running the multivariate_analysis function with the numerical column as time_taken, and the two categorical columns as city_type and type_of_vehicle.

Result of running multivariate_analysis with numerical column as time_taken and the two categorical columns as city_type and type_of_vehicle. — **Figure 59:** Result of running `multivariate_analysis` with numerical column as `time_taken` and the two categorical columns as `city_type` and `type_of_vehicle`.

We can see that bicycles are not used in the cities with city_type as semi-urban.

`distance`

This is a numerical column. Running numerical_analysis on the column distance gives us the plots shown in Figure 60.

We can see that the distribution is not normal.

Relationship between `time_taken` (target) and `distance`

Figure 61 shows a scatter plot between time_taken and distance.

The values in the distance column also seem close to discrete. We also found that the correlation between time_taken and distance is about 0.32.

Relationship between `type_of_vehicle` and `distance`

Figure 62 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as type_of_vehicle and the numerical column as distance.

Result of running categorical_numerical_analysis with the categorical column as type_of_vehicle and the numerical column as distance. — **Figure 62:** Result of running `categorical_numerical_analysis` with the categorical column as `type_of_vehicle` and the numerical column as `distance`.

There doesn’t seem to be any relationship.

Relationship between `distance` and `festival`

Figure 63 shows the plots generated after running the categorical_numerical_analysis function with the categorical column as festival and the numerical column as distance.

Result of running categorical_numerical_analysis with the categorical column as festival and the numerical column as distance. — **Figure 63:** Result of running `categorical_numerical_analysis` with the categorical column as `festival` and the numerical column as `distance`.

We can see that riders cover larger distances during festivals.

Jupyter Notebook

The entire EDA was carried out in this Jupyter notebook.

Building a Baseline Model

We will first create a baseline model by dropping all the missing values in the data. This will be a linear regression model. We will later compare it with the same linear regression model but with missing values imputed.

Baseline Model with Missing Values Dropped

Here, we first simply use the same function to clean the data. Further, we also drop the unnecessary columns that we identified during the EDA to not have much predictive power to predict the target. These column are:

rider_id
restaurant_latitude
restaurant_longitude
delivery_latitude
delivery_longitude
order_date
order_time_hour
order_day

Next, we drop all the rows containing missing values, split the data into input features and the output, and finally split the data into training and test sets with 20% of the data in the test set.

Data Preprocessing

We have the following numerical columns:

age,
ratings,
pickup_time_minutes,
distance,

the following nominal categorical columns:

weather,
type_of_order,
type_of_vehicle,
festival,
city_type,
city_name,
order_month,
order_day_of_week,
order_day_is_weekend,
order_time_of_day,

and the following ordinal categorical columns:

traffic,
distance_type.

Note that we also have the columns multiple_deliveries and vehicle_condition which are already encoded correctly. So we won’t perform any data preprocessing on them. Now, we use simple one-hot encoding for the nominal categorical column, and ordinal encoding for the ordered categorical column. For traffic, the ordinal encoding is done with the following order:

traffic_order = ['low', 'medium', 'high', 'jam']

and for distance_type it is done using the following order:

distance_type_order = ['short', 'medium', 'long', 'very_long']

Next, we build a preprocessor using scikit-learn’s ColumnTransformer. The code for that is the following:

preprocessor = ColumnTransformer(
    transformers=[
        ("scaler", MinMaxScaler(), num_cols),
        ("nominal_encoder", OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False), nominal_cat_cols),
        ("ordinal_encoder", OrdinalEncoder(categories=[traffic_order, distance_type_order]), ordinal_cat_cols)
    ],
    remainder="passthrough",
    n_jobs=-1,
    force_int_remainder_cols=False,
    verbose_feature_names_out=False,
)

Notice that before encoding we are using the MinMaxScaler transformation on the columns. We train this preprocessor object on the training data and use it to transform the training as well as test sets.

Transforming the Target `time_taken`

As seen during the EDA, the Yeo-Johnson transformation makes the column time_taken (the target) become more normal. Hence, we train a PowerTransformer object with the Yeo-Johnson method on the training output and transform the training and the test outputs.

Model

Training a linear regression model on this data with missing values dropped gives us the error metrics shown in Table 7.

Metric	Training Set	Test Set
MAE	4.67 mins.	4.73 mins.
R²	0.60	0.60

Table 7: Model performance metrics on training and test datasets with missing values dropped.

Baseline Model with Missing Values Imputed

We will follow the same data cleaning and preprocessing steps, however this time we will also impute the missing values. For imputing the missing values, we will need to check if simple imputation methods, like the mean or median imputation, will work by comparing the distribution of the column before and after imputation. If the distribution is more or less the same, we will stick with these simple techinques. Otherwise, we will use complex imputation techniques, like $k$-NN imputation.

The columns that have missing values are the following:

age,
ratings,
weather,
traffic,
multiple_deliveries,
festival,
city_type,
distance,
distance_type,
pickup_time_minutes,
order_time_of_day.

`age`

The distribution of age before and after median imputation is shown in Figure 64.

The distribution of age has changed significantly. We should use advanced imputation techniques like $k$-NN imputation.

`ratings`

The distribution of ratings before and after median imputation is shown in Figure 65.

Again, the distribution has changed.

`weather`

The count plot of the categories in weather is shown in Figure 66.

No category in weather is dominating. Hence, we cannot impute using the most frequent (or mode) category. It will change the distribution significantly. So, we will instead use a missing indicator column and impute the missing values using the string "missing".

`traffic`

The count plot of the categories in traffic is shown in Figure 67.

There are two traffic categories that are dominating, 'low' and 'jam'. Hence, we cannot use the most frequent category (or mode) imputation. So we will again have to use a missing indicator.

`multiple_deliveries`

The count plot of the categories in multiple_deliveries is shown in Figure 68.

multiple_deliveries has a clear dominating category. So, we can impute it using its mode. The count plot after imputation is shown in Figure 69.

Count plot of multiple_deliveries after mode imputation. — **Figure 69:** Count plot of `multiple_deliveries` after mode imputation.

`festival`

The count plot of the categories in festival is shown in Figure 70.

festival has an overwhelmingly dominating category. Hence we can use mode imputation. The count plot after imputation stays almost the same.

`city_type`

The count plot of the categories in city_type is shown in Figure 71.

city_type has an overwhelmingly dominating category. Hence we can use mode imputation. The count plot after imputation stays almost the same.

`pickup_time_minutes`

The distribution of pickup_time_minutes before and after median imputation is shown in Figure 72.

The distribution has changed. We will need to use $k$-NN imputation.

`order_time_of_day`

The count plot of the categories in order_time_of_day is shown in Figure 73.

There is no category that is overwhelmingly dominating. So we will have to use a missing indicator column and a fill value of "missing".

`distance`

The distribution of distance before and after median imputation is shown in Figure 74.

As the distribution has changed substantially, we should use advanced imputation methods.

`distance_type`

The count plot of the categories in distance_type is shown in Figure 75.

There is no category that is overwhelmingly dominating. So we will have to use a missing indicator column and a fill value of "missing".

Creating a Model Pipeline

We will use mode imputation to fill the missing values in the columns

multiple_deliveries,
festival,
city_type,

and use a missing indicator with a fill value of "missing" in the columns

weather,
type_of_order,
type_of_vehicle,
city_name,
order_month,
order_day_of_week,
order_day_is_weekend,
order_time_of_day,

and use $k$-NN imputation for the remaining columns. We will now create one more ColumnTransformer object for simple imputation (mode imputation and missing indicator).

simple_imputer = ColumnTransformer(
    transformers=[
        ("mode_imputer", SimpleImputer(strategy="most_frequent"), features_to_fill_mode),
        ("missing_ind", SimpleImputer(strategy="constant", fill_value="missing"), features_to_fill_missing_ind)
    ],
    remainder="passthrough",
    n_jobs=-1,
    force_int_remainder_cols=False,
    verbose_feature_names_out=False
)

Further, we will also create a KNNImputer object.

knn_imputer = KNNImputer(n_neighbors=5)

Next, we will combine all these transformers into a single scikit-learn Pipeline.

preprocessing_pipeline = Pipeline(
    steps=[
        ("simple_imputer", simple_imputer),
        ("preprocessor", preprocessor),
        ("knn_imputer", knn_imputer)
    ]
)

Finally, we will add this pipeline inside another pipeline that also consists of the linear regression model.

lr = LinearRegression()

model_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessing_pipeline),
        ("model", lr)
    ]
)

The error metrics obtained on the data using this pipeline are shown in Table 8.

Metric	Training Set	Test Set
MAE	4.82 mins.	4.85 mins.
R²	0.58	0.58

Table 8: Model performance metrics on training and test datasets with missing values imputed.

Comparing Table 7 and Table 8, we can see that dropping the missing values gives a model with better performance as compared imputing them.

Trying Random Forest Model in the Model Pipeline

We also tried random forest model in the pipeline with missing value imputation.

rf = RandomForestRegressor(random_state=42, n_jobs=-1)

model_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessing_pipeline),
        ("model", rf)
    ]
)

The error metrics obtained on the data using this pipeline are shown in Table 9.

Metric	Training Set	Test Set
MAE	1.22 mins.	3.28 mins.
R²	0.97	0.80

Table 9: Random Forest model performance metrics on training and test datasets with missing values imputed.

We can see that the random forest model performs way better than the baseline linear regression model. However, it also overfits more.

Jupyter Notebook

The baseline model was built in this Jupyter notebook.

Experimentation

We have used MLFlow to log experiments and their results. The tracking server is set up on DagsHub. The experiments we performed are the following:

Compare a random forest model with imputing missing values vs. dropping missing values.
Train a random forest model with missing value imputation and adding missing indicator columns.
Train the following regressors and their parameters (using the Optuna library) on the best configuration obtained from the previous two experiments and select two best models:
- Random forest,
- Gradient boosting,
- XGBoost,
- LightGBM,
- $k$-Nearest neighbors,
- SVM.
Carry out detailed hyperparameter tuning of the first model obtained in the previous experiment.
Carry out detailed hyperparameter tuning of the second model obtained in the experiment 3.
Train a stacking regressor, i.e., a meta estimator, using the two best models obtained in the previous two experiments as base estimators. We chose multiple meta estimators
Carry out detailed hyperparameter tuning of the stacking regressor with the best meta estimator.

Let us now discuss the results of each experiment.

Experiment 1: Drop vs. Impute `NaN`’s

The results are shown in Table 10.

Metric	Training Set (Dropped `NaN`'s)	Test Set (Dropped `NaN`'s)	Training Set (Imputed `NaN`'s)	Test Set (Imputed `NaN`'s)
MAE	1.14 mins.	3.12 mins.	1.22 mins.	3.28 mins.
R²	0.98	0.83	0.97	0.80

Table 10: Random Forest model performance comparison with missing values dropped vs. imputed.

The parallel coordinates plot obtained on MLFlow for this experiment is shown in Figure 76.

**Figure 76:** Parallel coordinates plot for experiment 1.

We can see that the model performs better when the missing values are dropped. However the common result is that the model is overfitting in both situations. But we now know that dropping the missing values is the way to go.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 2: Adding Missing Indicator Columns

This is very simply done by adding the Boolean parameter of

add_indicator = True

inside the SimpleImputer object while defining it. The result obtained in this experiment are shown in Table 11.

Metric	Training Set	Test Set
MAE	1.21 mins.	3.26 mins.
R²	0.97	0.81

Table 11: Random forest model performance metrics on training and test datasets with added missing indicator columns.

Comparing Table 10 and Table 11, we can see that adding missing indicator columns barely made any difference. Figure 77 shows the parallel coordinates plot that compares adding missing indicator columns vs. not adding them while imputing missing values.

**Figure 77:** Parallel coordinates plot for experiment 2.

So, from experiments 1 and 2, we can conclude that dropping the missing values is the best strategy.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 3: Model Selection

As mentioned before, we trained the following models:

Random forest,
Gradient boosting,
XGBoost,
LightGBM,
$k$-Nearest neighbors,
SVM,

using Optuna that carries out Bayesian optimization. The best performing model is LightGBM. The results are shown in Table 12.

Metric	Training Set	Test Set
MAE	2.82 mins.	3.05 mins.
R²	0.86	0.84

Table 12: Model performance metrics on training and test datasets for the LightGBM model.

We can see that this model not only performs well on the test set as compared to random forest, but also has less overfitting. The second best model we got was random forest.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 4: Detailed Hyperparameter Tuning of LightGBM

We carried detailed hyperparameter tuning of the LightGBM model using Optuna (Bayesian optimization). The hyperparameters we tuned are the following:

n_estimators,
max_depth,
learning_rate,
subsample,
min_child_weight,
min_split_gain,
reg_lambda.

Further, we also logged the best performing LightGBM model on MLFlow. The optimization history plot of this experiment is shown in Figure 78.

**Figure 78:** LightGBM optimization history.

The hyperparameter importances plot is shown in Figure 79.

**Figure 79:** LightGBM hyperparameter importances.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 5: Detailed Hyperparameter Tuning of Random Forest

We carried detailed hyperparameter tuning of the random forest model using Optuna (Bayesian optimization). The hyperparameters we tuned are the following:

n_estimators,
max_depth,
max_features,
min_samples_split,
min_samples_leaf,
max_samples.

Further, we also logged the best performing random forest model on MLFlow. The optimization history plot of this experiment is shown in Figure 80.

**Figure 80:** Random forest optimization history.

The hyperparameter importances plot is shown in Figure 81.

**Figure 81:** Random forest hyperparameter importances.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 6: Meta Regressor

We tried three meta models: linear regression, $k$-nearest neighbors, and decision tree. The base estimators used were the best models that we got in the previous two experiments. This hyperparameter tuning to find the best meta model was also done using Optuna. This is just a basic coarse hyperparameter tuning. Detailed one is carried out in the next experiment. We found that the best meta model is linear regression.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 7: Final Regressor

The best meta regressor found in the previous experiment was linear regression. As linear regression does not have any hyperparameters to tune, we created the final regressor in this experiment using base estimators as the best models found in experiments 4 and 5. We also logged this final regressor model on MLFlow.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Building the DVC Pipeline

Let us discuss the stages that will be incorporated in the DVC pipeline. All these stages were implemented in complete modular fashion with classes and objects. Also, the DVC remote was set as an Amazon S3 bucket.

Stages in the DVC Pipeline

Data Cleaning

The functionality of this stage is almost the same as that of the clean_data function that we created when we cleaned the data before doing EDA. This stage does the following:

Removes trailing spaces from each entry in the observations,
Converts all the entries to lowercase,
Replaces the string "NaN" to np.nan,
Renames columns to more succinct and meaningful names,
Drops all the ID related columns,
Drops all the rows corresponding to minors, i.e., riders below the age of 18,
Drops all the rows corresponding to a rating of 6,
Creates a new feature for the name of the city,
Cleans location columns (latitude and longitude columns),
Computes haversine distance using the location columns,
Bins the distances,
Cleans datetime columns,
Cleans the weather column,
Converts the datatype of multiple_deliveries column to float
Transforms the target (time_taken) using the Yeo-Johnson transformation,
Drops unnecessary columns.

Once all these steps are carried out, the cleaned data is saved. All these steps are implemented in the data_cleaning.py module, which can be checked out here.

Training-Test Split

This stage simply does the following:

Loads the cleaned data that was saved after running the previous stage,
Loads the parameters (from the params.yaml file) needed for creating the training-test split (e.g., test_size, random_state),
Splits the cleaned data into training and test sets,
Saves these sets.

All these steps are implemented in the training_test_split.py module, which can be checked out here.

Data Preprocessing

This stage carries out the following steps:

Loads the training and test sets that were saved after running the previous stage,
Drops missing values from both these sets,
Forms the input-output (X_train, y_train, X_test, y_test) split,
Creates a scikit-learn’s ColumnTransformer object containing steps for min-max scaling, and categorical encoding,
Trains this object on the training set,
Transforms the training and test sets using the trained ColumnTransformer object,
Joins transformed X_train and y_train to form the training data, and similarly forms the test data,
Saves the transformed training and test data,
Saves the trained ColumnTransformer object.

All these steps are implemented in the data_preprocessing.py module, which can be checked out here.

Model Training

This stage carries out the following steps:

Loads the transformed training data that was saved after running the previous stage,
Forms the input-output (X_train, y_train) split of this transformed training data,
Reads the tuned hyperparameters (from the params.yaml) of both the models (LightGBM and random forest),
Creates a StackingRegressor model with linear regression as the meta estimator and the tuned LightGBM and tuned random forest as the base estimators,
Trains this StackingRegressor model and the Yeo-Johnson transformer on the training data,
Saves this trained model and the trained transformer.

All these steps are implemented in the model_training.py module, which can be checked out here.

Model Evaluation

This stage carries out the following steps:

Loads the transformed training and test data that was saved after running the data preprocessing stage,
Input-output splits them both (i.e., forms X_train, y_train, X_test, y_test),
Loads the trained model that was saved after running the previous stage,
Makes prediction on the X_train and X_test (i.e., gets the training and test predictions),
Compares the predictions with the ground truths and calculates the following metrics:
- Training MAE,
- Test MAE,
- Training $R^2$,
- Test $R^2$,
- Average of the cross-validated MAE.
Logs the individual cross-validated MAE, data (training and test) as the input, model, and artifacts like the stacking regressor, preprocessor and the Yeo-Johnson transformer on MLFlow,
Logs the information of the trained model in a JSON file. This file contains the ID, path, and the name of the model. Using this information, this model can be registered in MLFlow model registry (which is the next stage).

All these steps are implemented in the model_evaluation.py module, which can be checked out here.

Model Registration

This stage carries out the following steps:

Loads the JSON file containing the information of the trained model which was saved after running the previous stage,
Using this information, it registers the model on MLFlow model registry,
Transition the stage of the model to “Staging”.

All these steps are implemented in the model_registration.py module, which can be checked out here.

`dvc.yaml`

All these stages in the DVC pipeline are connected through the dvc.yaml file, which can be checked out here. Running the following command in the terminal from the root directory (i.e., the directory where dvc.yaml file is present) will run this pipeline:

dvc repro

Building the API

The API is created using FastAPI. The steps carried out are the following:

Cleans the data,
Loads the JSON file containing the model information,
Loads the preprocessor object,
Loads the latest model,
Builds a scikit-learn Pipeline with the preprocessor and the model,
Makes predictions using the pipeline.

The app has the following two endpoints:

"/": This is a basic welcome page.
"/predict": This makes the predictions on the given data using the pipeline.

The code for this is in the app.py module, which can be checked out here. The API can be manually tested using the Swagger UI that is comes with FastAPI.

Locally Testing the API

The API was locally tested by creating a new Python script sample_predictions.py. This script loads a random row from the raw data and gives it as input to the "/predict" endpoint of the API and prints the ground truth and the prediction. This script can be checked out here.

Model Testing

The following tests are performed on the model:

Model Loading:
- Testing whether the registered model is loading correctly.
Model Performance:
- Testing whether the performance of this model is good.

Once the model passes both these tests, it will be promoted from the “Staging” stage to the “Production” stage. Later, we will include both these tests as a part of continuous integration (CI) using GitHub actions.

Model Loading

As mentioned earlier, this test makes sure that the registered model correctly loads from the MLFlow model registry. The following steps are carried out:

Loads the JSON file containing the latest model information,
Gets the latest model from the model registry using this information and a given stage,
Performs a parameterized test that checks whether the registered model loads correctly.

The code for this is in the test_model_registration.py module inside the tests directory, which can be checked out here.

Model Performance

Again, as mentioned earlier, this test makes sure that the performance of the registered model is good. This is done by making sure that the test MAE is below a certain threshold. The steps carried out are the following:

Loads the JSON file containing the latest model information,
Gets the latest model from the model registry using this information and a given stage,
Loads the preprocessor,
Creates a scikit-learn Pipeline using the preprocessor and the loaded model,
Loads the test data,
Makes predictions using the pipeline,
Calculate the test MAE,
Performs a parameterize test and check whether this test MAE is below the threshold (which is right now kept at 5 mins.).

The code for this is in the test_model_performance.py module inside the tests directory, which can be checked out here.

Continuous Integration (CI)

We automate the tests that are performed so far using CI on a GitHub runner using GitHub actions. This is done by creating a workflow file called ci_cd.yaml (the name can be anything) inside the workflows directory, which is in turn placed inside the .github directory of the project root. This workflow file consists of all the instructions that the GitHub runner follow to carry out these tests. Once all these tests are passed, we promote the model from the “Staging” stage to the “Production” stage. This promotion is done using the promote_model_to_production.py module, which can be checked out here.

The current content of the ci_cd.yaml workflow file is the following:

name: CI-CD

on: push

jobs:
  CI-CD:
    runs-on: ubuntu-latest
    steps:
      - name: Code checkout
        uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements-dev.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-
      
      - name: Install packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt
    
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      
      - name: DVC pull
        run: |
          dvc pull
      
      - name: Test model registration
        env:
          DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
        run: |
          pytest tests/test_model_registration.py
      
      - name: Test model performance
        env:
          DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
        run: |
          pytest tests/test_model_performance.py
      
      - name: Promote model
        if: success()
        env:
          DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
        run: |
          python scripts/promote_model_to_production.py

The explanation is the following:

Workflow Name and Trigger
```
name: CI-CD

on: push
```
- Name: The workflow is named CI-CD.
- Trigger: It runs every time code is pushed to the repository.
Job Definition
```
jobs:
  CI-CD:
    runs-on: ubuntu-latest
```
- Defines a single job also named CI-CD.
- This job will run on the latest Ubuntu runner provided by GitHub.
Step 1: Checkout Repository
```
- name: Code checkout
  uses: actions/checkout@v4
```
- Checks out the latest version of the repository so the runner has access to your code.

Step 2: Set Up Python

- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.12'
    cache: 'pip'

Installs Python 3.12 on the runner.
Enables caching for pip to speed up future runs.

Step 3: Cache pip Dependencies

- name: Cache pip dependencies
  uses: actions/cache@v3
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements-dev.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

Uses caching to store downloaded Python packages between workflow runs.
This reduces time and bandwidth if requirements-dev.txt hasn’t changed.

Step 4: Install Python Packages
```
- name: Install packages
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements-dev.txt
```
- Upgrades pip to the latest version.
- Installs all the development dependencies listed in requirements-dev.txt.

Step 5: Configure AWS Credentials

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-2

Uses GitHub secrets to authenticate with AWS.
Enables access to AWS services like S3 or ECR (used by DVC in this case).

Step 6: Pull Data and Models using DVC
```
- name: DVC pull
  run: |
    dvc pull
```
- Runs dvc pull to download data, models, or other tracked artifacts from DVC remote storage.
Step 7: Test Model Registration
```
- name: Test model registration
  env:
    DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
  run: |
    pytest tests/test_model_registration.py
```
- Runs the test located in tests/test_model_registration.py.
- This tests verifies whether the model loads successfully from the model registry.
- Uses a DagsHub token for authenticated interaction.

Step 8: Test Model Performance

- name: Test model performance
  env:
    DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
  run: |
    pytest tests/test_model_performance.py

Runs the test located in tests/test_model_performance.py.
This test verifies that the performance of the model is good.

Step 9: Promote Model to Production

- name: Promote model
  if: success()
  env:
    DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
  run: |
    python scripts/promote_model_to_production.py

If all previous steps succeed, the script promote_model_to_production.py is executed.
It promotes the model to production stage.

Containerizing the API using Docker

🚧 Work in Progress: This page is currently being written. Some sections are complete, while others are still under construction. Feel free to explore and check back later for updates!

Introduction

Problem Statement

Data

Business Use Case

How Machine Learning Helps?

Enhanced Customer Experience

Operational Efficiency for Delivery Management

Financial Optimization

Strategic Decision-Making

Summary

Overall

Riders

Restaurants

What Metrics to Use?

Improvements in Business

Customer Satisfaction Score (CSAT)

Customer Retention Rate

Average Order Value (AOV)

Delivery Success Rate

Operational Efficiency Metrics

Customer Support Cost

Order Cancellation Rate

Flow of the Project

Data Cleaning

Basic Data Exploration

Missing Values and Stripping Whitespaces

Visualizing Missing Values

Matrix of Missing Values

Count of Missing Values

Correlation of Missingness

Dendrogram of Missing Values

Observations

Basic Data Cleaning

Stripping Leading and Trailing Whitespaces

Converting String Encoded NaN values to Actual NaN values

Renaming the Columns

Removing Duplicates

Column-wise Data Cleaning

id

rider_id

age

Boxplot

Minors

ratings

Observations

Six-Star Ratings

Location Columns

Fixing Incorrect Location Columns

Creating distance using Location Columns

order_date

order_time and order_picked_time

weather

traffic

vehicle_condition

type_of_order

type_of_vehicle

multiple_deliveries

festival

city_type

time_taken (The Target)

Function for Data Cleaning

Missing Values after Data Cleaning

Jupyter Notebook

EDA

Functions to Perform Analysis

Column-wise EDA

time_taken (Target)

Transforming time_taken using Yeo-Johnson Transformation

rider_id

age

Relationship between time_taken (target) and age

Relationship between time_taken, age, and type_of_vehicle

Relationship between age and type_of_vehicle

ratings

Relationship between time_taken (target) and ratings

Relationship between vehicle_condition and ratings

Relationship between type_of_vehicle and ratings

Relationship between festivals and ratings

Location Columns

order_date and Related Columns

`id`

`rider_id`

`age`

`ratings`

Creating `distance` using Location Columns

`order_date`

`order_time` and `order_picked_time`

`weather`

`traffic`

`vehicle_condition`

`type_of_order`

`type_of_vehicle`

`multiple_deliveries`

`festival`

`city_type`

`time_taken` (The Target)

`time_taken` (Target)

Transforming `time_taken` using Yeo-Johnson Transformation

`rider_id`

`age`

Relationship between `time_taken` (target) and `age`

Relationship between `time_taken`, `age`, and `type_of_vehicle`

Relationship between `age` and `type_of_vehicle`

`ratings`

Relationship between `time_taken` (target) and `ratings`

Relationship between `vehicle_condition` and `ratings`

Relationship between `type_of_vehicle` and `ratings`

Relationship between `festivals` and `ratings`

`order_date` and Related Columns

Relationship between `time_taken` (target) and `order_day_of_week`

Relationship between `time_taken` (target) and `order_day_is_weekend`

Relationship between `traffic` and `order_day_is_weekend`

Relationship between `time_taken` (target) and `festival`

Relationship between `festival` and `traffic`

Relationship between `festival`, `order_day_is_weekend`, and `time_taken`

`order_time` and Related Columns

Relationship between `order_time_of_day` and `time_taken` (target)

`pickup_time_minutes`

Relationship between `time_taken` (target) and `pickup_time_minutes`

`traffic`

Relationship between `traffic` and `city_type`, and `traffic` and `city_name`

Relationship between `time_taken` (target) and `traffic`

Relationship between `time_taken` (target), `traffic`, and `type_of_vehicle`

Relationship between `time_taken` (target), `traffic`, and `vehicle_condition`

Relationship between `time_taken`, `festival`, and `vehicle_condition`

`multiple_deliveries`

Relationship between `multiple_deliveries` and `time_taken` (target)

Relationship between `multiple_deliveries` and `distance`

`weather`

Relationship between `time_taken` (target) and `weather`

Relationship between `weather` and `traffic`

Relationship between `weather`, `traffic`, and `time_taken`

`vehicle_condition`

Relationship between `time_taken` (target) and `vehicle_condition`

`type_of_vehicle`

Relationship between `time_taken` (target) and `type_of_vehicle`

Relationship between `vehicle_condition`, `type_of_vehicle`, and `time_taken` (target)

Relationship between `type_of_vehicle` and `vehicle_condition`

`type_of_order`

Relationship between `time_taken` (target) and `type_of_order`

Relationship between `type_of_order` and `order_day_is_weekend`

Relationship between `type_of_order` and `pickup_time_minutes`

Relationship between `type_of_order` and `ratings`

Relationship between `type_of_order` and `festival`

`city_name`

Relationship between `time_taken` (target) and `city_name`

`city_type`

Relationship between `time_take` (target) and `city_type`

Relationship between `ratings` and `city_type`

Relationship between `time_taken` (target), `city_type`, and `type_of_vehicle`

`distance`

Relationship between `time_taken` (target) and `distance`

Relationship between `type_of_vehicle` and `distance`

Relationship between `distance` and `festival`

Transforming the Target `time_taken`

`age`

`ratings`

`weather`

`traffic`

`multiple_deliveries`