A Deep Reinforcement Learning Approach for the Meal Delivery Problem

Meal delivery is a rapidly growing business area and is globally projected to reach $200 billion in terms of gross revenue by 2025. Especially in urban areas, many people may not find time to cook for a variety of reasons, including long working hours and heavy traffic, making them order takeout frequently. With the increase of such demand, meal delivery service, which started by facilitating orders from each restaurant’s own web page, has led to the emergence of services such as UberEats, Doordash, and Grubhub that bring orders from restaurants to the customers. Increasing demand over the years has enabled this market to thrive. The revenue growth of the platform-to-customer food delivery market has increased by 27% from 2019 to 2020. The number of users is expected to amount to 965.8 million all around the world by 2024 in the platform-to-consumer delivery segment.

In addition to the challenges due to increasing demand, meal delivery services also pose significant operational difficulties. Most delivery platforms guarantee their customers’ short delivery times intending to serve. Providing a reliable and fast service enables the company to attract more users, leading to revenue and market share growth. However, especially in the peak hours of the day, i.e., at noon or in the evening, timely delivery becomes a challenging task. Accordingly, courier shifts and order assignments should be defined based on the expected number of orders and the couriers’ working hours. Another concern is the increasing number of couriers, which can be a prohibitive operational cost item while operating such a company. Accordingly, it is important to determine the ideal number of couriers that should be ready for each hour of the day to fulfill the customers’ demands while minimizing the cost at the same time.

In our study, a deep reinforcement learning approach for the meal delivery problem published in Knowledge-Based Systems Journal, we consider a limited number of available couriers fulfilling customer demands by picking up the meals from requested restaurants and delivering them to the customers as fast as possible.

Our overall objective for the meal delivery problem is to maximize total profit by minimizing the expected delays, postponing or declining orders with low/negative rewards, and assigning the orders to the most appropriate couriers. We apply various deep reinforcement algorithms to solve our model, many of which have not been considered for the meal delivery and courier routing problems. We conduct detailed numerical analysis with a varying number of couriers in the system to determine the ideal number for the working hours of the day.

The main characteristics and novelty of our work can be summarized as follows.

  • The primary contribution of our study is to propose a novel MDP model for the meal delivery problem and show its use in practical settings through our numerical analysis of the datasets obtained from the day-to-day operations of a meal delivery company.
  • Our model enables examining system improvements that can contribute to profitability and service quality. For instance, in a typical setting, when a courier delivers an order, if there is no new order in the system, then the courier directly returns to the depot. In our proposed model, couriers may head to the restaurants instead of going back to the depot. By doing this, we aim to make the couriers learn the restaurants with a higher probability of having a new order so that the delivery times are shortened. In addition, we include a reject action in the action space, which corresponds to choosing not to deliver an order that is deemed to be far away from the current courier locations. This action allows trading immediate rewards with future rewards and helps us achieve higher expected total rewards. Furthermore, our model allows assigning multiple orders to the couriers provided that the long-run return of that assignment is higher than other options. This model specification has only been considered in a few other studies in the literature.
  • We conduct numerical experiments with different numbers of couriers to determine the ideal number of couriers at each hour of a given day to meet demands while minimizing costs. We also present the courier utilizations to explore the rate at which each courier is assigned a new order, which helps to observe workload distribution among couriers. This consideration has important practical implications and, to the best of our knowledge, has not been examined in previous studies on the meal-delivery problems.
  • Our detailed numerical study helps explore the performances of eight deep Q-Networks (DQN) algorithms for the meal delivery problem. We conduct our experiments using both synthetic and real-world datasets to better assess the generalizability of our results. We also perform hyperparameter tuning to identify the ideal parameters for our proposed model and provide a detailed discussion of these parameters. Accordingly, our observations contribute to future studies that apply DQN and its variants to other problems.

We analyze meal delivery data from three different regions of Istanbul, namely, Bomonti, Uskudar, and Hisarustu, between October 2019 and April 2020. Each region has a different size and order density. We consider Hisarustu as a low-density, Uskudar as a moderate density, and Bomonti as a high-density region. We convert each region’s real map to the corresponding grids consisting of cells of sizes 500 by 500 meters. The below table shows the characteristics of each region. The number of active cells in the table includes the parts of the map for which we have at least one past order, either from a customer or a restaurant partner. Also, the below figure shows the spatial distribution of restaurants across the region; however, note that the popularity of each restaurant — the number of received orders for a given time —- may differ over time.

Restaurant and depot location in Istanbul (Getir Company)

In our problem setting, the order cycle starts with receiving an order, and then the company either accepts or rejects the order based on the time to deliver it. Ideally, the company aims to accept all orders with the caveat that some orders might lead to excessive delivery times. The company rejects orders only if the customer does not accept the longer waiting time. As an order is placed, the corresponding restaurant is notified, and a courier is dispatched from the depot to the restaurant to deliver the order. When the food is delivered, the courier returns to the depot. We define our MDP problem and its features, namely, the states, actions, and rewards in the system.

We utilized Deep Reinforcement Learning and traditional approaches to address the problem. Our experiments show that Double Deep Q-Network with prioritized experience replay and hard update significantly improves the baselines.

Comparing the results in terms of the long-run reward

We also demonstrated the convergences of the model and sensitivity analysis for the model’s parameters.

You can read more about the paper in the journal of Knowledge-Based Systems:

I added the author’s draft version on the Arxiv website as well:

Do not hesitate to contact me if you have any questions regarding the paper.

Acknowledgments

I would like to thank the Getir company for supporting this study and providing data and feedback throughout.

en_CAEnglish