Introduction

Predicting the cuisine of a dish from its ingredients is an interesting problem. In this blog post, I show how I used an IBM research project to predict whether a given dish is Mexican, Greek, Korean or Indian based on its ingredients.

Problem & Approach

My favourite cuisines by far are Mexican, Greek, Korean and Indian, so I asked myself whether it is possible to predict which of these four a given dish belongs to. After thinking about this question for some time, I guessed that if I knew the ingredients in a dish, I might be able to estimate its cuisine. For example, if a dish contains cumin, turmeric, onion and coriander, one could assume that it is Indian.

Having defined my question, I thought about which analytic approach might answer it. Could machine learning or statistics help? The IBM research team that I was following during this project suggested that they could, and that the decision tree machine learning algorithm might work well because it can handle large data sets, missing information and both numeric and categorical features, among other things.

Requirements & Collection

To build such a model, I needed recipe data (cuisine labels and ingredients) for Mexican, Greek, Korean and Indian dishes. Thankfully, an IBM researcher named Yong-Yeol Ahn had scraped tens of thousands of recipes from three different websites, producing a table that covers many cuisines, including Mexican, Greek, Korean and Indian.

Understanding & Preparation

Understanding and preparing the cuisine data was a vital step in this experiment because it increased the likelihood that the decision tree algorithm would answer the question effectively. The data set consisted of 385 columns of different ingredients and over 57 000 rows of recipes from different cuisines. Each cell held either Yes or No, depending on whether that ingredient was present in the recipe. While analysing the data, I noticed that I could calculate how often each ingredient appears in each cuisine, as a percentage of that cuisine's recipes. For example, in Mexican food cayenne is found 74% of the time, onion 68%, garlic 62% and tomato 59%. This pattern gave me confidence that the decision tree algorithm could work.
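The percentage calculation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the four recipes are made-up toy data (not rows from the real table), the function name `ingredient_frequency` is hypothetical, and the sketch uses only the standard library rather than whatever tooling the original project used.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (cuisine, {ingredient: "Yes"/"No"}). Not the real data.
recipes = [
    ("mexican", {"onion": "Yes", "tomato": "Yes", "soy_sauce": "No"}),
    ("mexican", {"onion": "Yes", "tomato": "No", "soy_sauce": "No"}),
    ("korean", {"onion": "No", "tomato": "No", "soy_sauce": "Yes"}),
    ("korean", {"onion": "Yes", "tomato": "No", "soy_sauce": "Yes"}),
]


def ingredient_frequency(rows):
    """Return {cuisine: {ingredient: % of that cuisine's recipes containing it}}."""
    totals = Counter()              # number of recipes per cuisine
    present = defaultdict(Counter)  # per cuisine: recipes containing each ingredient
    for cuisine, ingredients in rows:
        totals[cuisine] += 1
        for name, flag in ingredients.items():
            if flag == "Yes":
                present[cuisine][name] += 1
    return {
        cuisine: {
            name: 100.0 * count / totals[cuisine]
            for name, count in present[cuisine].items()
        }
        for cuisine in totals
    }


freq = ingredient_frequency(recipes)
for cuisine, table in freq.items():
    # List the most common ingredients first.
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    print(cuisine, ranked)
```

Ingredients that never appear in a cuisine are simply absent from that cuisine's table, which keeps the ranking short when there are hundreds of ingredient columns.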

A few problems needed to be corrected first. The name of the first column had to be changed from country to cuisine, and every cuisine name had to be converted to lowercase. I also found that a few cuisines existed under slightly different names. For example, Vietnamese cuisine appeared as both Vietnam and Vietnamese, so I merged the two groups into one called Vietnamese. Lastly, cuisines with fewer than 50 recipes were removed, since so few examples would hurt the model's accuracy.
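As a rough sketch, the clean-up steps above might look like this in Python. Everything here is illustrative: the toy table, the `ALIASES` mapping and the `clean` function are hypothetical names, and the threshold is lowered from 50 to 2 so that the tiny example has something to drop.

```python
from collections import Counter

# Hypothetical toy table: the first column is still called "country"
# and the labels use mixed case. Not the real data.
header = ["country", "onion", "tomato"]
rows = [
    ["Vietnam", "Yes", "No"],
    ["Vietnamese", "No", "Yes"],
    ["Mexican", "Yes", "Yes"],
    ["Mexican", "Yes", "No"],
    ["Icelandic", "No", "No"],
]

# Alternative spellings mapped to one canonical lowercase name (assumed mapping).
ALIASES = {"vietnam": "vietnamese"}
MIN_RECIPES = 2  # the post uses 50; lowered here for the toy table


def clean(header, rows, min_recipes):
    """Rename the label column, normalise labels, and drop rare cuisines."""
    first = "cuisine" if header[0] == "country" else header[0]
    new_header = [first] + header[1:]

    normalised = []
    for row in rows:
        label = row[0].strip().lower()      # lowercase every cuisine name
        label = ALIASES.get(label, label)   # merge duplicate spellings
        normalised.append([label] + row[1:])

    # Keep only cuisines that have at least min_recipes recipes.
    counts = Counter(row[0] for row in normalised)
    kept = [row for row in normalised if counts[row[0]] >= min_recipes]
    return new_header, kept


new_header, cleaned = clean(header, rows, MIN_RECIPES)
print(new_header)
print(cleaned)
```

Merging the aliases before counting matters: done the other way round, Vietnam and Vietnamese would each be counted separately and could both fall below the threshold.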

Modelling

After the data was prepared, modelling began using the decision tree algorithm. Below are two images. The first shows the final data used in the model (the number of recipes associated with each cuisine), and the second shows the decision tree model targeting Mexican, Greek, Korean and Indian cuisine. As we can see, the tree predicts these cuisines through a sequence of logical Yes and No steps.
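To make the Yes and No steps concrete, here is a minimal decision tree written from scratch in Python. It is a sketch under assumptions, not the model from this project: the eight recipes are invented toy data, the functions `gini`, `build_tree` and `predict` are hypothetical, and the split rule (pick the ingredient that gives the lowest weighted Gini impurity) is one common choice that the original model may or may not have used.

```python
from collections import Counter

# Hypothetical toy recipes: (cuisine, set of ingredients present). Not the real data.
recipes = [
    ("indian", {"cumin", "turmeric", "onion"}),
    ("indian", {"cumin", "coriander", "onion"}),
    ("mexican", {"cumin", "tomato", "onion"}),
    ("mexican", {"tomato", "cayenne", "onion"}),
    ("greek", {"feta", "olive_oil", "tomato"}),
    ("greek", {"feta", "olive_oil", "onion"}),
    ("korean", {"soy_sauce", "sesame_oil", "garlic"}),
    ("korean", {"soy_sauce", "garlic", "onion"}),
]


def gini(rows):
    """Gini impurity of the cuisine labels in rows (0.0 means a pure group)."""
    counts = Counter(label for label, _ in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def build_tree(rows, ingredients):
    """Recursively split on the ingredient with the lowest weighted Gini impurity."""
    majority = Counter(label for label, _ in rows).most_common(1)[0][0]
    if gini(rows) == 0.0:
        return majority  # leaf: every recipe here has the same cuisine
    best = None
    for ingredient in sorted(ingredients):  # sorted for a deterministic tie-break
        yes = [row for row in rows if ingredient in row[1]]
        no = [row for row in rows if ingredient not in row[1]]
        if not yes or not no:
            continue  # this ingredient does not separate anything
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        if best is None or score < best[0]:
            best = (score, ingredient, yes, no)
    if best is None:
        return majority  # no useful split left
    _, ingredient, yes, no = best
    remaining = ingredients - {ingredient}
    return {
        "ingredient": ingredient,
        "yes": build_tree(yes, remaining),
        "no": build_tree(no, remaining),
    }


def predict(tree, present):
    """Follow Yes/No branches until a leaf (a cuisine name) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["ingredient"] in present else tree["no"]
    return tree


all_ingredients = set().union(*(ingredients for _, ingredients in recipes))
tree = build_tree(recipes, all_ingredients)
print(predict(tree, {"cumin", "turmeric", "onion", "coriander"}))  # prints: indian
```

On this toy data the tree asks about feta first, then garlic, then tomato, so the cumin and turmeric dish from earlier in the post answers No three times and ends at the Indian leaf. A real analysis would more likely rely on an established machine learning library and would hold out test data to evaluate the tree.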

Conclusion

As you can see above, it is possible to build a model that predicts whether a given dish is Mexican, Greek, Korean or Indian from its ingredients. The work is far from done, however, since the model still needs to be evaluated and improved before deployment. I will not be describing that work in this blog post. I hope that this post inspires you to use machine learning to solve your own problems and answer your own questions.

It is worth noting that more investigation is needed to ensure the best results.

* Since technology is continually developing, by the time you read this blog the products used may have changed.