
### Introduction

Classifying compounds as organic or inorganic is a useful technique for someone working with or studying chemicals. In this blog post, I show how I used machine learning to build a model that determines whether a compound is organic or inorganic based on the mass % of Carbon found in the compound.

### Gathering The Data

I began by gathering real-world data of the mass % Carbon found in compounds. If the compound was organic then it got a score of 1 and if it was inorganic then it got a score of 0. In the end, I had a table of data with over 70 rows.

### Visualizing The Data
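
To make the labelled table and the scatter plot concrete, here is a minimal sketch of how they could look in Python. The rows are made-up placeholder values, not the data gathered for this post, and `ROWS`, `split_by_label` and `plot_carbon_scatter` are names chosen purely for illustration. matplotlib is assumed for the plot; the original notebook may have used something else.

```python
# Made-up rows for illustration only: (carbon mass fraction, label).
# Label 1 = organic, label 0 = inorganic, as described in the post.
ROWS = [
    (0.00, 0), (0.05, 0), (0.10, 0), (0.12, 0), (0.14, 0), (0.20, 0),
    (0.40, 1), (0.52, 1), (0.60, 1), (0.75, 1), (0.80, 1), (0.92, 1),
]


def split_by_label(rows):
    """Return (organic, inorganic) lists of carbon mass fractions."""
    organic = [x for x, label in rows if label == 1]
    inorganic = [x for x, label in rows if label == 0]
    return organic, inorganic


def plot_carbon_scatter(rows):
    """Scatter plot of carbon mass fraction against the 0/1 label."""
    # Imported here so the data helper above works without matplotlib.
    import matplotlib.pyplot as plt

    organic, inorganic = split_by_label(rows)
    fig, ax = plt.subplots()
    ax.scatter(inorganic, [0] * len(inorganic), label="inorganic (0)")
    ax.scatter(organic, [1] * len(organic), label="organic (1)")
    ax.set_xlabel("Mass fraction of carbon")
    ax.set_ylabel("Label")
    ax.set_yticks([0, 1])
    ax.legend()
    return fig
```

In a notebook, calling `plot_carbon_scatter(ROWS)` would draw two horizontal bands of points, one per label, which is the kind of picture that makes a boundary between the groups easy to spot.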

I then used Python in a Kaggle notebook to plot a simple scatter plot showing the distribution of the data. What I saw was a clear boundary between the two groups of data. This made me realise that I could use the logistic regression algorithm to build a decision boundary that would enable me to determine, with some level of accuracy, whether a new compound not found in the data set is organic or inorganic.

### Using The Logistic Regression Algorithm
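
For a single input such as the carbon mass fraction, logistic regression is usually written in the textbook form below. The symbols $x$ (carbon mass fraction), $w$ (fitted weight) and $b$ (fitted intercept) are introduced here for illustration and are not taken from the notebook.

$$
p(x) = \frac{1}{1 + e^{-(w x + b)}}
$$

Assuming the usual cutoff of $p = \tfrac{1}{2}$, the decision boundary is the carbon fraction $x^{*}$ at which the exponent is zero:

$$
p(x^{*}) = \tfrac{1}{2} \iff w x^{*} + b = 0 \iff x^{*} = -\frac{b}{w}
$$

Compounds with $x > x^{*}$ are then predicted organic (1) when $w > 0$, and those with $x < x^{*}$ inorganic (0).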

My next step was to use the logistic regression algorithm found in the sklearn Python module to determine the decision boundary – a classification model. To do this, I had to split the data into two groups known as the train and the test data sets. The train data set contained 75% of the data and was used to train the algorithm to find a decision boundary. The test data set contained 25% of the data and was used to test the accuracy of the model. Below are the results of the test data with 1 being a prediction of organic, 0 being inorganic and x being the mass % Carbon as a decimal number.

### Building A Confusion Matrix
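
The 75/25 split, the fit and the confusion matrix can be sketched together roughly as follows. This is a hedged illustration using scikit-learn, not the notebook's actual code: the data values are made up, and `stratify=y` and `random_state=0` are assumptions added so that the small example is balanced and repeatable.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Made-up illustrative values, NOT the data set from this post.
carbon_fraction = [0.00, 0.05, 0.10, 0.12, 0.14, 0.20,
                   0.40, 0.52, 0.60, 0.75, 0.80, 0.92]
is_organic = [0] * 6 + [1] * 6

# scikit-learn expects a 2-D feature array: one row per compound,
# one column per feature (here, only the carbon mass fraction).
X = [[value] for value in carbon_fraction]
y = is_organic

# 75% of the rows for training, 25% held back for testing.
# stratify and random_state are assumptions, not from the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions on the held-out rows: 1 = organic, 0 = inorganic.
y_pred = model.predict(X_test)

# Rows of the matrix are true labels, columns are predicted labels.
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_test, y_pred)

print(matrix)
print(f"accuracy: {accuracy:.2f}")
```

The diagonal of the matrix counts correct predictions, so accuracy is the diagonal sum divided by the number of test rows.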

How accurate is my model? In order to find the accuracy of my model, I needed to see the number of correct and incorrect predictions made by the model on the test data. To do this I made a confusion matrix which is a summary of prediction results. As you can see below, my model has an accuracy of 85%.

### Testing With New Data
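
Scoring a single new compound could look something like the sketch below. The training values are made up for illustration and the variable names are hypothetical; only the input of 0.67 comes from this post.

```python
from sklearn.linear_model import LogisticRegression

# Made-up training values for illustration, NOT the post's data set.
X_train = [[0.00], [0.05], [0.10], [0.14], [0.20],
           [0.40], [0.52], [0.60], [0.80], [0.92]]
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = LogisticRegression().fit(X_train, y_train)

# A new compound with a carbon mass fraction of 0.67.
new_compound = [[0.67]]

# predict returns the class label: 1 = organic, 0 = inorganic.
predicted_label = int(model.predict(new_compound)[0])

# predict_proba returns one column per class, ordered as model.classes_,
# so column 1 is the estimated probability of the "organic" class.
probability_organic = float(model.predict_proba(new_compound)[0][1])

print(predicted_label, round(probability_organic, 2))
```

`predict` gives only the 0 or 1 score, while `predict_proba` shows how confident the model is in that score.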

Finally, I had the chance to test the model using some new data. I was very pleased with the results. As you can see below it returned a score of 1 for a compound of mass % Carbon 0.67. What this means is that the model thinks that the compound is most likely organic. Knowing what I discovered while gathering my initial data set, this is CORRECT.

### Conclusion

As you can see above, building a model that classifies compounds as organic or inorganic – with some level of accuracy – is possible. I hope that this blog post inspires you to use machine learning.

It is worth noting that more investigation is needed to ensure the best results.

* Since technology is continually developing, by the time you read this blog the products used may have changed.

# Organic Compound Identifier

Find the percentage probability that your compound is organic by entering the mass % Carbon found in your compound as a decimal number. The higher the percentage probability returned by the application, the more likely it is that your compound is organic. The decision boundary is 42%: if the percentage probability returned is less than 42%, the algorithm thinks that your compound is inorganic.
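
As a rough sketch of the rule described above, the snippet below turns a carbon mass fraction into a percentage probability and compares it with the 42% cutoff. The weight and intercept are hypothetical placeholders, not the fitted values behind this application, and the function names are chosen for illustration.

```python
import math

# Cutoff quoted in the text above, expressed as a fraction.
DECISION_THRESHOLD = 0.42

# Hypothetical coefficients for illustration only; the real
# application's fitted values are not given in this post.
EXAMPLE_WEIGHT = 8.0
EXAMPLE_BIAS = -3.0


def probability_organic(carbon_fraction, weight, bias):
    """Logistic model: estimated P(organic) for a carbon mass fraction."""
    return 1.0 / (1.0 + math.exp(-(weight * carbon_fraction + bias)))


def classify(probability, threshold=DECISION_THRESHOLD):
    """Below the threshold is inorganic; at or above it is organic."""
    return "organic" if probability >= threshold else "inorganic"


p = probability_organic(0.67, EXAMPLE_WEIGHT, EXAMPLE_BIAS)
print(f"{p * 100:.2f}% -> {classify(p)}")
```

Keeping the cutoff in one named constant makes it easy to see how the classification changes if a different threshold is chosen.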
