Classifying compounds as organic or inorganic is a useful skill for anyone working with or studying chemicals. In this blog post, I show how I used machine learning to build a model that predicts whether a compound is organic or inorganic from the mass % of Carbon it contains.
Gathering The Data
I began by gathering real-world data on the mass % Carbon of different compounds. Each organic compound was given a score of 1 and each inorganic compound a score of 0. In the end, I had a table of data with over 70 rows.
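For readers who want to build a similar table, here is one possible way to compute the mass % Carbon as a decimal from a compound's formula. This is only a sketch, not the method used to gather the data in this post: the `carbon_mass_fraction` helper, the `ATOMIC_MASS` table and the four example rows are illustrative assumptions based on rounded standard atomic masses.

```python
# Rounded standard atomic masses in g/mol (only the elements used below).
ATOMIC_MASS = {
    "H": 1.008,
    "C": 12.011,
    "O": 15.999,
    "Na": 22.990,
    "Cl": 35.45,
}


def carbon_mass_fraction(composition):
    """Return the mass fraction of carbon (0 to 1) for a formula.

    `composition` maps element symbols to atom counts,
    e.g. {"C": 1, "H": 4} for methane.
    """
    total = sum(ATOMIC_MASS[element] * count
                for element, count in composition.items())
    carbon = ATOMIC_MASS["C"] * composition.get("C", 0)
    return carbon / total


# Hypothetical example rows: (name, composition, label),
# where label 1 = organic and label 0 = inorganic.
rows = [
    ("methane", {"C": 1, "H": 4}, 1),
    ("glucose", {"C": 6, "H": 12, "O": 6}, 1),
    ("carbon dioxide", {"C": 1, "O": 2}, 0),
    ("sodium chloride", {"Na": 1, "Cl": 1}, 0),
]

for name, composition, label in rows:
    x = carbon_mass_fraction(composition)
    print(f"{name:16s} x = {x:.3f}  label = {label}")
```

Each row ends up with the two columns the model needs: x (the mass % Carbon as a decimal) and the 0 or 1 label.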
Visualizing The Data
I then used Python in a Kaggle notebook to draw a simple scatter plot showing the distribution of the data. What I saw was a clear boundary between the two groups. This made me realise that I could use the logistic regression algorithm to find a decision boundary that would let me determine, with some level of accuracy, whether a new compound not found in the data set is organic or inorganic.
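A scatter plot like the one described above could be drawn roughly as follows. This is a minimal sketch assuming matplotlib is available. The sixteen values are illustrative mass fractions computed from well-known formulas with rounded atomic masses, not the data set from this post, and the axis labels are my own choice.

```python
import matplotlib.pyplot as plt

# Illustrative values only (NOT the data set from this post).
# Organic: methane, ethane, propane, benzene, ethanol, acetone,
# glucose, methanol.
organic = [0.749, 0.799, 0.817, 0.923, 0.521, 0.620, 0.400, 0.375]
# Inorganic: water, sodium chloride, ammonia, calcium carbonate,
# sodium carbonate, sodium bicarbonate, carbon dioxide, silicon carbide.
inorganic = [0.000, 0.000, 0.000, 0.120, 0.113, 0.143, 0.273, 0.300]

fig, ax = plt.subplots()
# x axis: mass % Carbon as a decimal; y axis: the 0/1 class label.
ax.scatter(inorganic, [0] * len(inorganic), label="inorganic (0)")
ax.scatter(organic, [1] * len(organic), label="organic (1)")
ax.set_xlabel("mass % Carbon (decimal)")
ax.set_ylabel("class label")
ax.set_yticks([0, 1])
ax.legend()
```

In a notebook the figure is normally displayed at the end of the cell. In a plain script, call `plt.show()` afterwards.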
Using The Logistic Regression Algorithm
My next step was to use the logistic regression algorithm from scikit-learn (the sklearn Python package) to fit a classification model, which amounts to finding the decision boundary. To do this, I had to split the data into two groups known as the train and test data sets. The train data set contained 75% of the data and was used to train the algorithm to find a decision boundary. The test data set contained the remaining 25% and was used to test the accuracy of the model. Below are the results for the test data, with 1 being a prediction of organic, 0 being a prediction of inorganic and x being the mass % Carbon as a decimal number.
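The split-and-fit step described above can be sketched with scikit-learn as shown here. This is not the exact notebook code from this post. The data are the same kind of illustrative values (computed from common formulas, not my real table), and `stratify=y` and `random_state=0` are assumptions I added so that the small example splits evenly and repeatably.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data only (NOT the data set from this post):
# mass fraction of carbon, followed by the label (1 = organic, 0 = inorganic).
carbon_fraction = [0.749, 0.799, 0.817, 0.923, 0.521, 0.620, 0.400, 0.375,
                   0.000, 0.000, 0.000, 0.120, 0.113, 0.143, 0.273, 0.300]
is_organic = [1, 1, 1, 1, 1, 1, 1, 1,
              0, 0, 0, 0, 0, 0, 0, 0]

# scikit-learn expects a 2-D feature array: one row per compound.
X = np.array(carbon_fraction).reshape(-1, 1)
y = np.array(is_organic)

# 75% of the rows for training, 25% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions on the held-back test rows.
predictions = model.predict(X_test)
for x_value, predicted in zip(X_test.ravel(), predictions):
    print(f"x = {x_value:.3f} -> predicted {predicted}")

# With one feature, predict() switches from 0 to 1 where the
# probability crosses 50%, i.e. where intercept + coef * x = 0.
boundary = -model.intercept_[0] / model.coef_[0][0]
print(f"50% boundary at x = {boundary:.3f}")
```

With a single feature, the decision boundary is just one value of x: compounds above it are predicted organic and compounds below it inorganic.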
Building A Confusion Matrix
How accurate is my model? To find out, I needed to count the correct and incorrect predictions made by the model on the test data. To do this, I made a confusion matrix, which is a table summarising how many predictions of each class were right or wrong. As you can see below, my model has an accuracy of 85% on the test data.
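To show how a confusion matrix turns into an accuracy figure, here is a small worked sketch in plain Python. The `actual` and `predicted` lists are made up for illustration and are not my test results, so the accuracy they give is not the 85% reported above.

```python
from collections import Counter

# HYPOTHETICAL labels for illustration only (not the post's test data).
# 1 = organic, 0 = inorganic.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

# Count each (actual, predicted) pair.
counts = Counter(zip(actual, predicted))
true_pos = counts[(1, 1)]   # organic, predicted organic
false_neg = counts[(1, 0)]  # organic, predicted inorganic
false_pos = counts[(0, 1)]  # inorganic, predicted organic
true_neg = counts[(0, 0)]   # inorganic, predicted inorganic

# Accuracy = correct predictions / all predictions.
accuracy = (true_pos + true_neg) / len(actual)

print("            predicted 0  predicted 1")
print(f"actual 0    {true_neg:^11}  {false_pos:^11}")
print(f"actual 1    {false_neg:^11}  {true_pos:^11}")
print(f"accuracy = {accuracy:.0%}")
```

The correct predictions sit on the diagonal of the table. scikit-learn's `sklearn.metrics.confusion_matrix` builds the same table directly from the two label lists.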
Testing With New Data
Finally, I had the chance to test the model using some new data. I was very pleased with the results. As you can see below, it returned a score of 1 for a compound with a mass % Carbon of 0.67. What this means is that the model thinks the compound is most likely organic. Based on what I learned while gathering my initial data set, this is correct.
As you can see above, building a model that classifies compounds as organic or inorganic, with some level of accuracy, is possible. I hope that this blog post inspires you to use machine learning.
It is worth noting that more investigation is needed to ensure the best results.
* Since technology is continually developing, by the time you read this blog the products used may have changed.
Organic Compound Identifier
Find the percentage probability that your compound is organic by entering the mass % Carbon found in your compound as a decimal number. The higher the percentage probability returned by the application, the higher the chance that your compound is organic. The decision boundary is 42%: if the percentage probability returned is less than 42%, the algorithm thinks that your compound is inorganic.
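For the curious, the calculation behind an identifier like this can be sketched in a few lines. A logistic regression model turns x into a probability with the sigmoid function, and the 42% figure is then applied as a cut-off on that probability, as described above. The `WEIGHT` and `INTERCEPT` values below are hypothetical placeholders, not the fitted coefficients of my model, and the function names are my own.

```python
import math

# HYPOTHETICAL coefficients for illustration only; the fitted values
# behind the application are not given in this post.
WEIGHT = 8.0
INTERCEPT = -3.0
THRESHOLD = 0.42  # the 42% cut-off quoted above


def probability_organic(carbon_fraction):
    """Logistic (sigmoid) probability that a compound is organic."""
    z = INTERCEPT + WEIGHT * carbon_fraction
    return 1.0 / (1.0 + math.exp(-z))


def classify(carbon_fraction):
    """Apply the cut-off: below 42% counts as inorganic."""
    p = probability_organic(carbon_fraction)
    return "organic" if p >= THRESHOLD else "inorganic"


for x in (0.10, 0.67):
    p = probability_organic(x)
    print(f"x = {x:.2f}: P(organic) = {p:.1%} -> {classify(x)}")
```

With real fitted coefficients in place of the placeholders, the same two functions would reproduce the percentage and the organic or inorganic verdict for any x you enter.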