Diabetes recognition using Machine Learning

Shriya Tyagi
Oct 16, 2019

“Machine Intelligence is the last invention that humanity will ever need to make.”

This article shows how machine learning can be applied in healthcare to predict diseases such as diabetes. Specifically, it walks through how diabetes-related data can be used to predict whether a person has diabetes or not.

What is Diabetes?

Diabetes is a serious condition that causes higher than normal blood sugar levels. Diabetes occurs when your body cannot make or effectively use its own insulin, a hormone made by special cells in the pancreas. Insulin serves as a “key” to open your cells, to allow the sugar (glucose) from the food you eat to enter. Then, your body uses that glucose for energy.

Step 0: Data Readiness

The most tedious part of any project is often preparing the data set. Even though an enormous amount of data is available these days, it is still hard to find a data set that suits the problem you are trying to solve. If no suitable data set can be found, you may have to collect and build one yourself.

In this exercise, we will use an existing data set, the “Pima Indians Diabetes Database” provided by the UCI Machine Learning Repository. You can also find the data set in my GitHub Repository.

Step 1: Data Exploration

This stage is essential for deciding whether data cleaning is required. For that, we first need to get familiar with the data set.

To begin with, we import the essential libraries and load the data set into the Jupyter notebook. We can then print the columns of the data set.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)

We can inspect the first few rows of the data set using pandas’ head() method.

diabetes.head()
Diabetes Data Set

We can find the dimensions of the data set using the pandas DataFrame’s shape attribute.

print("Diabetes data set dimensions : {}".format(diabetes.shape))

Diabetes data set dimensions : (768, 9)

We can see that the data set contains 768 rows and 9 columns. ‘Outcome’ is the column we will predict, which says whether the patient is diabetic or not: 1 means the person is diabetic and 0 means they are not. Out of the 768 people, 500 are labelled as 0 (non-diabetic) and 268 as 1 (diabetic).

diabetes.groupby('Outcome').size()
Class Distribution

In this exercise we will use pandas’ plotting, which is built on top of matplotlib, to inspect the distribution of each feature.
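The exact call that produced the distribution plot below is not shown in the snippets; assuming we simply want per-feature histograms of the full data set, a minimal version using the DataFrame’s built-in hist() method would be:

# Per-feature histograms for the full data set (an assumed reconstruction
# of the plotting call; the original snippet is not shown).
diabetes.hist(figsize=(9, 9))
plt.show()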

Data distribution

We can use the code below to draw histograms for the two outcome classes separately.

diabetes.groupby('Outcome').hist(figsize=(9, 9))

Step 2: Data Cleaning

The next stage of the machine learning workflow is data cleaning. It is viewed as one of the most important steps of the workflow, since it can make or break the model.

There are a few things to consider in the data cleaning process:

· Duplicate or irrelevant observations.

· Badly labelled data, such as the same category appearing multiple times.

· Missing, null or invalid data points.

· Unexpected outliers.

Since we are using a standard data set, we can safely assume that the first two points are already dealt with. Unexpected outliers can be either useful or potentially harmful and need case-by-case judgement. Invalid data points do appear here: several medical columns contain zero readings that effectively act as missing values, as shown in the sketch below.
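A minimal sketch of one way to handle these zero readings is shown below. The column list and the name diabetes_mod are assumptions (diabetes_mod is the cleaned frame referenced later in the train/test split); the original cleaning code is not shown in this article.

import numpy as np

# Zero readings in these columns are physiologically impossible and
# effectively act as missing values in the Pima data set.
invalid_zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# One simple option (an assumption, not necessarily the original approach):
# keep only the rows where all of these columns are non-zero.
diabetes_mod = diabetes[(diabetes[invalid_zero_cols] != 0).all(axis=1)]
print(diabetes_mod.shape)

Dropping rows is the simplest option; imputing each column’s median is a common alternative when you do not want to lose observations.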

Step 3: Model Selection

The model selection or algorithm selection stage is the most exciting part and the heart of machine learning. It is where we select the model that performs best for the data set at hand.

First, we will be computing the “Classification Accuracy (Testing Accuracy)” of a given set of classification models with their default parameters to figure out which model performs better with the diabetes data set.

We will import the fundamental libraries into the notebook. We import 7 classifiers, namely

· K-Nearest Neighbours

· Support Vector Classifier

· Logistic Regression

· Decision Tree

· Gaussian Naive Bayes

· Random Forest

· Gradient Boost

to be contenders for the best classifier.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

We will initialize the classifier models with their default parameters and add them to a model list.

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVC', SVC()))
models.append(('LR', LogisticRegression()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))

Assessment Methods

It is general practice to avoid training and testing on the same data. The reasons are that the goal of the model is to predict out-of-sample data, and an overly complex model can overfit the training set. To avoid these issues, there are two precautions:

1. Train/Test Split

2. K-Fold Cross-Validation

We will import “train_test_split” for the train/test split and “cross_val_score” for k-fold cross-validation. “accuracy_score” is used to evaluate the accuracy of the model in the train/test split method.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

We will apply both methods to find the best-performing base models.

Train/Test Split

This method splits the data set into two portions: a training set and a testing set. The training set is used to train the model, and the testing set is used to test it and evaluate the accuracy.

Pros : Useful because of its flexibility and speed.

Cons : Provides a high-variance estimate of out-of-sample accuracy.

Train/Test Split

Train/Test Split with Scikit Learn :

Next, we split the features and the target into train and test partitions. We stratify the samples (a procedure that ensures each response class is represented in equal proportions in each partition).
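The split below uses a feature matrix X and a target vector y that are not defined in the snippets shown. A minimal definition, assuming diabetes_mod is the cleaned frame from the cleaning sketch above and using all remaining columns as features, would be:

# Features: every column except the target (an assumed feature selection).
X = diabetes_mod.drop('Outcome', axis=1)
# Target: the 'Outcome' column (1 = diabetic, 0 = non-diabetic).
y = diabetes_mod['Outcome']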

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = diabetes_mod.Outcome, random_state=0)

Then, we fit each model in a loop and compute the accuracy of each model using “accuracy_score”.

names = []
scores = []

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))
    names.append(name)

tr_split = pd.DataFrame({'Name': names, 'Score': scores})
print(tr_split)
Train/Test Accuracy Scores

K-Fold Cross-Validation

This method splits the data set into K equal partitions (“folds”), then uses 1 fold as the testing set and the union of the other folds as the training set. The model is then tested for accuracy. The procedure is repeated K times, using a different fold as the testing set each time. The average testing accuracy across the K runs is the overall testing accuracy.

K-Fold Cross Validation when K=5
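To make the fold mechanics concrete, here is a small illustrative sketch (not part of the original workflow) showing how scikit-learn’s KFold partitions a toy set of 10 sample indices into 5 folds:

from sklearn.model_selection import KFold
import numpy as np

# Split 10 sample indices into 5 folds; each index appears in the
# test set exactly once across the 5 iterations.
toy_data = np.arange(10)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy_data)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")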

K-Fold Cross Validation with Scikit Learn :

We will move ahead with K-Fold cross-validation as it is more accurate and uses the data more efficiently. We will train the models using 10-fold cross-validation and compute the mean accuracy of each model. “cross_val_score” provides its own interface for training and computing the accuracy.

from sklearn.model_selection import KFold

names = []
scores = []

for name, model in models:
    # shuffle=True is required when a random_state is passed to KFold
    # in recent versions of scikit-learn.
    kfold = KFold(n_splits=10, shuffle=True, random_state=10)
    score = cross_val_score(model, X, y, cv=kfold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)

kf_cross_val = pd.DataFrame({'Name': names, 'Score': scores})
print(kf_cross_val)
K-Fold Cross-Validation Accuracy Scores

We can plot the accuracy scores using seaborn.

axis = sns.barplot(x='Name', y='Score', data=kf_cross_val)
axis.set(xlabel='Classifier', ylabel='Accuracy')

for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center")

plt.show()
Accuracy of Classifiers

We can see that Logistic Regression, Gaussian Naive Bayes, Random Forest, and Gradient Boosting have performed better than the rest. At this baseline level, Logistic Regression performs best of all.

Calculating Accuracy Scores

🚩 Finally, Logistic Regression managed to achieve a classification accuracy of 77.50 %. This will be selected as the prime candidate for the next phases.

Summary

In this article, we discussed the basic machine learning workflow steps such as data exploration, data cleaning, and model selection using the Scikit Learn library.

The Jupyter notebook for this article is available on Google Colab. You can also find the dataset for the same in my GitHub Repository.

Thank you, and I am open to your suggestions.

Hope you enjoyed my article!

If you like my article, please don’t forget to give a clap 👏 .
