Python and machine learning II

Asst. Prof. Teerasak E-kobon

15 Feb 2021

The previous chapter introduced an example of using machine learning (ML) in bioinformatics. This chapter discusses the different ML algorithms available to bioinformaticians. ML algorithms can be grouped into four types.

1) Supervised learning requires training data together with the associated outputs (labels) during learning. Examples include decision trees, random forests, and logistic regression. Supervised learning is frequently used for classification and regression.

2) Unsupervised learning does not require pre-labelled training data and extracts useful patterns from the input data alone. Typical tasks are clustering (e.g. K-means clustering), association, and dimensionality reduction (e.g. principal component analysis or PCA). Note that K-nearest neighbours (KNN) and discriminant analysis both rely on class labels and therefore belong to supervised learning.

3) Semi-supervised learning combines supervised and unsupervised components, typically by training on a small amount of labelled data together with a larger amount of unlabelled data.

4) Reinforcement learning trains an agent over a period of time in a specific environment, using feedback from that environment, until it reaches an optimal solution. The solution can be adapted if the environment or its conditions change.
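To make the difference between the first two types concrete, the following is a minimal sketch in plain Python, not taken from any library. The toy values, the labels, and the helper functions (fit_nearest_centroid, predict_nearest_centroid, two_means) are all hypothetical and chosen only for illustration. The supervised part needs the labels to learn one centre per class, whereas the unsupervised part groups the same values without seeing any labels.

```python
# Toy one-dimensional data: hypothetical glucose-like values, not from the dataset
values = [85.0, 90.0, 95.0, 150.0, 160.0, 170.0]
labels = [0, 0, 0, 1, 1, 1]  # known outputs, used only by the supervised method


def mean(xs):
    return sum(xs) / len(xs)


def fit_nearest_centroid(xs, ys):
    """Supervised: use the labels to compute one mean (centroid) per class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}


def predict_nearest_centroid(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, n_iter=10):
    """Unsupervised: split xs into two groups without looking at any labels."""
    centres = [min(xs), max(xs)]  # simple deterministic starting centres
    groups = [[], []]
    for _ in range(n_iter):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            groups[nearest].append(x)
        # keep the old centre if a group happens to be empty
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


centroids = fit_nearest_centroid(values, labels)
print("Class centroids:", centroids)
print("Predicted class for 155:", predict_nearest_centroid(centroids, 155.0))

centres, groups = two_means(values)
print("Cluster centres:", centres)
print("Clusters:", groups)
```

On this toy data both approaches find the same two groups, but only the supervised model can name the class of a new sample, because only it was given the labels.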

Several Python packages are available for ML analysis, e.g. numpy, scipy, pandas, and scikit-learn. This chapter uses the pima-indians-diabetes dataset for the tutorial. The first example shows how to import the dataset and carry out basic data handling and visualization. If any of these modules is missing, install it with the pip command in the cmd terminal. Visualizing the data helps to understand its nature and to select suitable pre-processing methods if required.
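One possible way to find out which modules are missing before running the examples is sketched below. It uses only the standard library; the REQUIRED dictionary and the missing_packages helper are hypothetical names introduced here. Note that scikit-learn is imported as sklearn but installed with pip under the name scikit-learn.

```python
import importlib.util

# import name -> name used with pip (these differ for scikit-learn)
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "matplotlib": "matplotlib",
}


def missing_packages(required):
    """Return the pip names of modules that cannot be found in this environment."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install the missing packages with:")
    print("pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```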

Example 1 How to use the numpy, pandas, and matplotlib modules for data import and visualization. The nine columns of the dataset are 1. Number of times pregnant, 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test, 3. Diastolic blood pressure (mm Hg), 4. Triceps skin fold thickness (mm), 5. 2-Hour serum insulin (mu U/ml), 6. Body mass index (weight in kg/(height in m)^2), 7. Diabetes pedigree function, 8. Age (years), and 9. Class variable (0 or 1).

# load the CSV file into a numpy array
from numpy import loadtxt
path = r"C:\Users\TeerasakArt\Downloads\pima-indians-diabetes-2.csv"
with open(path, 'r') as datapath:
    data = loadtxt(datapath, delimiter=",")

from pandas import read_csv
path = r"C:\Users\TeerasakArt\Downloads\pima-indians-diabetes-2.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)

# check data types
print(data.dtypes)

# print basic statistical analysis
print(data.describe())

# check data balance in the class column
count_class = data.groupby('class').size()
print(count_class)

# plotting histogram
from matplotlib import pyplot
data.hist()
pyplot.show()

# density plot
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

# box plot
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()
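# plotting a correlation matrix
import numpy
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# label both axes with the column names
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()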


Question 1 From the pima-indians-diabetes-2 dataset, please specify the data type for each of these nine columns.

After visualizing the raw data, data preparation is required to make the data suitable for the ML algorithms. Data preparation or pre-processing methods include data scaling (rescaling every feature to the same range, e.g. 0 to 1), normalization (rescaling each row of the data to a length of 1), binarization (converting values to 0 or 1 according to a specified threshold), and standardization (transforming each feature to a mean of 0 and a standard deviation of 1, which suits data that follow a Gaussian distribution). The second example shows the process of data preparation.
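The arithmetic behind these four methods can be sketched in plain Python as follows. This is only an illustration on a hypothetical 3 x 2 toy matrix; the function names (min_max_scale, standardize, normalize, binarize) are invented here and are not the scikit-learn API. Scaling and standardization work on columns (features), while normalization and binarization are applied to each row.

```python
import math

# Hypothetical toy matrix: 3 samples (rows) x 2 features (columns)
rows = [[1.0, 10.0],
        [2.0, 20.0],
        [3.0, 60.0]]

columns = list(zip(*rows))  # column-wise view of the same data


def min_max_scale(col):
    """Scaling: map one feature to the range 0..1."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]


def standardize(col):
    """Standardization: subtract the mean, divide by the population std."""
    m = sum(col) / len(col)
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
    return [(v - m) / sd for v in col]


def normalize(row, norm="l2"):
    """Normalization: rescale one row so that its L1 or L2 norm equals 1."""
    if norm == "l1":
        size = sum(abs(v) for v in row)
    else:
        size = math.sqrt(sum(v * v for v in row))
    return [v / size for v in row]


def binarize(row, threshold):
    """Binarization: 1 if the value is above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in row]


print("Scaled columns:", [min_max_scale(c) for c in columns])
print("Standardized columns:", [standardize(c) for c in columns])
print("L1-normalized rows:", [normalize(r, "l1") for r in rows])
print("L2-normalized rows:", [normalize(r, "l2") for r in rows])
print("Binarized rows (threshold 2.5):", [binarize(r, 2.5) for r in rows])
```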

Example 2 Methods for the data preparation

#scale the data to range of 0 and 1
from numpy import set_printoptions
from sklearn import preprocessing
dataframe = read_csv(path, names=names)
array = dataframe.values
data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)
print ("\nScaled data:\n", data_rescaled[0:10])

# L1 normalization: the absolute values in each row sum to 1
from sklearn.preprocessing import Normalizer
dataframe = read_csv (path, names=names)
array = dataframe.values
Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)
print ("\nNormalized data:\n", Data_normalized [0:3])

# L2 normalization: the squared values in each row sum to 1
Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)
print ("\nNormalized data:\n", Data_normalized [0:3])

# binarization
from sklearn.preprocessing import Binarizer
dataframe = read_csv(path, names=names)
array = dataframe.values
binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)
print ("\nBinary data:\n", Data_binarized [0:5])

# Standardization
from sklearn.preprocessing import StandardScaler
dataframe = read_csv(path, names=names)
array = dataframe.values
data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)
print ("\nRescaled data:\n", data_rescaled [0:5])

Question 2 From the second example, does the pima-indians-diabetes data need to be pre-processed? Please explain your reasons.

The next step after data preparation is to select the data features used for training the ML model, which is important for its performance. Good feature selection reduces overfitting, increases model accuracy, and lowers training time. Several methods can be used. Univariate selection uses statistical tests to select the features that have the strongest relationship with the output variable. Recursive feature elimination (RFE) repeatedly removes the weakest features and rebuilds the model from the remaining features. Principal component analysis (PCA) is a data reduction method that uses linear algebra to transform the dataset into a compressed form, with the number of principal components in the output chosen by the user. Feature importance, estimated by tree-based ensembles such as extra trees, scores each feature by its contribution to the model. These methods are shown in the third example.
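The idea of univariate selection, scoring each feature on its own against the output and keeping the k best, can be sketched in plain Python. This sketch uses the absolute Pearson correlation as the score, which is a simpler stand-in and not the chi-squared test used with scikit-learn in this chapter. The toy data and the helper names (pearson, select_k_best) are hypothetical.

```python
import math

# Hypothetical toy data: 6 samples, 3 features, and a binary output
X = [[1.0, 5.0, 0.3],
     [2.0, 3.0, 0.9],
     [3.0, 6.0, 0.1],
     [4.0, 2.0, 0.8],
     [5.0, 7.0, 0.2],
     [6.0, 4.0, 0.7]]
y = [0, 0, 0, 1, 1, 1]


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def select_k_best(X, y, k):
    """Score every feature on its own and return the indices of the k best."""
    scores = [abs(pearson(col, y)) for col in zip(*X)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


selected, scores = select_k_best(X, y, k=2)
print("Scores per feature:", [round(s, 3) for s in scores])
print("Selected feature indices:", selected)
```

Because each feature is scored independently, this approach is fast, but it cannot detect features that are only useful in combination with others.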


Example 3 Code showing different feature selection methods

#univariate selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
dataframe = read_csv(path, names=names)
array = dataframe.values

# separate input and output arrays
X = array[:,0:8]
Y = array[:,8]

# select the best features
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)

# data summary
featured_data = fit.transform(X)
print ("\nFeatured data:\n", featured_data[0:4])

# Recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
dataframe = read_csv(path, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of Features: ", fit.n_features_)
print("Selected Features: ", fit.support_)
print("Feature Ranking: ", fit.ranking_)

# principal component analysis
from sklearn.decomposition import PCA
dataframe = read_csv(path, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: ", fit.explained_variance_ratio_)

# feature importance
from sklearn.ensemble import ExtraTreesClassifier
dataframe = read_csv(path, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

Question 3 After you have tried the different feature selection methods on the pima-indians-diabetes dataset, which three features would be the best for developing the ML model for this dataset? Please explain.

Once the data features have been selected, the ML model is developed by choosing a suitable ML algorithm. For the pima-indians-diabetes dataset, a supervised learning algorithm, the classification decision tree, is chosen; it splits the dataset in different ways according to different conditions. The decision tree algorithm first defines a cost function for evaluating binary splits, then evaluates all candidate splits and splits the dataset at the best one. The tree is then built from the root node by recursively creating child nodes from the split datasets until terminal nodes are reached. The code is shown in the fourth example.
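The split evaluation step can be illustrated with a short sketch in plain Python. It assumes the Gini impurity as the cost function, which is the default criterion of scikit-learn's DecisionTreeClassifier, and searches for the best threshold on a single feature. The toy values and the helper names (gini_impurity, split_cost, best_split) are hypothetical; a real tree repeats this search over all features and then recursively inside each child node.

```python
def gini_impurity(labels):
    """Gini impurity of one group: 1 - sum of squared class proportions."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_cost(left, right):
    """Cost of a binary split: size-weighted Gini impurity of both groups."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_split(values, labels):
    """Evaluate every candidate threshold on one feature and keep the cheapest."""
    best = None
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if not left or not right:
            continue  # a split must send samples to both child nodes
        cost = split_cost(left, right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical glucose-like values with their class labels
glucose = [85, 90, 100, 120, 150, 160, 170]
label = [0, 0, 0, 1, 0, 1, 1]

threshold, cost = best_split(glucose, label)
print("Best threshold:", threshold)
print("Weighted Gini impurity:", round(cost, 3))
```

A cost of 0 would mean that both child nodes contain a single class only; the lower the weighted impurity, the better the split separates the classes.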

Example 4 Code for building the decision tree model based on the pima-indians-diabetes dataset

# build the decision tree model
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\Users\TeerasakArt\Downloads\pima-indians-diabetes-2.csv", header=None, names=col_names)

# split the dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

# divide the data into train and test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# train the model
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# make a prediction
y_pred = clf.predict(X_test)

# calculate accuracy score, confusion matrix and classification report
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:", result2)
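How these scores relate to the confusion matrix can be shown with a small sketch in plain Python. The labels below are hypothetical and are not results from the pima-indians-diabetes model; the helper confusion_counts is also invented for illustration. For a binary problem, scikit-learn's confusion_matrix arranges the counts as [[TN, FP], [FN, TP]], with true classes in rows and predicted classes in columns.

```python
# Hypothetical true and predicted labels for ten test samples
true_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
pred_labels = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]


def confusion_counts(y_true, y_hat):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_hat):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_counts(true_labels, pred_labels)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print("Confusion matrix:", [[tn, fp], [fn, tp]])
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Accuracy alone can be misleading when the classes are imbalanced, which is why the classification report also lists precision and recall for each class.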

Question 4 From the fourth example, did the ML model give good prediction results? How do you know this?

Question 5 Staying with supervised learning, if you do not use the decision tree algorithm, please modify the given code to use other supervised learning algorithms and compare their performance.