Machine learning and bioinformatics data analysis

Asst. Prof. Teerasak E-kobon

22 February 2021

This chapter introduces the steps for designing and developing a small but complete machine learning (ML) project for the analysis of bioinformatics data. A clear problem has to be defined before data acquisition and preparation, and several ML algorithms may have to be evaluated to select the best one before making predictions and presenting the results. This example requires the scipy, numpy, matplotlib, pandas, and sklearn modules.

The iris dataset is used again to build classifiers that differentiate the three plant species from the measured sepal and petal data, with the class column serving as the known label. Descriptive statistics and exploratory plots are produced first to understand the data and guide algorithm selection. The dataset contains four continuous features, and the classes are partially linearly separable in some dimensions, so simple supervised learning methods are suitable: logistic regression (LR), linear discriminant analysis (LDA), K-nearest neighbors (KNN), classification and regression trees (CART), Gaussian Naive Bayes (NB), and support vector machines (SVM).

The data are first split into a training set (80%) and a held-out validation set (20%). Each algorithm is evaluated on the training set by 10-fold cross-validation: the training data are split into 10 parts, the model is trained on nine parts and tested on the remaining one, and this is repeated until every part has served once as the test set. The performance of each model is measured by accuracy, the proportion of correct predictions. In the example, the algorithm with the best accuracy score is chosen to build the prediction program and is then tested against the validation dataset. The final evaluation compares the predicted results with the expected (known) results; the classification scores reported for this example are precision, recall, f1-score, and support.
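The splitting step of 10-fold cross-validation can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name k_fold_indices is hypothetical, and the sketch neither shuffles the samples nor keeps the class proportions equal in every fold, both of which the StratifiedKFold class used in Example 1 does. The figure of 120 samples assumes an 80% training share of the 150 iris rows.

```python
# Minimal sketch of plain k-fold splitting (no shuffling, no stratification).
# k_fold_indices is a hypothetical helper, not part of scikit-learn.

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 120 samples corresponds to an 80% training share of the 150 iris rows.
folds = list(k_fold_indices(120, 10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print('fold %d: train on %d samples, test on %d' % (fold_number, len(train), len(test)))
```

Every sample lands in the test part exactly once, so the 10 accuracy values together cover the whole training set.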

Example 1 Code for developing the ML classification model for the iris data

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# shape
print(dataset.shape)

# head
print(dataset.head(20))

# descriptions
print(dataset.describe())

# class distribution
print(dataset.groupby('class').size())

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

# histograms
dataset.hist()
pyplot.show()

# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()

# Separate out a validation dataset.
# Set up the test harness to use 10-fold cross-validation.
# Build multiple different models to predict species from flower measurements.
# Select the best model.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

# Let’s test 6 different algorithms:
#
# Logistic Regression (LR)
# Linear Discriminant Analysis (LDA)
# K-Nearest Neighbors (KNN)
# Classification and Regression Trees (CART)
# Gaussian Naive Bayes (NB)
# Support Vector Machines (SVM)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn with stratified 10-fold cross-validation
results = []
model_names = []  # separate name so the column-name list 'names' is not overwritten
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
for name, model in models:
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    model_names.append(name)
    # mean accuracy and standard deviation across the 10 folds
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Compare Algorithms
pyplot.boxplot(results, labels=model_names)
pyplot.title('Algorithm Comparison')
pyplot.show()

# Make predictions on the validation dataset with the selected model (SVM in this example)
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
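The scores printed by classification_report can also be recomputed by hand, which helps when reading the report. The sketch below counts true positives, false positives, and false negatives for each class and derives precision, recall, f1-score, and support from them. The labels are toy values made up for illustration, not results from the iris model, and the helper name per_class_scores is hypothetical.

```python
# Toy labels for illustration only; these are not results from the iris model.
expected = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
predicted = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']


def per_class_scores(expected, predicted):
    """Return precision, recall, f1 and support for every class label."""
    pairs = list(zip(expected, predicted))
    scores = {}
    for label in sorted(set(expected)):
        tp = sum(1 for e, p in pairs if e == label and p == label)  # true positives
        fp = sum(1 for e, p in pairs if e != label and p == label)  # false positives
        fn = sum(1 for e, p in pairs if e == label and p != label)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # f1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # support is the number of true samples of this class
        scores[label] = {'precision': precision, 'recall': recall,
                         'f1': f1, 'support': tp + fn}
    return scores


# accuracy is the share of all predictions that match the expected label
accuracy = sum(1 for e, p in zip(expected, predicted) if e == p) / len(expected)
scores = per_class_scores(expected, predicted)
print('accuracy: %.2f' % accuracy)
for label, values in scores.items():
    print(label, values)
```

Precision answers "of the samples predicted as this class, how many were right?", while recall answers "of the samples that truly belong to this class, how many were found?".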

Question 1 From the example, can you explain the 10-fold cross-validation method? Can you name other validation methods used in ML analysis?

Question 2 Please explain the difference between the six chosen ML algorithms using simple language (no need to go deep into the equation details). Could you also suggest other algorithms that might be applicable to this iris dataset?

Question 3 The example used several scores (accuracy, precision, recall, f1-score, and support) to evaluate and compare the performance of different algorithms on the same dataset. Please explain in simple terms how these scores are calculated.

Question 4 Suppose we have a good prediction program that allows users to submit new sepal and petal width and length data. To turn the program into a graphical user interface (GUI) or an application (app), we must begin with the GUI design, which can be drafted on paper. Please show your design of this application or GUI. It can be drawn with any painting or drawing program.
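Whatever the GUI looks like, it would eventually pass the four measurements to a single prediction function and display the returned species. A minimal sketch of such a function is given below, under two assumptions: the helper name predict_species is hypothetical, and the model is trained on the copy of the iris data bundled with scikit-learn (load_iris) rather than on the CSV file from Example 1, so the species names appear as 'setosa', 'versicolor', and 'virginica'.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: the copy of the iris data bundled with scikit-learn is used here,
# so the sketch runs without downloading the CSV file.
iris = load_iris()
model = SVC(gamma='auto')
model.fit(iris.data, iris.target)


def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    """Hypothetical helper: return the predicted species name for one flower (cm)."""
    # The model expects a 2-D input: one row per flower, four columns.
    sample = [[sepal_length, sepal_width, petal_length, petal_width]]
    class_index = model.predict(sample)[0]
    return str(iris.target_names[class_index])


# Example call with measurements typical of a small-petalled flower.
print(predict_species(5.1, 3.5, 1.4, 0.2))
```

Keeping the prediction logic in one function like this lets the interface be designed separately: each input box in the GUI maps to one argument, and the output label shows the returned string.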

 

*******************************