Machine learning and bioinformatics data analysis

Asst. Prof. Teerasak E-kobon

22 February 2021

This chapter introduces the steps for designing and developing a small but complete machine learning (ML) project for the analysis of bioinformatics data. The problem must be clearly defined before the data are acquired and prepared. Several ML algorithms may then have to be evaluated to select the best one before making predictions and presenting the results. This example requires the scipy, numpy, matplotlib, pandas, and sklearn modules.

The iris dataset is used again to build classifiers (predictors) that differentiate the three plant species from the measured sepal and petal data, with the class column serving as the data label (the known answer). Descriptive statistics and exploratory plots are produced first to understand the data and guide the algorithm selection. The iris dataset contains four continuous features, and the classes are partially linearly separable in some dimensions, so a simple ML predictor can be developed with supervised learning methods such as logistic regression (LR), linear discriminant analysis (LDA), K-nearest neighbors (KNN), classification and regression trees (CART), Gaussian Naive Bayes (NB), and support vector machines (SVM).

The data are first divided by a train-test split into a training set (80%) and a validation set (20%). The algorithms are then evaluated on the training set by 10-fold cross-validation: the data are split into 10 parts, the model is trained on nine parts and tested on the remaining part, and this is repeated until every part has served once as the test set. The performance of each model is measured by its accuracy. In the example, the algorithm that gives the best accuracy score is chosen to build the prediction program, which is then tested against the validation dataset. The final evaluation compares the predicted results with the expected (known) results. The classification report for this example includes precision, recall, f1-score, and support.
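To make the 10-fold cross-validation procedure concrete, the following is a minimal sketch in plain Python, not the scikit-learn implementation. The helper names (k_fold_indices, predict_1nn, cross_validate), the toy 1-nearest-neighbour classifier, and the two-class data are all hypothetical and chosen only for illustration. The sketch uses plain shuffled folds, whereas the StratifiedKFold class used in the example additionally keeps the class proportions similar in every fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_X, train_y, point):
    """Toy 1-nearest-neighbour classifier using squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, point)) for row in train_X]
    return train_y[distances.index(min(distances))]


def cross_validate(X, y, k=10, seed=1):
    """Return one accuracy score per fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        # Train on the k-1 folds that are not held out.
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Test on the held-out fold only.
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical, well-separated two-class data: 20 points per class.
X = [[0.1 * i, 0.0] for i in range(20)] + [[10.0 + 0.1 * i, 5.0] for i in range(20)]
y = ["A"] * 20 + ["B"] * 20

scores = cross_validate(X, y, k=10)
mean_accuracy = sum(scores) / len(scores)
print("Accuracy per fold:", scores)
print("Mean accuracy:", mean_accuracy)
```

Each sample is used for testing exactly once and for training in the other nine rounds, so the mean of the ten scores estimates how the model would perform on unseen data.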

Example 1 Code for developing the ML classification model for the iris data

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# shape
print(dataset.shape)
# head
print(dataset.head(20))
# descriptions
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
# histograms
dataset.hist()
pyplot.show()
# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()

# Separate out a validation dataset.
# Set-up the test harness to use 10-fold cross validation.
# Build multiple different models to predict species from flower measurements
# Select the best model.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

# Let's test 6 different algorithms:
#
# Logistic Regression (LR)
# Linear Discriminant Analysis (LDA)
# K-Nearest Neighbors (KNN).
# Classification and Regression Trees (CART).
# Gaussian Naive Bayes (NB).
# Support Vector Machines (SVM).

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
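The scores printed at the end of the example can all be derived from the confusion matrix. The following is a minimal sketch of that arithmetic in plain Python, assuming the scikit-learn convention that rows are the true classes and columns are the predicted classes. The function names and the confusion matrix values are hypothetical and are not the output of the example above.

```python
def per_class_scores(confusion, labels):
    """Compute (precision, recall, f1, support) for each class.

    confusion[i][j] counts the samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        support = sum(confusion[k])                        # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical confusion matrix for 30 validation samples (illustration only).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
confusion = [
    [11, 0, 0],
    [0, 12, 1],
    [0, 0, 6],
]

print("accuracy: %.3f" % accuracy(confusion))
for label, (precision, recall, f1, support) in per_class_scores(confusion, labels).items():
    print("%-16s precision=%.3f recall=%.3f f1=%.3f support=%d"
          % (label, precision, recall, f1, support))
```

In this made-up matrix one versicolor flower is predicted as virginica, which lowers the recall of versicolor (12 of 13 found) and the precision of virginica (6 of 7 predictions correct), while the setosa scores stay perfect.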

Question 1 From the example, can you explain the 10-fold cross-validation method? Can you name other validation methods used in ML analysis?

Question 2 Please explain the differences between the six chosen ML algorithms in simple language (there is no need to go deep into the details of the equations). Could you also suggest other algorithms that might be applicable to this iris dataset?

Question 3 The example used several metrics (accuracy, precision, recall, f1-score, and support) to evaluate and compare the performance of different algorithms on the same dataset. Please explain in simple terms how each of these scores is calculated.

Question 4 Suppose we have a good prediction program that lets users submit new sepal and petal width and length measurements. To turn the program into a graphical user interface (GUI) or an application (app), we must begin with the GUI design, which can be drafted on paper. Please show your design for this application or GUI. It can be drawn with any painting or drawing program.