Asst. Prof. Teerasak E-kobon
15 Feb 2021
The previous chapter introduced an example of using machine learning (ML) in bioinformatics. This chapter discusses the different machine learning algorithms available to bioinformaticists. ML algorithms can be grouped into four types.
1) Supervised learning requires training data together with the associated outputs (labels) during learning. Examples include decision trees, random forests, and logistic regression. Supervised learning is frequently used for classification and regression.
2) Unsupervised learning does not require pre-labelled training data and extracts useful patterns from the input data alone. It includes clustering (for example K-means clustering), association, and dimensionality reduction (for example principal component analysis, or PCA). Note that K-nearest neighbour (KNN) and discriminant analysis, although sometimes listed with these methods, need labelled data and are therefore supervised methods.
3) Semi-supervised learning combines supervised and unsupervised components, typically by learning from a small amount of labelled data together with a larger amount of unlabelled data.
4) Reinforcement learning is the process of training over a period of time within a specific environment until an optimal solution is reached. The solution can be adapted if the environment or its conditions change.
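The difference between the first two types can be seen in a minimal sketch. It assumes scikit-learn is installed and uses synthetic points generated only for illustration (they are not part of the tutorial dataset). The supervised model receives the labels, while the clustering model receives the inputs only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)  # labels, used only by the supervised model

# Supervised learning: the model is fitted on the inputs AND their known outputs.
classifier = LogisticRegression().fit(X, y)
supervised_accuracy = classifier.score(X, y)

# Unsupervised learning: the model sees only the inputs and groups them itself.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_

print("Supervised accuracy on the training data:", supervised_accuracy)
print("Cluster sizes found without labels:", np.bincount(cluster_ids))
```

The cluster numbers returned by K-means are arbitrary identifiers, so they should not be expected to match the label values 0 and 1.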
Several Python packages are available for ML analysis, e.g., numpy, scipy, pandas, and scikit-learn. This chapter uses the pima-indians-diabetes dataset for the tutorial. The first example shows how to import the dataset and perform basic data handling and visualization. If some modules are missing, you can install them with the pip command in the command-line terminal (cmd). Visualizing the data helps you understand its nature and select suitable preprocessing methods if required.
Example 1 How to use the numpy, pandas, and matplotlib modules for data import and visualization. The nine columns of the dataset are 1. Number of times pregnant, 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test, 3. Diastolic blood pressure (mm Hg), 4. Triceps skin fold thickness (mm), 5. 2-Hour serum insulin (mu U/ml), 6. Body mass index (weight in kg/(height in m)^2), 7. Diabetes pedigree function, 8. Age (years), and 9. Class variable (0 or 1).
# import the data with numpy
from numpy import loadtxt
path = r"C:\Users\TeerasakArt\Downloads\pima-indians-diabetes-2.csv"
data = loadtxt(path, delimiter=",")
print(data.shape)
print(data[:3])

# import the data with pandas
from pandas import read_csv
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.head(50))
# check data types
print(data.dtypes)
# print basic statistical analysis
print(data.describe())
# check data balance in the class column
count_class = data.groupby('class').size()
print(count_class)

# plotting histogram
from matplotlib import pyplot
data.hist()
pyplot.show()
# density plot
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
# box plot
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

# plotting a correlation matrix
import numpy
# the variables path and names are reused in Examples 2 and 3
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0, 9, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()
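If the CSV file is not yet available on your computer, the pandas calls from Example 1 can be tried on a few in-memory rows first. The sketch below is only an illustration: the four rows are made up to follow the nine-column layout and are not taken from the real dataset.

```python
from io import StringIO
from pandas import read_csv

# Made-up rows in the same nine-column layout (NOT taken from the real dataset).
csv_text = """2,120,70,25,80,28.5,0.35,30,0
5,160,82,33,150,35.1,0.60,45,1
1,95,64,20,60,22.3,0.20,23,0
7,180,90,40,200,38.9,0.85,52,1
"""
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# StringIO lets read_csv treat the text as if it were a file on disk.
data = read_csv(StringIO(csv_text), names=headernames)

print(data.shape)    # number of rows and columns
print(data.dtypes)   # data type inferred for each column
count_class = data.groupby('class').size()
print(count_class)   # number of rows in each class
```

Replacing StringIO(csv_text) with the path to the real file gives the same calls as in Example 1.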
Question 1 From the pima-indians-diabetes-2 dataset, please specify the data type for each of these nine columns.
After visualizing the raw data, data preparation is required to make the data suitable for the ML algorithms. Data preparation (pre-processing) methods include scaling (rescaling each attribute to a common range such as 0 to 1), normalization (rescaling each row so that its norm, or length, is 1), binarization (converting values to 0 or 1 according to a specified threshold), and standardization (transforming each attribute to a mean of 0 and a standard deviation of 1, which suits attributes with a Gaussian distribution). The second example shows the process of data preparation.
Example 2 Methods for the data preparation
# the variables path and names are defined in Example 1
from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler
dataframe = read_csv(path, names=names)
array = dataframe.values

# scale the data to the range of 0 and 1
data_scaler = MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)
set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])

# L1 normalization (each row is divided by the sum of its absolute values)
data_normalizer = Normalizer(norm='l1').fit(array)
data_normalized = data_normalizer.transform(array)
set_printoptions(precision=2)
print("\nL1 normalized data:\n", data_normalized[0:3])

# L2 normalization (each row is divided by the square root of its sum of squares)
data_normalizer = Normalizer(norm='l2').fit(array)
data_normalized = data_normalizer.transform(array)
print("\nL2 normalized data:\n", data_normalized[0:3])

# binarization (values above the threshold become 1, the others 0)
binarizer = Binarizer(threshold=0.5).fit(array)
data_binarized = binarizer.transform(array)
print("\nBinary data:\n", data_binarized[0:5])

# standardization (each column gets mean 0 and standard deviation 1)
data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)
print("\nStandardized data:\n", data_rescaled[0:5])
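To see what each transformer in Example 2 computes, the same four operations can be written out by hand with numpy. This is a sketch on a tiny made-up matrix, not on the tutorial dataset, and the threshold of 2.5 is chosen only for illustration.

```python
import numpy as np

# Tiny made-up matrix: 3 samples (rows) x 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Min-max scaling: each COLUMN is mapped linearly onto the range 0 to 1.
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# L1 and L2 normalization: each ROW is divided by its own norm.
l1_normalized = X / np.abs(X).sum(axis=1, keepdims=True)
l2_normalized = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# Binarization: values above the threshold become 1, the others 0.
threshold = 2.5
binarized = (X > threshold).astype(float)

# Standardization: each COLUMN is shifted to mean 0 and scaled to standard deviation 1.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print("Min-max scaled:\n", min_max)
print("L1 normalized:\n", l1_normalized)
print("L2 normalized:\n", l2_normalized)
print("Binarized:\n", binarized)
print("Standardized:\n", standardized)
```

Scaling and standardization work column by column, whereas normalization works row by row, so they answer different needs and are not interchangeable.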
Question 2 From the second example, does the pima-indians-diabetes data need to be pre-processed? Please explain your reasons.
The next step after data preparation is to select the data features used for training the ML model, which is important for its performance. Good features reduce overfitting, increase the model accuracy, and lower the training time. Several methods can be used for feature selection. Univariate selection uses a statistical test to select the features that have the strongest relationship with the output variable. Recursive feature elimination (RFE) repeatedly removes the weakest features and rebuilds the model from the remaining ones. Principal component analysis (PCA) is a data reduction method that uses linear algebra to transform the dataset into a compressed form; the number of principal components in the output can be chosen. Feature importance scores from a tree-based ensemble, such as extra trees, can also be used to rank the features. These methods are shown in the third example.
Example 3 Code that shows different feature selection methods
# the variables path and names are defined in Example 1
from numpy import set_printoptions
from pandas import read_csv
dataframe = read_csv(path, names=names)
array = dataframe.values
# separate input and output arrays
X = array[:,0:8]
Y = array[:,8]

# univariate selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# select the four best features with the chi-squared test
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# data summary
set_printoptions(precision=2)
print(fit.scores_)
featured_data = fit.transform(X)
print("\nFeatured data:\n", featured_data[0:4])

# recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of Features: ", fit.n_features_)
print("Selected Features: ", fit.support_)
print("Feature Ranking: ", fit.ranking_)

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: ", fit.explained_variance_ratio_)
print(fit.components_)

# feature importance
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
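The selection methods in Example 3 print their results as plain arrays, so it helps to map the results back to column names. The sketch below shows one possible way to do this for univariate selection, using the get_support() method of SelectKBest. The data are synthetic and the column names (noise_a, signal, noise_b) are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data (illustrative only): 'signal' depends on the class, the others do not.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
X = pd.DataFrame({
    'noise_a': rng.uniform(0, 1, size=100),
    'signal': 10 * y + rng.uniform(0, 1, size=100),
    'noise_b': rng.uniform(0, 1, size=100),
})

# Keep the single feature with the highest chi-squared score.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected_names = list(X.columns[selector.get_support()])
scores_by_name = dict(zip(X.columns, selector.scores_))

print("Selected:", selected_names)
print("Scores:", scores_by_name)
```

The same idea applies to the other methods: the support_ and ranking_ arrays of RFE and the feature_importances_ array of the tree ensemble follow the column order of X.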
Question 3 After trying the different feature selection methods on the pima-indians-diabetes dataset, which three features are the best for developing the ML model for this dataset? Please explain.
Once the features have been selected, the ML model is developed by choosing a suitable ML algorithm. For the pima-indians-diabetes dataset, a supervised learning algorithm, the classification decision tree, is chosen; it splits the dataset repeatedly using different conditions on the features. The algorithm first defines a cost function to score candidate binary splits, then evaluates all candidate splits and keeps the one with the lowest cost. The tree is then built from the root node by creating child nodes from the split data, recursively, until terminal (leaf) nodes are reached. The code is shown in the fourth example.
Example 4 Code for building the decision tree model based on the pima-indians-diabetes dataset
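The cost function step can be illustrated with a small sketch. It assumes the Gini index as the cost function, which is the default criterion of the DecisionTreeClassifier in scikit-learn. The function names and the sample values are made up for illustration, and the sketch tests only the observed values as thresholds, which is a simplification of what a library implementation does.

```python
def gini_impurity(labels):
    """Gini impurity of one group of class labels: 1 - sum(p_k ** 2)."""
    if len(labels) == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        proportion = labels.count(cls) / len(labels)
        impurity -= proportion ** 2
    return impurity


def split_cost(values, labels, threshold):
    """Size-weighted Gini impurity of the two groups made by value < threshold."""
    left = [lab for val, lab in zip(values, labels) if val < threshold]
    right = [lab for val, lab in zip(values, labels) if val >= threshold]
    total = len(labels)
    return (len(left) / total) * gini_impurity(left) + (len(right) / total) * gini_impurity(right)


def best_split(values, labels):
    """Try every observed value as a threshold and return the cheapest (threshold, cost)."""
    candidates = [(t, split_cost(values, labels, t)) for t in sorted(set(values))]
    return min(candidates, key=lambda pair: pair[1])


# Made-up glucose-like readings and class labels, for illustration only.
glucose = [85, 90, 100, 140, 150, 160]
label = [0, 0, 0, 1, 1, 1]
threshold, cost = best_split(glucose, label)
print("Best threshold:", threshold, "cost:", cost)
```

A cost of 0 means both groups are pure, that is, each group contains a single class. Building the tree repeats this search inside each child group.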
# build the decision tree model
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\Users\TeerasakArt\Downloads\pima-indians-diabetes-2.csv", header=None, names=col_names)
print(pima.head())

# split the dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]  # features
y = pima.label          # target variable

# divide the data into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# train the model
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# make a prediction
y_pred = clf.predict(X_test)

# calculate confusion matrix, classification report, and accuracy score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
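The numbers reported in Example 4 are all derived from the confusion matrix. The following sketch shows how accuracy, precision, and recall follow from its four counts. The true and predicted labels are made up for illustration and are not results from the tutorial dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted positives, how many are correct
recall = tp / (tp + fn)                     # of the actual positives, how many are found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)

# The library functions compute the same quantities.
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

When the classes are unbalanced, accuracy alone can look good even if one class is rarely predicted correctly, so precision and recall should be read alongside it.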
Question 4 From the fourth example, did the ML model give good prediction results? How do you know?
Question 5 Staying with supervised learning, if you do not use the decision tree algorithm, please modify the given code to use other supervised learning algorithms and compare their performance.