# Classification parameter optimization

# 1 General

**Type:**- Matrix Processing**Heading:**- Learning**Source code:**not public.

# 2 Brief description

The cross-validation performance is monitored as a function of one or two parameters of the classification or feature selection algorithms.

# 3 Parameters

## 3.1 Items are in

It specifies if the items that should be used for the cross-validation or the prediction can be found in “Columns” or “Rows” (default: Columns).

### 3.1.1 Classes

Selected categorical row or column that contains the class of the items (default: first categorical row/column in the matrix). If items are in columns then the classes are in a categorical row, and if items are in rows the classes are in a categorical column.

### 3.1.2 Sub-classes

This parameter is just relevant, if the parameter “Items are in” is set to “Columns”. It specifies whether sub-classes should be taken into consideration for the classification process (default: `<None>`

).

## 3.2 Classification algorithm

Defines the algorithm that should be used for the classification (default: Support vector machine). The algorithm can be selected from a predefined list:

- Support vector machine
- Fisher LDA
- KNN

### 3.2.1 Kernel

This parameter is just relevant, if the parameter “Classification algorithm” is set to “Support vector machine”. It defines the kernel function that is used to classify items (default: linear). The kernel function can be selected from a predefined list:

Depending on the chosen function 1 to 4 parameters must be specified.

#### 3.2.1.1 Sigma

This parameter is just relevant, if “Kernel” is set to “RBF”. It defines the slope of the function (see formula above, default: 1).

#### 3.2.1.2 Degree

This parameter is just relevant, if “Kernel” is set to “Polynomial”. It defines the degree of the polynom (see formula above, default: 3).

#### 3.2.1.3 Gamma

This parameter is just relevant, if “Kernel” is set to “Polynomial” or “Sigmoid”. It defines the slope of the function (see formula above, default: 0.01).

#### 3.2.1.4 Coef

This parameter is just relevant, if “Kernel” is set to “Polynomial” or “Sigmoid”. It defines a constant (see formula above, default: 0).

#### 3.2.1.5 C

This parameter is just relevant, if the parameter “Classification algorithm” is set to “Support vector machine”. C is a penalty constant (default: 10). Large C corresponds to large penalties for misclassification and resembles a hard margin classifier.

### 3.2.2 Distance

This parameter is just relevant, if the parameter “Classification algorithm” is set to “KNN”. It defines the selected distance that will be used to assign the nearest neighbours to an item and therefore classify it (default: Euclidean). The distance can be selected from a predefined list:

- Euclidean
- L1
- Maximum
- Lp
- Pearson correlation
- Spearman correlation
- Cosine
- Canberra

### 3.2.3 Number of neighbours

This parameter is just relevant, if the parameter “Classification algorithm” is set to “KNN”. It specifies the number of closest neighbours that are taken into account for the classification of an item (default: 5).

### 3.2.4 Feature selection

Defines whether feature selection should be applied by ranking and reducing the features before the classification process (default: None).

### 3.2.5 Feature ranking method

This parameter is just relevant, if the parameter “Feature selection” is set to “From feature ranking”. It specifies which features method will be used to rank the features (default: ANOVA). The method can be selected from a predefined list:

- ANOVA
- Hybrid SVM
- MANOVA
- One-sided t-test
- Two-way ANOVA
- SVM
- RFE-SVM
- Golub

Depending on the ranking method up to 4 parameters can be specified.

#### 3.2.5.1 S0

This parameter is just relevant, if the parameter “Feature ranking method” is set to “ANOVA”, “Hybrid SVM”, “One-sided t-test” or “MANOVA”. It defines the artificial within groups variance and controls the relative importance of resulted test p-values and difference between means (default: 0). At s0=0 only the p-value matters, while at nonzero s0 also the difference of means plays a role. See^{1} for details.

#### 3.2.5.2 C

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Hybrid SVM”, “SVM” or “RFE-SVM”. C is a penalty constant (default: 100). Large C corresponds to large penalties for misclassification and resembles a hard margin classifier.

#### 3.2.5.3 Reduction factor

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Hybrid SVM” or “RFE-SVM”. It defines the factor by what the number of features will be reduced step by step during the ranking process (default: 1.414).

#### 3.2.5.4 Number of top ANOVA features

This parameter is just relevant, if the parameter “Feature ranking method” is set to “MANOVA”. It defines how many of the selected features are top ANOVA features.

#### 3.2.5.5 Side

This parameter is just relevant, if the parameter “Feature ranking method” is set to “One-sided t-test”. It defines the “Left” or “Right” side, where the null hypothesis can be rejected (default: Right).

#### 3.2.5.6 Orthogonal grouping

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Two-way ANOVA”. It defines the grouping of the data according to a given categorical column or row to distinguish the effects of the groups.

#### 3.2.5.7 Min. orthogonal p-value

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Two-way ANOVA”. Test results above this p-value are defined as orthogonal (default: 0).

#### 3.2.5.8 Min. interaction p-value

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Two-way ANOVA”. Test results above this p-value are defined as interacting, hence the effects of one group do not depend on the other group (default: 0).

#### 3.2.5.9 Skip if orthog. P-value is better

This parameter is just relevant, if the parameter “Feature ranking method” is set to “Two-way ANOVA”. It defines whether features with an orthogonal p-value better than the given value in “Min. interaction p-value” are filtered out (default: unchecked).

#### 3.2.5.10 Number of features

Defines how many features should be selected (default: 100).

#### 3.2.5.11 Group-wise feature sel.

If checked, for each defined group in the data a different amount of features can be selected, which are then used for the classification (default: unchecked). The numbers can be defined either by typing in the text field in the form [Group,number] or by using the Edit button.

## 3.3 Parameter scan type

Defines whether a “One-dimensional scan” or a “Two-dimensional scan” should be applied (default: One-dimensional scan). A “One-dimensional scan” just scans through different parameter values of one parameter, whereas in a “Two-dimensional scan” two parameters can be chosen for which different values will be tested to find the optimal combination.

### 3.3.1 Parameter

Selected parameter, for which different values should be used to figure out the best setting for the classification problem. The parameter can be selected from a predefined list including the parameters of the selected classification algorithm as well as the parameters of the selected feature ranking method (default: first parameter available of “Classification algorithm” and “Feature selection”).

### 3.3.2 Number of values

Defines the number of values that should be tested (default: 5).

### 3.3.3 Step size

Defines the size by which the parameter should be increased step by step (default: 1). The operation how to increase the parameter values can be defined in “Step type”.

### 3.3.4 Step type

Specifies whether the value defined in “Step size” is added to the parameter value to calculate the new value for the parameter, or if the values are multiplied with each other to get the new parameter value (default: Additive).

## 3.4 Cross-validation type

This parameter is just relevant, if the parameter “Cross-validate assigned items” is checked. It defines the type of cross-validation that should be applied to the data set (default: n-fold). The type can be selected from a predefined list:

**Leave one out**: As many predictors are built as there are items in the data set. Thus for each predictor one item is left out to train the model and the predictor will be evaluated using the left out item. In the end the average prediction performance will be returned.**n-fold:**The items of the data set are split into n equally sized chunks. n predictors will be generated. In each of these prediction models the union of n-1 of these chunks are taken as the training set and the remaining chunk is the test set. In the end the average prediction performance will be returned.**Random sampling**: The number of predictors is specified by the “Number of repeats” parameter. The number of items taken out to form the test set (and not used for building the predictor) is specified by the “Test set percentage” parameter. In the end the average prediction performance will be returned.

Depending on the cross-validation type 0 to 2 parameters have to specified:

### 3.4.1 n

This parameter is just relevant, if the parameter “Cross-validation type” is set to “n-fold”. It defines the number of partitions the data is divided into (default: 4).

### 3.4.2 Test set percentage

This parameter is just relevant, if the parameter “Cross-validation type” is set to “Random sampling”. It specifies the percentage of the data that is used for testing the trained model (default: 15). The remaining data is used for the training process.

### 3.4.3 Number of repeats

This parameter is just relevant, if the parameter “Cross-validation type” is set to “Random sampling”. It specifies how often the cross-validation process is repeated (default: 250). In every round the data is again divided according to the previously defined percentage.

## 3.5 Number of threads

Defines the number of threads that should be used for the process (default: 1). The number of threads is limited by number of available cores of the machine Perseus in running on.

# 4 Parameter window

# 5 Theoretical background

## 5.1 Support vector machines

Support vector machines (*SVMs*) were largely developed in the 1990s by Vapnik and co-workers on a basis of a separable bipartition problem at the AT & T Bell Laboratories (see^{2}. *SVMs* are a family of data analysis algorithms, based on convex quadratic programming, whose successful use has been demonstrated in classification, regression and clustering problems. Thus, *SVMs* are now the state-of-the-art tools for non-linear input-output knowledge. The following section covers a brief and basic description of *SVMs*, but detailed explanations can be found in V. N. Vapniks^{3}, N. Cristianinis and J. Shawe-Taylors^{4}, V. N. Vapniks^{5}, V. N. Vapniks^{6}, B. E. Bosers, I. M. Guyons, and V. N. Vapniks^{7}.

*SVMs* are a particular class of supervised learning methods that are well suited for analyses of data in high-dimensional feature spaces. They are computationally efficient and capable of detecting biologically-relevant signals. *SVMs* revolve around the notion of a *margin* - either side of a data separating linear decision boundary (*hyperplane*). Maximizing this *margin* and thereby creating the largest distance between two classes as well as between the *hyperplane* and the instances on either side, is the main task in training *SVMs* (see figure below). Thus, these models have a binary nature to separate classes, but can be extended to multi-class problems by reducing the problem to a set of multiple binary classification problems. The *hyperplane* is defined by:

where \(ω\) is the weights vector and \(b\) is a bias value (or \(−b\) the threshold).

In case an optimal separating *hyperplane* is found, data points on the *margin* are known as *support vectors* and the solution is a linear combination of them (red data points in figure below). Each new data point is then classified according to its optimal position relative to the model’s *hyperplane*. So the model complexity is unaffected by the number of features encountered in the training data, therefore *SVMs* are well suited to deal with learning tasks with a large number of features compared to the number of data points. In case no *hyperplane* can be found, the problem can be addressed using the so-called *soft margin*. The *margin* optimization constraints can be relaxed by allowing some misclassifications or *margin* violations in the training set, to get better generalization of the *SVM* than using a *hard margin*. The choice of appropriate penalties is mandatory:

where \(\omega\) is the weights vector, \(b\) is a bias value, \(C\) is a penalty constant, and \(\xi\) is a slack variable, which is the orthogonal distance between a data point and the *hyperplane*. Large *C* correspond to large penalties for misclassification and resemble a *hard margin* classifier, whereas \(\xi\) measures the degree of misclassification or *margin* violation. This is a good way to deal with outliers in the data set without destroying the model by tailoring it perfectly to the input data.

Nevertheless, most real-world data sets involve separation problems that are linearly non-separable, which requires the definition of complex functions to build a good classifier. *SVMs* use kernels, a special class of functions to deal with such situations. Mapping the data points to a higher-dimensional space (transformed feature space) using kernels, enables the definition of a linear *hyperplane*, which results in a non-linear *hyperplane* in the original space. The *hyperplanes* in the higher-dimensional space are represented by all points defining a set, whose inner product with a vector is constant in that space. Training the classifier depends only on the data through dot products, which are possible to compute even at a high-dimension at low cost by applying the so-called *kernel trick*. The trick lies in working in an higher-dimensional space, without ever explicitly transforming the original data points into that space, but instead relying on algorithms that only need to compute inner products within that space. These algorithms are identical to kernels and can thus be cheaply computed in the original space. So, everything about linear cases can also be applied to non-linear ones using an appropriate kernel function. It is common practice to find the best suiting function by cross-validation. Some popular kernels, which are all included in Perseus, are:

where \(x\) and \(y\) are two data points, \(\gamma\) is the slope, \(d\) is the degree of the polynom, and \(r\) is a constant.

For more information you can also consult Wikipedia.

## 5.2 Fisher’s linear discriminant analysis

Linear Discriminant Analysis (LDA), is a well-known classification technique that has been used successfully in many statistical pattern recognition problems. It was developed by R.A. Fisher, a professor of statistics at University College London, and is sometimes called Fisher Discriminant Analysis (FDA). Its first description was in 1936 and can be found in^{8}.

The primary purpose of *LDA* is to separate samples of two or multiple distinct groups while preserving as much of the class discriminatory information as possible to classify new unseen instances. The approach of the *LDA* is to project all the data points into new space, normally of lower dimension, which maximizes the between-class separability while minimizing their within-class variability. So the goal is to find the best projection axes for separating the classes. In general the number of axes that can be computed by the *LDA* method is one less than the number of classes in the problem.

For more information you can also consult Wikipedia.

## 5.3 k-nearest neighbors

K-Nearest Neighbors (*kNN*) is a simple *lazy learner* algorithm that stores all available data points and classifies new instances based on a similarity measure (e.g., distance functions). It corresponds to the group of supervised learning algorithms and has been used in statistical estimation and pattern recognition already in the beginning of 1970’s as a non-parametric technique. During the training phase the algorithm simply stores the data points including their class labels and all computation is deferred until the classification process. So *kNN* is based on the principle that instances that are in close proximity to another have similar properties. Thus, to classify new unclassified instances, one simply has to look at their k-nearest neighbors, to figure out the classification label. The class membership can be defined by a majority vote of the *k* closest neighbors or the neighbors can be ranked and weighted according to their distance to the new instance. A common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.

For more information you can also consult Wikipedia.

## References

*Proceedings of the National Academy of Sciences*

**98**, 5116–5121 (2001).

*Informatica (Slovenia)*

**31**, 249–268 (2007).

*The nature of statistical learning theory*. (Springer New York, 2000). doi:10.1007/978-1-4757-3264-1.

*Statistical learning theory*. (Wiley, 1998).

*IEEE Transactions on Neural Networks*

**10**, 988–999 (1999).

*Proceedings of the fifth annual workshop on Computational learning theory*(1992) doi:10.1145/130385.130401.

*Annals of Eugenics*

**7**, 179–188 (1936).