/*******************************************************************************
 * Copyright (c) 2010 Haifeng Li
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *******************************************************************************/

/**
 * Classification algorithms. In machine learning and pattern recognition,
 * classification refers to an algorithmic procedure for assigning a given
 * input object to one of a given number of categories. The input object
 * is formally termed an instance, and the categories are termed classes.
 * <p>
 * The instance is usually described by a vector of features, which together
 * constitute a description of all known characteristics of the instance.
 * Typically, features are either categorical (also known as nominal, i.e.
 * consisting of one of a set of unordered items, such as a gender of "male"
 * or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting
 * of one of a set of ordered items, e.g. "large", "medium" or "small"),
 * integer-valued (e.g. a count of the number of occurrences of a particular
 * word in an email) or real-valued (e.g. a measurement of blood pressure).
 * <p>
 * Classification normally refers to a supervised procedure, i.e. a procedure
 * that produces an inferred function to predict the output value of new
 * instances based on a training set of pairs consisting of an input object
 * and a desired output value. The inferred function is called a classifier
 * if the output is discrete or a regression function if the output is
 * continuous.
 * <p>
 * The inferred function should predict the correct output value for any valid
 * input object. This requires the learning algorithm to generalize from the
 * training data to unseen situations in a "reasonable" way.
 * <p>
 * A wide range of supervised learning algorithms is available, each with its
 * strengths and weaknesses. There is no single learning algorithm that works
 * best on all supervised learning problems. The most widely used learning
 * algorithms are AdaBoost and gradient boosting, support vector machines,
 * linear regression, linear discriminant analysis, logistic regression,
 * naive Bayes, decision trees, the k-nearest neighbor algorithm, and neural
 * networks (multilayer perceptron).
 * <p>
 * If the feature vectors include features of many different kinds (discrete,
 * discrete ordered, counts, continuous values), some algorithms cannot be
 * easily applied. Many algorithms, including linear regression, logistic
 * regression, neural networks, and nearest neighbor methods, require that
 * the input features be numerical and scaled to similar ranges (e.g., to
 * the [-1,1] interval). Methods that employ a distance function, such as
 * nearest neighbor methods and support vector machines with Gaussian kernels,
 * are particularly sensitive to this. An advantage of decision trees (and
 * boosting algorithms based on decision trees) is that they easily handle
 * heterogeneous data.
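 * <p>
 * For example, before applying a distance-based method, each feature can be
 * rescaled to the [-1,1] interval mentioned above. The sketch below is purely
 * illustrative and not part of this package's API; it assumes a dense
 * {@code double[][]} design matrix whose rows are instances and whose columns
 * are features:
 * <pre>{@code
 * // Rescales each feature (column) of x to the [-1, 1] interval in place.
 * static void rescale(double[][] x) {
 *     int p = x[0].length;
 *     for (int j = 0; j < p; j++) {
 *         double min = Double.POSITIVE_INFINITY;
 *         double max = Double.NEGATIVE_INFINITY;
 *         for (double[] row : x) {
 *             min = Math.min(min, row[j]);
 *             max = Math.max(max, row[j]);
 *         }
 *         if (max == min) continue; // constant feature, leave unchanged
 *         for (double[] row : x) {
 *             row[j] = 2.0 * (row[j] - min) / (max - min) - 1.0;
 *         }
 *     }
 * }
 * }</pre>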
 * <p>
 * If the input features contain redundant information (e.g., highly correlated
 * features), some learning algorithms (e.g., linear regression, logistic
 * regression, and distance-based methods) will perform poorly because of
 * numerical instabilities. These problems can often be solved by imposing
 * some form of regularization.
 * <p>
 * If each of the features makes an independent contribution to the output,
 * then algorithms based on linear functions (e.g., linear regression,
 * logistic regression, linear support vector machines, naive Bayes) generally
 * perform well. However, if there are complex interactions among features,
 * then algorithms such as nonlinear support vector machines, decision trees
 * and neural networks work better. Linear methods can also be applied, but
 * the engineer must manually specify the interactions when using them.
 * <p>
 * There are several major issues to consider in supervised learning:
 * <dl>
 * <dt>Features</dt>
 * <dd>The accuracy of the inferred function depends strongly on how the input
 * object is represented. Typically, the input object is transformed into
 * a feature vector, which contains a number of features that are descriptive
 * of the object. The number of features should not be too large, because of
 * the curse of dimensionality, but the features should carry enough
 * information to accurately predict the output.
 * <p>
 * There are many algorithms for feature selection that seek to identify
 * the relevant features and discard the irrelevant ones. More generally,
 * dimensionality reduction may seek to map the input data into a lower
 * dimensional space prior to running the supervised learning algorithm.</dd>
 * <dt>Over-fitting</dt>
 * <dd>Over-fitting occurs when a statistical model describes random error
 * or noise instead of the underlying relationship. Over-fitting generally
 * occurs when a model is excessively complex, such as having too many
 * parameters relative to the number of observations. A model which has
 * been over-fit will generally have poor predictive performance, as it can
 * exaggerate minor fluctuations in the data.
 * <p>
 * The potential for over-fitting depends not only on the number of parameters
 * and the amount of data but also on the conformability of the model structure
 * with the data shape, and the magnitude of model error compared to the
 * expected level of noise or error in the data.
 * <p>
 * In order to avoid over-fitting, it is necessary to use additional techniques
 * (e.g. cross-validation, regularization, early stopping, pruning, Bayesian
 * priors on parameters, or model comparison) that can indicate when further
 * training is not resulting in better generalization. The basis of some
 * techniques is either (1) to explicitly penalize overly complex models,
 * or (2) to test the model's ability to generalize by evaluating its
 * performance on a set of data not used for training, which is assumed to
 * approximate the typical unseen data that a model will encounter.</dd>
 * <dt>Regularization</dt>
 * <dd>Regularization involves introducing additional information in order
 * to solve an ill-posed problem or to prevent over-fitting. This information
 * is usually in the form of a penalty for complexity, such as restrictions
 * for smoothness or bounds on the vector space norm.
 * <p>
 * A theoretical justification for regularization is that it attempts to impose
 * Occam's razor on the solution. From a Bayesian point of view, many
 * regularization techniques correspond to imposing certain prior distributions
 * on model parameters.</dd>
 * <dt>Bias-variance tradeoff</dt>
 * <dd>Mean squared error (MSE) can be broken down into two components:
 * variance and squared bias, known as the bias-variance decomposition.
 * Thus, in order to minimize the MSE, we need to minimize both the bias and
 * the variance. However, this is not trivial: decreasing the bias (e.g., by
 * making the model more flexible) typically increases the variance, and vice
 * versa. Hence there is a tradeoff between bias and variance.</dd>
 * </dl>
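 * <p>
 * Cross-validation, listed above among the guards against over-fitting, can be
 * illustrated with a plain k-fold loop. The sketch below is illustrative only
 * and not part of this package's API; the {@code Model} and {@code Trainer}
 * interfaces are hypothetical stand-ins for any learning algorithm:
 * <pre>{@code
 * interface Model {
 *     int predict(double[] x);
 * }
 *
 * interface Trainer {
 *     Model train(double[][] x, int[] y);
 * }
 *
 * // Estimates out-of-sample accuracy; every instance is held out exactly once.
 * static double crossValidate(Trainer trainer, double[][] x, int[] y, int k) {
 *     int n = x.length;
 *     int correct = 0;
 *     for (int fold = 0; fold < k; fold++) {
 *         int testSize = 0;
 *         for (int i = 0; i < n; i++) {
 *             if (i % k == fold) testSize++;
 *         }
 *         // Assemble the training set from all instances outside the fold.
 *         double[][] trainX = new double[n - testSize][];
 *         int[] trainY = new int[n - testSize];
 *         int t = 0;
 *         for (int i = 0; i < n; i++) {
 *             if (i % k != fold) {
 *                 trainX[t] = x[i];
 *                 trainY[t] = y[i];
 *                 t++;
 *             }
 *         }
 *         Model model = trainer.train(trainX, trainY);
 *         // Score the model on the held-out fold.
 *         for (int i = 0; i < n; i++) {
 *             if (i % k == fold && model.predict(x[i]) == y[i]) correct++;
 *         }
 *     }
 *     return (double) correct / n; // fraction of held-out instances predicted correctly
 * }
 * }</pre>
 * Training accuracy that is much higher than this cross-validated accuracy is
 * a typical symptom of over-fitting.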
 *
 * @author Haifeng Li
 */
package smile.classification;