/**
 * <p>
 * The classify package provides facilities for training classifiers.
 * In this package, data points are viewed as single instances, not sequences.
 * The most commonly used classifier is the softmax log-linear classifier with binary features.
 * Other classifiers, such as SVM and Naive Bayes, are also available in this package.
 * <p>The {@code Classifier} contract only guarantees routines for getting a classification for an example,
 * and the scores assigned to each class for that example.
 * <b>Note</b> that training is dependent upon the individual classifier.
 * <p>Classifiers operate over {@code Datum} objects. A {@code Datum} is a list of descriptive features together
 * with a class label; features and labels can be any object, but usually {@code String}s are used. A {@code Datum}
 * can store either only categorical features (common in NLP) or features with real values. The latter is referred
 * to in this package as an {@code RVFDatum} (real-valued feature datum).
 * {@code Datum} objects are grouped into {@code Dataset} objects, which some classifiers take as training input.
 * <p>Following is a set of examples outlining how to create, train, and use each of the different classifier types.
 *
 * <h3>Linear Classifiers</h3>
 * <p>To build a classifier, one first creates a {@code GeneralDataset}, which is a list of {@code Datum} objects:
 * <p><pre>
 * GeneralDataset dataSet = new Dataset();
 * while (more datums to make) {
 *   ... make featureList: e.g., ["PrevWord=at", "CurrentTag=NNP", "isUpperCase"]
 *   ... make label: e.g., "PLACE"
 *   Datum d = new BasicDatum(featureList, label);
 *   dataSet.add(d);
 * }
 * </pre>
 * <p>There are some useful methods in {@code GeneralDataset}, such as:
 * <p><pre>
 * dataSet.applyFeatureCountThreshold(int cutoff);
 * dataSet.summaryStatistics(); // dumps the number of features and datums
 * </pre>
 * <p>Next, one makes a {@code LinearClassifierFactory} and calls its {@code trainClassifier(GeneralDataset dataSet)} method:
 * <p><pre>
 * LinearClassifierFactory lcFactory = new LinearClassifierFactory();
 * LinearClassifier c = lcFactory.trainClassifier(dataSet);
 * </pre>
 * <p>{@code LinearClassifierFactory} has options for different optimizers (default: {@code QNMinimizer}), the
 * convergence threshold for minimization, etc. Check the class description for detailed information.
 * <p>A classifier, once built, can be used to classify new {@code Datum} instances:
 * <p><pre>
 * Object label = c.classOf(mysteryDatum);
 * </pre>
 * <p>If you want scores instead, you can ask:
 * <p><pre>
 * Counter scores = c.scoresOf(mysteryDatum);
 * </pre>
 * <p>The scores returned by the log-linear classifiers are the feature-weight dot products for each class,
 * not normalized probabilities.
 * <p>Other useful methods include {@code justificationOf(Datum d)} and {@code logProbabilityOf(Datum d)},
 * as well as various methods for visualizing the weights and the most highly weighted features.
 * This concludes the discussion of log-linear classifiers with binary features.
 * <p>We can also train log-linear classifiers with real-valued features. In this case,
 * {@code RVFDatum} should be used.
 *
 * <h3>Real Valued Classifiers</h3>
 * <p>Real-valued feature (RVF) classifiers operate over {@code RVFDatum} objects. An {@code RVFDatum} is composed
 * of a set of feature and real-value pairs. {@code RVFDatum} objects are grouped using an {@code RVFDataset}.
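 * <p>As a minimal end-to-end sketch of this workflow (the feature names, feature values, and labels below are
 * invented purely for illustration; {@code ClassicCounter}, from {@code edu.stanford.nlp.stats}, is used as the
 * concrete {@code Counter} implementation):
 * <p><pre>
 * // assemble a small training set of real-valued datums
 * RVFDataset trainData = new RVFDataset();
 * Counter feats = new ClassicCounter();
 * feats.incrementCount("LENGTH", 4.0);         // hypothetical real-valued features
 * feats.incrementCount("VOWEL_RATIO", 0.5);
 * trainData.add(new RVFDatum(feats, "SHORT")); // "SHORT" is an invented label
 *
 * // ... add more RVFDatum objects in the same way ...
 *
 * // training works exactly as for a binary-feature Dataset
 * LinearClassifierFactory factory = new LinearClassifierFactory();
 * LinearClassifier rvfClassifier = factory.trainClassifier(trainData);
 *
 * // classify a new real-valued datum
 * Counter testFeats = new ClassicCounter();
 * testFeats.incrementCount("LENGTH", 11.0);
 * testFeats.incrementCount("VOWEL_RATIO", 0.36);
 * Object predicted = rvfClassifier.classOf(new RVFDatum(testFeats, null));
 * </pre>
 * <p>The individual pieces of this sketch are described below.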
 * <p>To assemble an {@code RVFDatum}, create a {@code Counter} of feature values and assign an {@code Object}
 * label to it:
 * <pre>
 * Counter features = new ClassicCounter();
 * features.incrementCount("FEATURE_A", 1.2);
 * features.incrementCount("FEATURE_B", 2.3);
 * features.incrementCount("FEATURE_C", 0.5);
 * RVFDatum rvfDatum = new RVFDatum(features, "DATUM_LABEL");
 * </pre>
 * <p>
 * An {@code RVFDataset} efficiently stores a collection of {@code RVFDatum} objects with which to train the
 * classifier. This type of dataset only accepts {@code RVFDatum} objects via its {@code add} method
 * ({@code Datum} objects that are not instances of {@code RVFDatum} are ignored), and it is equivalent to a
 * {@code Dataset} if all of its {@code RVFDatum} objects have only features with value 1.0. Since it is a subclass
 * of {@code GeneralDataset}, the {@code GeneralDataset} methods shown above can also be applied to an
 * {@code RVFDataset}.
 * <p>
 * (TODO) An example for LinearType2Classifier.
 * <p>
 * (TODO) Saving a classifier out to a file (from {@code LearningExperiment}):
 * <pre>
 * private static void saveClassifierToFile(LinearClassifier classifier, String serializePath) {
 *   log.info("Serializing classifier to " + serializePath + "...");
 *   try {
 *     ObjectOutputStream oos;
 *     if (serializePath.endsWith(".gz")) {
 *       oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(serializePath))));
 *     } else {
 *       oos = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(serializePath)));
 *     }
 *     oos.writeObject(classifier);
 *     oos.close();
 *     log.info("done.");
 *   } catch (Exception e) {
 *     e.printStackTrace();
 *     throw new RuntimeException("Serialization failed: " + e.getMessage());
 *   }
 * }
 * </pre>
 * <p>Alternatively, if your features are {@code String}s and you wish to serialize to a human-readable text file,
 * you can use {@code saveToFilename} in {@code LinearClassifier} and reconstitute using {@code loadFromFilename}
 * in {@code LinearClassifierFactory}. Though the format is not as compact as a serialized object,
 * and it implicitly presumes the features are {@code String}s, this is useful for debugging purposes.
 *
 * @author Dan Klein
 * @author Eric Yeh
 */
package edu.stanford.nlp.classify;