/**
* <p>
* The classify package provides facilities for training classifiers.
* In this package, data points are viewed as single instances, not sequences.
* The most commonly used classifier is the softmax log-linear classifier with binary features.
 * Other classifiers, such as SVMs and Naive Bayes, are also available in this package.
* <p>The {@code Classifier} contract only guarantees routines for getting a classification for an example,
* and the scores assigned to each class for that example.
* <b>Note</b> that training is dependent upon the individual classifier.
 * <p>Classifiers operate over {@code Datum} objects. A {@code Datum} is a list of descriptive features together with
 * a class label; features and labels can be any object, but usually {@code String}s are used. A {@code Datum} can store
 * only categorical features (common in NLP), or it can store features with real values. The latter is referred to in
 * this package as an {@code RVFDatum} (real-valued feature datum).
 * {@code Datum} objects are grouped using {@code Dataset} objects, which many classifiers take as training input.
* <p>Following is a set of examples outlining how to create, train, and use each of the different classifier types.
*
* <h3>Linear Classifiers</h3>
 * <p>To build a classifier, one first creates a {@code GeneralDataset}, which is a list of {@code Datum} objects.
 * As described above, a {@code Datum} is a list of descriptive features along with a label; features and labels can
 * be any object, though we usually use strings.
 * <p><pre>
 * GeneralDataset dataSet = new Dataset();
 * while (more datums to make) {
 *   ... make featureList: e.g., ["PrevWord=at", "CurrentTag=NNP", "isUpperCase"]
 *   ... make label: e.g., "PLACE"
 *   Datum d = new BasicDatum(featureList, label);
 *   dataSet.add(d);
 * }
 * </pre>
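 * <p>For instance, one concrete iteration of that loop might look like the following
 * (the feature and label strings are purely illustrative):
 * <p><pre>
 * List featureList = Arrays.asList("PrevWord=at", "CurrentTag=NNP", "isUpperCase");
 * Datum d = new BasicDatum(featureList, "PLACE");
 * dataSet.add(d);
 * </pre>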
* <p>There are some useful methods in {@code GeneralDataset} such as:
* <p><pre>
 * dataSet.applyFeatureCountThreshold(cutoff); // discards features seen fewer than cutoff times
 * dataSet.summaryStatistics(); // prints the number of features and datums
* </pre>
* <p>Next, one makes a {@code LinearClassifierFactory} and calls its {@code trainClassifier(GeneralDataset dataSet)} method:
* <p><pre>
* LinearClassifierFactory lcFactory = new LinearClassifierFactory();
* LinearClassifier c = lcFactory.trainClassifier(dataSet);
* </pre>
 * <p>{@code LinearClassifierFactory} has options for choosing among different optimizers (the default is {@code QNMinimizer}), setting the convergence threshold for minimization, and more. Check the class description for detailed information.
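 * <p>For instance, a minimal sketch of configuring these options before training
 * (the setter names {@code setTol} and {@code setSigma} are assumptions; check the
 * class for the exact API):
 * <p><pre>
 * lcFactory.setTol(1e-4);  // convergence threshold for the minimizer (assumed setter)
 * lcFactory.setSigma(1.0); // strength of the Gaussian prior (assumed setter)
 * LinearClassifier c = lcFactory.trainClassifier(dataSet);
 * </pre>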
* <p>A classifier, once built, can be used to classify new {@code Datum} instances:
* <p><pre>
* Object label = c.classOf(mysteryDatum);
* </pre>
* If you want scores instead, you can ask:
* <p><pre>
* Counter scores = c.scoresOf(mysteryDatum);
* </pre>
 * <p>The scores returned by the log-linear classifiers are the feature-weight
 * dot products, not normalized probabilities.
 * <p>There are other useful methods such as {@code justificationOf(Datum d)} and
 * {@code logProbabilityOf(Datum d)}, as well as various methods for visualizing the
 * weights and the most highly weighted features.
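 * <p>For example, a short sketch of these methods (here {@code probabilityOf} is
 * assumed, alongside the methods above, to return a {@code Counter} over the labels):
 * <p><pre>
 * Counter probs = c.probabilityOf(mysteryDatum);       // normalized class probabilities
 * Counter logProbs = c.logProbabilityOf(mysteryDatum); // their logs
 * c.justificationOf(mysteryDatum);                     // prints a per-feature breakdown
 * </pre>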
 * This concludes the overview of log-linear classifiers with binary features.
* <p>We can also train log-linear classifiers with real-valued features. In this case,
* {@code RVFDatum} should be used.
*
 * <h3>Real-Valued Classifiers</h3>
 * <p>Real-valued feature (RVF) classifiers operate over {@code RVFDatum} objects. An {@code RVFDatum} is composed of a set of feature/real-value pairs. {@code RVFDatum} objects are grouped using an {@code RVFDataset}.
 * <p>To assemble an {@code RVFDatum}, fill a {@code Counter} with feature-value pairs and assign an {@code Object} label to it:
* <pre>
 * Counter features = new ClassicCounter();
* features.incrementCount("FEATURE_A", 1.2);
* features.incrementCount("FEATURE_B", 2.3);
* features.incrementCount("FEATURE_C", 0.5);
* RVFDatum rvfDatum = new RVFDatum(features, "DATUM_LABEL");
* </pre>
* <p>
* {@code RVFDataset} objects are representations of {@code RVFDatum} objects that efficiently store
* the data with which to train the classifier. This type of dataset only accepts {@code RVFDatum} objects via its add
* method (other {@code Datum} objects that are not instances of {@code RVFDatum} will be ignored), and is equivalent to a {@code Dataset}
* if all {@code RVFDatum} objects have only features with value 1.0. Since it is a subclass of {@code GeneralDataset},
* the methods shown above as applied to the {@code GeneralDataset} can also be applied to the {@code RVFDataset}.
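 * <p>Continuing the example above, a sketch of grouping {@code RVFDatum} objects and
 * training on them (this relies on {@code RVFDataset} being usable wherever a
 * {@code GeneralDataset} is expected, as stated above):
 * <p><pre>
 * RVFDataset rvfDataset = new RVFDataset();
 * rvfDataset.add(rvfDatum); // only RVFDatum instances are accepted
 * LinearClassifier rvfClassifier = new LinearClassifierFactory().trainClassifier(rvfDataset);
 * </pre>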
* <p>
* (TODO) An example for LinearType2Classifier.
* <p>
 * Saving a classifier out to a file (from {@code LearningExperiment}):
 * <pre>
 * private static void saveClassifierToFile(LinearClassifier classifier, String serializePath) {
 *   log.info("Serializing classifier to " + serializePath + "...");
 *   try {
 *     ObjectOutputStream oos;
 *     if (serializePath.endsWith(".gz")) {
 *       oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(serializePath))));
 *     } else {
 *       oos = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(serializePath)));
 *     }
 *     oos.writeObject(classifier);
 *     oos.close();
 *     log.info("done.");
 *   } catch (Exception e) {
 *     throw new RuntimeException("Serialization failed: " + e.getMessage(), e);
 *   }
 * }
 * </pre>
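 * <p>A matching loader might look like the following (a sketch that simply reverses
 * the serialization above; it is not taken from {@code LearningExperiment}):
 * <pre>
 * private static LinearClassifier loadClassifierFromFile(String serializePath) {
 *   try {
 *     ObjectInputStream ois;
 *     if (serializePath.endsWith(".gz")) {
 *       ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(new FileInputStream(serializePath))));
 *     } else {
 *       ois = new ObjectInputStream(new BufferedInputStream(new FileInputStream(serializePath)));
 *     }
 *     LinearClassifier classifier = (LinearClassifier) ois.readObject();
 *     ois.close();
 *     return classifier;
 *   } catch (Exception e) {
 *     throw new RuntimeException("Deserialization failed: " + e.getMessage(), e);
 *   }
 * }
 * </pre>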
 * <p>Alternatively, if your features are Strings and you wish to serialize to a human-readable text file,
 * you can use {@code saveToFilename} in {@code LinearClassifier} and reconstitute the classifier using
 * {@code loadFromFilename} in {@code LinearClassifierFactory}. Though the format is not as compact as a
 * serialized object, and implicitly presumes the features are Strings, this is useful for debugging purposes.
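 * <p>A sketch of that round trip (assuming {@code loadFromFilename} is a static factory
 * method; the file name is illustrative):
 * <p><pre>
 * c.saveToFilename("myClassifier.txt");
 * LinearClassifier c2 = LinearClassifierFactory.loadFromFilename("myClassifier.txt");
 * </pre>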
*
* @author Dan Klein
* @author Eric Yeh
*/
package edu.stanford.nlp.classify;