package-info.java example

Explorer
CoreNLP-master
/**
 * <p>
 * A Maximum Entropy Part-of-Speech Tagger.  It can run either a Conditional
 * Markov Model (CMM) aka Maximum Entropy Markov Model (MEMM) tagger or a
 * cyclic dependency network tagger.
 * </p>
 *
 * <p>If you are only interested in using one of the trained taggers
 * included in the distribution, either from the commandline or via the
 * Java API, look at the documentation of the class
 * {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.</p>
 *
 * <p>Or, if you are interested in training a tagger from data using
 * some of the built-in architecture options (CMM or bi-directional
 * dependency network), also look at the documentation for
 * {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.</p>
 *
 * <p>The rest of this document is for more complex situations where you
 * want to define the features/architecture of your own tagger, which
 * requires delving into the code.</p>
 *
 * <p>The pre-defined features are for CMMs and bi-directional dependency
 * networks. The local models are log-linear models using features
 * specified via feature templates.</p>
 *
 *
 * <h2>Kinds of Templates</h2>
 *
 * <p>There are two kinds of templates: ones for rare words,
 * and ones for common words. For a context centered at a common word, only common
 * word features are active. For a context centered at a rare word, both common
 * and rare word features are active. Which words are considered common and which
 * rare is determined by the parameter GlobalHolder.rareThreshold. For example,
 * a threshold of five means that words occurring five times or less are considered
 * rare.</p>
 *
 * <p>The feature templates represent conditions on the words and tags
 * surrounding the current position and also the target tag. For example
 * <code><t<sub>_-1</sub>,t_<sub>0</sub>></code> is a common word
 * feature template. It is
 * instantiated for various values of the previous and current tag. A feature
 * formed by instantiating this template will look for example like this:
 * <code><t<sub>_-1</sub>=DT,t_<sub>0</sub>=NN></code>
 * and will have value 1 for a history <i>h</i> and tag <i>t</i>
 * iff the condition is satisfied. Every feature
 * template includes a specification of the current tag. It is not possible at the
 * moment to include features that are true if the current tag is one of several
 * possible, e.g. NN or NNS or NNP.</p>
 *
 * <p>To reduce the number
 * of features, cutoffs on the number of times a feature is active are introduced.
 * The cutoff for common word features is GlobalHolder.threshold, and
 * the cutoff for rare word features is GlobalHolder.thresholdRare.
 * The thresholds work like this: the part of the feature that does not include
 * the tag has to be active in the training set at least cutoff+1 times, and the
 * complete feature has to be active at least once in order for the feature to be
 *  included in the model. (Note, the cutoff for the current word feature is set to
 * 2 independent of threshold settings; cf. method
 * {@link edu.stanford.nlp.tagger.maxent.TaggerExperiments#populated(int, int)}.
  * </p>
 *
 * <h2>Training a Tagger</h2>
 *
 * <p >In order to train a tagger, we need to specify the feature
 * templates to be used, change the count cutoffs if we want, change the default
 * parameter estimation method if we want, perhaps hand-specify closed
 * class POS tags, and then train given tagged text.</p>
 *
 * <h3>Specifying Feature Templates</h3>
 *
 * <p >Feature templates inherit from the class {@link edu.stanford.nlp.tagger.maxent.Extractor}.
 * The main job of an Extractor is to extract the value it is interested in from a history.
 * Each instantiating feature for a given template will be true for a specific
 * value extracted from a history and a specific target tag.</p>
 *
 * <p>For example, this is a common word extractor that extracts
 * the current and next word. </p>
 *
 * <pre>
 * /**
 * * This extractor extracts the current and the next word in conjunction.
 * *\/
 * class ExtractorCWordNextWord extends Extractor {
 *
 * private final static String excl="!";
 *
 * public ExtractorCWordNextWord() {}
 *
 *  String extract(History h, PairsHolder pH) {
 * String s = pH.get(h, 0, false) + excl + pH.get(h, 1, false);
 * return s;
 * }
 *
 * }
 * </pre>
 *
 * <p>The method extract(History h) is defined in the base class
 * Extractor as:</p>
 * *
 * <pre>
 * String extract(History h) {
 * return extract(h, GlobalHolder.pairs);
 * }
 * </pre>
 *
 * <p>The PairsHolder contains an array of words and the tags. It
 * has a get method that can be used to extract things from the history. In
 * GlobalHolder.pairs , the whole training data is stored. String
 * GlobalHolder.pairs.get(History
 * <i>h</i>,int <i>position</i>, boolean <i>isTag</i>) , will return the tag or
 * word (depending on <i>isTag</i>), at position <i>position </i>relative to the
 * history <i>h</i>.</p>
 *
 * <p>Using this PairsHolder, we can extract features from the
 * whole sentence including the current word. The History object is basically a
 * specification of the start of the sentence, the current word, and the end of
 *  the sentence. </p>
 *
 * <p>In an extractor we can also specify for which tags to
 * instantiate the template. The method
 * <code>boolean precondition(String tag)</code> is by default true, meaning
 * that a feature can be created for every tag. Sometimes we would like to
 * restrict that, and say that features should be created for only the VB and VBP
 * tags, for example. In this case the method precondition has to be redefined to
 * return false for all other tags.</p>
 *
 * <p>The extractors for common word features have to be placed in
 * the static array ExtractorFrames.eFrames. The present state of this array for
 * the best tagger is:</p>
 * <blockquote><pre>
 * public static Extractor[] eFrames={cWord,prevWord,nextWord,prevTag,nextTag,
 * prevTwoTags,nextTwoTags,prevNextTag,prevTagWord,nextTagWord,cWordPrevWord,
 * cWordNextWord};
 * </pre></blockquote>
 *
 * <p>The extractors for rare word features commonly inherit from
 * RareExtractor, which inherits from Extractor. RareExtractor provides
 * some nice static methods for manipulating
 * strings, such as seeing whether they contain numbers, etc. The rare word
 * extractors have to be placed in the static array ExtractorFramesRare.eFrames.
 * For example, for a good English tagger, this array might be:</p>
 *
 * <blockquote><pre>
 * public static Extractor[] eFrames={cWordUppCase,cWordNumber,
 * cWordDash,cWordSuff1,cWordSuff2,cWordSuff3,cWordSuff4,
 * cAllCap,cMidSentence,cWordStartUCase,cWordMidUCase,
 * cWordPref1,cWordPref2,cWordPref3,cWordPref4,
 * new ExtractorCWordPref(5),new ExtractorCWordPref(6),
 * new ExtractorCWordPref(7), new ExtractorCWordPref(8),
 * new ExtractorCWordPref(9), new ExtractorCWordPref(10),
 * new ExtractorCWordSuff(5),new ExtractorCWordSuff(6),
 * new ExtractorCWordSuff(7),new ExtractorCWordSuff(8),
 * new ExtractorCWordSuff(9), new ExtractorCWordSuff(10),
 * cLetterDigitDash, cCompany,cAllCapitalized,cUpperDigitDash};
 * </pre></blockquote>
 *
 * <p>At present, many of the extractor and rare extractor combinations can
 * be flexibly set from a properties file by suitable specifications of the
 * <code>arch</code> option, whereas others require changing the code.</p>
 *
 * <h3>Specifying closed-class POS tags</h3>
 *
 * <p>
 * By default, all POS tags are assumed to be open classes.  In many cases,
 * it is useful to specify POS tags which are closed class, and can only be
 * applied to words seen in the training data (rather than being possible
 * tags for new words seen at runtime).  These closed class tags are
 * specified for a language (where a "language" is really a
 * (language,tag-set) pair: a different system of tagging is a new
 * language).  You do this by specifying the language in the properties
 * file, and specifying the closed class tags for that language in
 * {@link edu.stanford.nlp.tagger.maxent.TTags}.</p>
 */
package edu.stanford.nlp.tagger.maxent;