/**
 * <p>
 * A Maximum Entropy Part-of-Speech Tagger. It can run either a Conditional
 * Markov Model (CMM) tagger, also known as a Maximum Entropy Markov Model
 * (MEMM) tagger, or a cyclic dependency network tagger.
 * </p>
 *
 * <p>If you are only interested in using one of the trained taggers
 * included in the distribution, either from the command line or via the
 * Java API, look at the documentation of the class
 * {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.</p>
 *
 * <p>Or, if you are interested in training a tagger from data using
 * some of the built-in architecture options (CMM or bi-directional
 * dependency network), also look at the documentation for
 * {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.</p>
 *
 * <p>The rest of this document is for more complex situations where you
 * want to define the features/architecture of your own tagger, which
 * requires delving into the code.</p>
 *
 * <p>The pre-defined features are for CMMs and bi-directional dependency
 * networks. The local models are log-linear models using features
 * specified via feature templates.</p>
 *
 *
 * <h2>Kinds of Templates</h2>
 *
 * <p>There are two kinds of templates: ones for rare words,
 * and ones for common words. For a context centered at a common word, only common
 * word features are active. For a context centered at a rare word, both common
 * and rare word features are active. Which words are considered common and which
 * rare is determined by the parameter GlobalHolder.rareThreshold. For example,
 * a threshold of five means that words occurring five times or fewer are considered
 * rare.</p>
 *
 * <p>The feature templates represent conditions on the words and tags
 * surrounding the current position and also the target tag. For example,
 * <code>&lt;t<sub>-1</sub>,t<sub>0</sub>&gt;</code> is a common word
 * feature template. It is instantiated for various values of the previous and
 * current tag.
 * A feature formed by instantiating this template will look like this, for example:
 * <code>&lt;t<sub>-1</sub>=DT,t<sub>0</sub>=NN&gt;</code>
 * and will have value 1 for a history <i>h</i> and tag <i>t</i>
 * if and only if the condition is satisfied. Every feature
 * template includes a specification of the current tag. It is not possible at the
 * moment to include features that are true if the current tag is one of several
 * possible tags, e.g. NN or NNS or NNP.</p>
 *
 * <p>To reduce the number
 * of features, cutoffs on the number of times a feature is active are introduced.
 * The cutoff for common word features is GlobalHolder.threshold, and
 * the cutoff for rare word features is GlobalHolder.thresholdRare.
 * The thresholds work like this: the part of the feature that does not include
 * the tag has to be active in the training set at least cutoff+1 times, and the
 * complete feature has to be active at least once in order for the feature to be
 * included in the model. (Note that the cutoff for the current word feature is set to
 * 2 independent of threshold settings; cf. method
 * {@link edu.stanford.nlp.tagger.maxent.TaggerExperiments#populated(int, int)}.)
 * </p>
 *
 * <h2>Training a Tagger</h2>
 *
 * <p>In order to train a tagger, we need to specify the feature
 * templates to be used, change the count cutoffs if we want, change the default
 * parameter estimation method if we want, perhaps hand-specify closed-class
 * POS tags, and then train given tagged text.</p>
 *
 * <h3>Specifying Feature Templates</h3>
 *
 * <p>Feature templates inherit from the class {@link edu.stanford.nlp.tagger.maxent.Extractor}.
 * The main job of an Extractor is to extract the value it is interested in from a history.
 * Each feature instantiated from a given template will be true for a specific
 * value extracted from a history and a specific target tag.</p>
 *
 * <p>For example, this is a common word extractor that extracts
 * the current and next word.
 * </p>
 *
 * <pre>
 * /**
 *  * This extractor extracts the current and the next word in conjunction.
 *  *&#47;
 * class ExtractorCWordNextWord extends Extractor {
 *
 *   // Separator placed between the two words in the extracted value.
 *   private static final String excl = "!";
 *
 *   public ExtractorCWordNextWord() {}
 *
 *   String extract(History h, PairsHolder pH) {
 *     // Position 0 is the current word, position 1 the next word;
 *     // isTag=false asks for words rather than tags.
 *     return pH.get(h, 0, false) + excl + pH.get(h, 1, false);
 *   }
 *
 * }
 * </pre>
 *
 * <p>The method extract(History h) is defined in the base class
 * Extractor as:</p>
 *
 *
 * <pre>
 * String extract(History h) {
 *   return extract(h, GlobalHolder.pairs);
 * }
 * </pre>
 *
 * <p>The PairsHolder contains an array of the words and their tags. It
 * has a get method that can be used to extract things from the history.
 * GlobalHolder.pairs stores the whole training data. The call
 * GlobalHolder.pairs.get(History <i>h</i>, int <i>position</i>, boolean <i>isTag</i>)
 * returns a String: the tag or word (depending on <i>isTag</i>) at position
 * <i>position</i> relative to the history <i>h</i>.</p>
 *
 * <p>Using this PairsHolder, we can extract features from the
 * whole sentence including the current word. The History object is basically a
 * specification of the start of the sentence, the current word, and the end of
 * the sentence.</p>
 *
 * <p>In an extractor we can also specify for which tags to
 * instantiate the template. The method
 * <code>boolean precondition(String tag)</code> returns true by default, meaning
 * that a feature can be created for every tag. Sometimes we would like to
 * restrict that, and say that features should be created for only the VB and VBP
 * tags, for example. In this case the method precondition has to be overridden to
 * return false for all other tags.</p>
 *
 * <p>The extractors for common word features have to be placed in
 * the static array ExtractorFrames.eFrames.
 * The present state of this array for
 * the best tagger is:</p>
 * <blockquote><pre>
 * public static Extractor[] eFrames={cWord,prevWord,nextWord,prevTag,nextTag,
 *   prevTwoTags,nextTwoTags,prevNextTag,prevTagWord,nextTagWord,cWordPrevWord,
 *   cWordNextWord};
 * </pre></blockquote>
 *
 * <p>The extractors for rare word features commonly inherit from
 * RareExtractor, which inherits from Extractor. RareExtractor provides
 * static helper methods for inspecting strings, such as checking whether
 * they contain numbers. The rare word
 * extractors have to be placed in the static array ExtractorFramesRare.eFrames.
 * For example, for a good English tagger, this array might be:</p>
 *
 * <blockquote><pre>
 * public static Extractor[] eFrames={cWordUppCase,cWordNumber,
 *   cWordDash,cWordSuff1,cWordSuff2,cWordSuff3,cWordSuff4,
 *   cAllCap,cMidSentence,cWordStartUCase,cWordMidUCase,
 *   cWordPref1,cWordPref2,cWordPref3,cWordPref4,
 *   new ExtractorCWordPref(5),new ExtractorCWordPref(6),
 *   new ExtractorCWordPref(7), new ExtractorCWordPref(8),
 *   new ExtractorCWordPref(9), new ExtractorCWordPref(10),
 *   new ExtractorCWordSuff(5),new ExtractorCWordSuff(6),
 *   new ExtractorCWordSuff(7),new ExtractorCWordSuff(8),
 *   new ExtractorCWordSuff(9), new ExtractorCWordSuff(10),
 *   cLetterDigitDash, cCompany,cAllCapitalized,cUpperDigitDash};
 * </pre></blockquote>
 *
 * <p>At present, many of the extractor and rare extractor combinations can
 * be flexibly set from a properties file by a suitable value of the
 * <code>arch</code> option, whereas others require changing the code.</p>
 *
 * <h3>Specifying Closed-Class POS Tags</h3>
 *
 * <p>
 * By default, all POS tags are assumed to be open classes. In many cases,
 * it is useful to specify POS tags which are closed class, and can only be
 * applied to words seen in the training data (rather than being possible
 * tags for new words seen at runtime).
 * These closed class tags are
 * specified for a language (where a "language" is really a
 * (language,tag-set) pair: a different system of tagging is a new
 * language). You do this by specifying the language in the properties
 * file, and specifying the closed class tags for that language in
 * {@link edu.stanford.nlp.tagger.maxent.TTags}.</p>
 */
package edu.stanford.nlp.tagger.maxent;