/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Text analysis.
 * <p>API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.</p>
 * <h2>Parsing? Tokenization? Analysis!</h2>
 * <p>
 * Lucene, an indexing and search library, accepts only plain text input.
 * </p>
 * <h2>Parsing</h2>
 * <p>
 * Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few.
 * Lucene does not care about the <i>Parsing</i> of these and other document formats, and it is the responsibility of the
 * application using Lucene to use an appropriate <i>Parser</i> to convert the original format into plain text before passing that plain text to Lucene.
 * </p>
 * <h2>Tokenization</h2>
 * <p>
 * Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
 * of breaking input text into small indexing elements – tokens.
 * The way input text is broken into tokens heavily influences how people will then be able to search for that text.
 * For instance, sentence beginnings and endings can be identified to provide for more accurate phrase
 * and proximity searches (though sentence identification is not provided by Lucene).
 * </p>
 * <p>
 * In some cases simply breaking the input text into tokens is not enough
 * – a deeper <i>Analysis</i> may be needed. Lucene includes both
 * pre- and post-tokenization analysis facilities.
 * </p>
 * <p>
 * Pre-tokenization analysis can include (but is not limited to) stripping
 * HTML markup, and transforming or removing text matching arbitrary patterns
 * or sets of fixed strings.
 * </p>
 * <p>
 * There are many post-tokenization steps that can be done, including
 * (but not limited to):
 * </p>
 * <ul>
 *   <li><a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> –
 *       Replacing words with their stems.
 *       For instance, with English stemming "bikes" is replaced with "bike";
 *       a query for "bike" can then find both documents containing "bike" and those containing "bikes".
 *   </li>
 *   <li><a href="http://en.wikipedia.org/wiki/Stop_words">Stop Words Filtering</a> –
 *       Common words like "the", "and" and "a" rarely add any value to a search.
 *       Removing them shrinks the index size and increases performance.
 *       It may also reduce some "noise" and actually improve search quality.
 *   </li>
 *   <li><a href="http://en.wikipedia.org/wiki/Text_normalization">Text Normalization</a> –
 *       Stripping accents and other character markings can make for better searching.
 *   </li>
 *   <li><a href="http://en.wikipedia.org/wiki/Synonym">Synonym Expansion</a> –
 *       Adding in synonyms at the same token position as the current word can mean better
 *       matching when users search with words in the synonym set.
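 *   </li>
 * </ul>
 * <p>
 * Several of these steps can be combined into a single analysis chain. As an
 * illustrative sketch only (the component classes shown live in the
 * analyzers-common module, and their constructor signatures vary between
 * Lucene versions), a chain that lower-cases tokens, removes English stop
 * words, and applies the Porter stemmer could be wired up like this:
 * </p>
 * <pre class="prettyprint">
 * // 'reader' is a java.io.Reader supplying the plain text
 * Tokenizer source = new StandardTokenizer(matchVersion);
 * source.setReader(reader);
 * TokenStream tokens = new LowerCaseFilter(matchVersion, source);                 // normalize case
 * tokens = new StopFilter(matchVersion, tokens, StandardAnalyzer.STOP_WORDS_SET); // drop stop words
 * tokens = new PorterStemFilter(tokens);                                          // English stemming
 * </pre>
 * <p>
 * The building blocks of such chains are described next.
 * </p>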
 * <h2>Core Analysis</h2>
 * <p>
 * The analysis package provides the mechanism to convert Strings and Readers
 * into tokens that can be indexed by Lucene. There are four main classes in
 * the package from which all analysis processes are derived. These are:
 * </p>
 * <ul>
 *   <li>
 *   {@link org.apache.lucene.analysis.Analyzer} – An <code>Analyzer</code> is
 *   responsible for supplying a
 *   {@link org.apache.lucene.analysis.TokenStream} which can be consumed
 *   by the indexing and searching processes. See below for more information
 *   on implementing your own {@link org.apache.lucene.analysis.Analyzer}. Most of the time, you can use
 *   an anonymous subclass of {@link org.apache.lucene.analysis.Analyzer}.
 *   </li>
 *   <li>
 *   {@link org.apache.lucene.analysis.CharFilter} – <code>CharFilter</code> extends
 *   {@link java.io.Reader} to transform the text before it is
 *   tokenized, while providing
 *   corrected character offsets to account for these modifications. This
 *   capability allows highlighting to function over the original text when
 *   indexed tokens are created from <code>CharFilter</code>-modified text with offsets
 *   that are not the same as those in the original text. {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)}
 *   accepts <code>CharFilter</code>s. <code>CharFilter</code>s may
 *   be chained to perform multiple pre-tokenization modifications.
 *   </li>
 *   <li>
 *   {@link org.apache.lucene.analysis.Tokenizer} – A <code>Tokenizer</code> is a
 *   {@link org.apache.lucene.analysis.TokenStream} and is responsible for
 *   breaking up incoming text into tokens. In many cases, an {@link org.apache.lucene.analysis.Analyzer} will
 *   use a {@link org.apache.lucene.analysis.Tokenizer} as the first step in the analysis process. However,
 *   to modify text prior to tokenization, use a {@link org.apache.lucene.analysis.CharFilter} subclass (see
 *   above).
 *   </li>
 *   <li>
 *   {@link org.apache.lucene.analysis.TokenFilter} – A <code>TokenFilter</code> is
 *   a {@link org.apache.lucene.analysis.TokenStream} and is responsible
 *   for modifying tokens that have been created by the <code>Tokenizer</code>. Common
 *   modifications performed by a <code>TokenFilter</code> are: deletion, stemming, synonym
 *   injection, and case folding. Not all <code>Analyzer</code>s require <code>TokenFilter</code>s.
 *   </li>
 * </ul>
 * <h2>Hints, Tips and Traps</h2>
 * <p>
 * The relationship between {@link org.apache.lucene.analysis.Analyzer} and
 * {@link org.apache.lucene.analysis.CharFilter}s,
 * {@link org.apache.lucene.analysis.Tokenizer}s,
 * and {@link org.apache.lucene.analysis.TokenFilter}s is sometimes confusing. To ease
 * this confusion, here are some clarifications:
 * </p>
 * <ul>
 *   <li>
 *   The {@link org.apache.lucene.analysis.Analyzer} is a
 *   <strong>factory</strong> for analysis chains. <code>Analyzer</code>s don't
 *   process text; <code>Analyzer</code>s construct <code>CharFilter</code>s, <code>Tokenizer</code>s, and/or
 *   <code>TokenFilter</code>s that process text. An <code>Analyzer</code> has two tasks:
 *   to produce {@link org.apache.lucene.analysis.TokenStream}s that accept a
 *   reader and produce tokens, and to wrap or otherwise
 *   pre-process {@link java.io.Reader} objects.
 *   </li>
 *   <li>
 *   The {@link org.apache.lucene.analysis.CharFilter} is a subclass of
 *   {@link java.io.Reader} that supports offset tracking.
 *   </li>
 *   <li>The {@link org.apache.lucene.analysis.Tokenizer}
 *   is only responsible for <u>breaking</u> the input text into tokens.
 *   </li>
 *   <li>The {@link org.apache.lucene.analysis.TokenFilter} modifies a
 *   stream of tokens and their contents.
 *   </li>
 *   <li>
 *   {@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream},
 *   but {@link org.apache.lucene.analysis.Analyzer} is not.
 *   </li>
 *   <li>
 *   {@link org.apache.lucene.analysis.Analyzer} is "field aware", but
 *   {@link org.apache.lucene.analysis.Tokenizer} is not. {@link org.apache.lucene.analysis.Analyzer}s may
 *   take a field name into account when constructing the {@link org.apache.lucene.analysis.TokenStream}.
 *   </li>
 * </ul>
 * <p>
 * If you want to use a particular combination of <code>CharFilter</code>s, a
 * <code>Tokenizer</code>, and some <code>TokenFilter</code>s, the simplest thing is often to
 * create an anonymous subclass of {@link org.apache.lucene.analysis.Analyzer}, providing {@link
 * org.apache.lucene.analysis.Analyzer#createComponents(String)} and perhaps also
 * {@link org.apache.lucene.analysis.Analyzer#initReader(String,
 * java.io.Reader)}. However, if you need the same set of components
 * over and over in many places, you can make a subclass of
 * {@link org.apache.lucene.analysis.Analyzer}. In fact, Apache Lucene
 * supplies a large family of <code>Analyzer</code> classes that deliver useful
 * analysis chains. The most common of these is the <a href="{@docRoot}/org/apache/lucene/analysis/standard/StandardAnalyzer.html">StandardAnalyzer</a>.
 * Many applications will have a long and industrious life with nothing more
 * than the <code>StandardAnalyzer</code>. The <a href="{@docRoot}/../analyzers-common/overview-summary.html">analyzers-common</a>
 * library provides many pre-existing analyzers for various languages.
 * The analyzers-common library also allows configuring a custom Analyzer without subclassing, using the
 * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html">CustomAnalyzer</a>
 * class.
 * </p>
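 * <p>
 * As a minimal sketch of such an anonymous subclass (assuming component
 * classes from the analyzers-common module and a <code>matchVersion</code>
 * constant in scope, as in the other examples on this page; constructor
 * signatures vary between Lucene versions):
 * </p>
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new WhitespaceTokenizer(matchVersion);
 *     TokenStream result = new LowerCaseFilter(matchVersion, source);
 *     return new TokenStreamComponents(source, result);
 *   }
 *
 *   {@literal @Override}
 *   protected Reader initReader(String fieldName, Reader reader) {
 *     return new HTMLStripCharFilter(reader); // pre-tokenization CharFilter
 *   }
 * };
 * </pre>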
 * <p>
 * Aside from the <code>StandardAnalyzer</code>,
 * Lucene includes several modules containing analysis components,
 * all under the 'analysis' directory of the distribution. Some of
 * these support particular languages; others integrate external
 * components. The 'common' subdirectory has some noteworthy
 * general-purpose analyzers, including the <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">PerFieldAnalyzerWrapper</a>. Most <code>Analyzer</code>s perform the same operation on all
 * {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different <code>Analyzer</code> with different
 * {@link org.apache.lucene.document.Field}s. There is a great deal of
 * functionality in the analysis area; you should study it carefully to
 * find the pieces you need.
 * </p>
 * <p>
 * Analysis is one of the main causes of slow indexing. Simply put, the more you analyze, the slower the indexing (in most cases).
 * Perhaps your application would be just fine using the simple WhitespaceTokenizer combined with a StopFilter. The benchmark/ library can be useful
 * for testing out the speed of the analysis process.
 * </p>
 * <h2>Invoking the Analyzer</h2>
 * <p>
 * Applications usually do not invoke analysis – Lucene does it
 * for them. Applications construct <code>Analyzer</code>s and pass them into Lucene,
 * as follows:
 * </p>
 * <ul>
 *   <li>
 *   At indexing, as a consequence of
 *   {@link org.apache.lucene.index.IndexWriter#addDocument(Iterable) addDocument(doc)},
 *   the <code>Analyzer</code> in effect for indexing is invoked for each indexed field of the added document.
 *   </li>
 *   <li>
 *   At search, a <code>QueryParser</code> may invoke the Analyzer during parsing. Note that for some queries, analysis does not
 *   take place, e.g. wildcard queries.
 *   </li>
 * </ul>
 * <p>
 * However, an application might invoke analysis of any text for testing or for any other purpose, something like:
 * </p>
 * <PRE class="prettyprint" id="analysis-workflow">
 * Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
 * Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
 * TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
 * // The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
 * //   and pass the resulting Reader to the Tokenizer.
 * OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
 *
 * try {
 *   ts.reset(); // Resets this stream to the beginning. (Required)
 *   while (ts.incrementToken()) {
 *     // Use {@link org.apache.lucene.util.AttributeSource#reflectAsString(boolean)}
 *     // for token stream debugging.
 *     System.out.println("token: " + ts.reflectAsString(true));
 *
 *     System.out.println("token start offset: " + offsetAtt.startOffset());
 *     System.out.println("  token end offset: " + offsetAtt.endOffset());
 *   }
 *   ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
 * } finally {
 *   ts.close(); // Release resources associated with this stream.
 * }
 * </PRE>
 * <h2>Indexing Analysis vs. Search Analysis</h2>
 * <p>
 * Selecting the "correct" analyzer is crucial
 * for search quality, and can also affect indexing and search performance.
 * The "correct" analyzer for your application will depend on what your input text
 * looks like and what problem you are trying to solve.
 * The Lucene wiki page
 * <a href="http://wiki.apache.org/lucene-java/AnalysisParalysis">AnalysisParalysis</a>
 * provides some data on "analyzing your analyzer".
 * Here are some rules of thumb:
 * </p>
 * <ol>
 *   <li>Test test test... (did we say test?)</li>
 *   <li>Beware of too much analysis – it might hurt indexing performance.</li>
 *   <li>Start with the same analyzer for indexing and search, otherwise searches would not find what they are supposed to...</li>
 *   <li>In some cases a different analyzer is required for indexing and search, for instance:
 *     <ul>
 *       <li>Certain searches require more stop words to be filtered. (i.e. more than those that were filtered at indexing.)</li>
 *       <li>Query expansion by synonyms, acronyms, auto spell correction, etc.</li>
 *     </ul>
 *     This might sometimes require a modified analyzer – see the next section on how to do that.
 *   </li>
 * </ol>
 * <h2>Implementing your own Analyzer and Analysis Components</h2>
 * <p>
 * Creating your own Analyzer is straightforward. Your Analyzer should subclass {@link org.apache.lucene.analysis.Analyzer}. It can use
 * existing analysis components — CharFilter(s) <i>(optional)</i>, a
 * Tokenizer, and TokenFilter(s) <i>(optional)</i> — or components you
 * create, or a combination of existing and newly created components.
 * Before
 * pursuing this approach, you may find it worthwhile to explore the
 * <a href="{@docRoot}/../analyzers-common/overview-summary.html">analyzers-common</a> library and/or ask on the
 * <a href="http://lucene.apache.org/core/discussion.html">java-user@lucene.apache.org mailing list</a> first to see if what you
 * need already exists. If you are still committed to creating your own
 * Analyzer, have a look at the source code of any one of the many samples
 * located in this package.
 * </p>
 * <p>
 * The following sections discuss some aspects of implementing your own analyzer.
 * </p>
 * <h3>Field Section Boundaries</h3>
 * <p>
 * When {@link org.apache.lucene.document.Document#add(org.apache.lucene.index.IndexableField) document.add(field)}
 * is called multiple times for the same field name, we could say that each such call creates a new
 * section for that field in that document.
 * In fact, a separate call to
 * {@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)}
 * would take place for each of these so-called "sections".
 * However, the default Analyzer behavior is to treat all these sections as one large section.
 * This allows phrase search and proximity search to seamlessly cross
 * boundaries between these "sections".
 * In other words, if a certain field "f" is added like this:
 * </p>
 * <PRE class="prettyprint">
 * document.add(new Field("f","first ends",...));
 * document.add(new Field("f","starts two",...));
 * indexWriter.addDocument(document);
 * </PRE>
 * <p>
 * Then, a phrase search for "ends starts" would find that document.
 * Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections",
 * simply by overriding
 * {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
 * </p>
 * <PRE class="prettyprint">
 * Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
 * // Note: analyzers supplied with Lucene, such as StandardAnalyzer, are final,
 * // so the override goes on your own (possibly anonymous) Analyzer subclass:
 * Analyzer myAnalyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion));
 *   }
 *
 *   {@literal @Override}
 *   public int getPositionIncrementGap(String fieldName) {
 *     return 10;
 *   }
 * };
 * </PRE>
 * <h3>End of Input Cleanup</h3>
 * <p>
 * At the end of each field, Lucene will call {@link org.apache.lucene.analysis.TokenStream#end()}.
 * The components of the token stream (the tokenizer and the token filters) <strong>must</strong>
 * put accurate values into the token attributes to reflect the situation at the end of the field.
 * The Offset attribute must contain the final offset (the total number of characters processed)
 * in both start and end. Attributes like PositionLength must be correct.
 * </p>
 * <p>
 * The base method {@link org.apache.lucene.analysis.TokenStream#end()} sets PositionIncrement to 0, which is required.
 * Components must override this method to fix up any other attributes.
 * </p>
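 * <p>
 * As a minimal sketch, a tokenizer's <code>end()</code> typically looks like
 * the following; <code>charsConsumed</code> stands for an illustrative field
 * tracking how many characters were read from the current input:
 * </p>
 * <pre class="prettyprint">
 * {@literal @Override}
 * public void end() throws IOException {
 *   super.end(); // sets PositionIncrement to 0
 *   int finalOffset = correctOffset(charsConsumed);
 *   offsetAtt.setOffset(finalOffset, finalOffset); // final offset in both start and end
 * }
 * </pre>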
 * <h3>Token Position Increments</h3>
 * <p>
 * By default, TokenStream arranges for the
 * {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#getPositionIncrement() position increment} of all tokens to be one.
 * This means that the position stored for that token in the index would be one more than
 * that of the previous token.
 * Recall that phrase and proximity searches rely on position info.
 * </p>
 * <p>
 * If the selected analyzer filters the stop words "is" and "the", then for a document
 * containing the string "blue is the sky", only the tokens "blue" and "sky" are indexed,
 * with position("sky") = 3 + position("blue"). Now, a phrase query "blue is the sky"
 * would find that document, because the same analyzer filters the same stop words from
 * that query. But the phrase query "blue sky" would not find that document, because the
 * position increment between "blue" and "sky" in that query is only 1.
 * </p>
 * <p>
 * If this behavior does not fit the application's needs, the query parser needs to be
 * configured to not take position increments into account when generating phrase queries.
 * </p>
 * <p>
 * Note that a filter that filters <strong>out</strong> tokens <strong>must</strong> adjust position increments in order not to generate corrupt
 * token stream graphs. Here is the logic used by StopFilter to increment positions when filtering out tokens
 * (written as a TokenFilter so that it shares its attributes with the wrapped stream):
 * </p>
 * <PRE class="prettyprint">
 * public TokenStream tokenStream(final String fieldName, Reader reader) {
 *   final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
 *   TokenStream res = new TokenFilter(ts) {
 *     CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 *     PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
 *
 *     {@literal @Override}
 *     public boolean incrementToken() throws IOException {
 *       int extraIncrement = 0;
 *       while (true) {
 *         boolean hasNext = input.incrementToken();
 *         if (hasNext) {
 *           if (stopWords.contains(termAtt.toString())) {
 *             extraIncrement += posIncrAtt.getPositionIncrement(); // filter this word
 *             continue;
 *           }
 *           if (extraIncrement &gt; 0) {
 *             posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
 *           }
 *         }
 *         return hasNext;
 *       }
 *     }
 *   };
 *   return res;
 * }
 * </PRE>
 * <p>
 * A few more use cases for modifying position increments are:
 * </p>
 * <ol>
 *   <li>Inhibiting phrase and proximity matches across sentence boundaries – for this, a tokenizer that
 *       identifies a new sentence can add 1 to the position increment of the first token of the new sentence.</li>
 *   <li>Injecting synonyms – synonyms of a token should be created at the same position as the
 *       original token, and the output order of the original token and the injected synonym is undefined
 *       as long as they both start at the same position. As a result, all synonyms of a token would be
 *       considered to appear in exactly the same position as that token, and that is how phrase and
 *       proximity searches will see them. For multi-token synonyms to work correctly, you should use
 *       {@code SynonymGraphFilter} at search time only.</li>
 * </ol>
 *
 * <h3>Token Position Length</h3>
 * <p>
 * By default, all tokens created by Analyzers and Tokenizers have a
 * {@link org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#getPositionLength() position length} of one.
 * This means that the token occupies a single position. This attribute is not indexed
 * and thus not taken into account for positional queries, but is used by, e.g., suggesters.
 * </p>
 * <p>
 * The main use case for position lengths is multi-word synonyms. With single-word
 * synonyms, setting the position increment to 0 is enough to denote the fact that two
 * words are synonyms, for example:
 * </p>
 * <table summary="table showing position increments of 1 and 0 for red and magenta, respectively">
 * <tr><td>Term</td><td>red</td><td>magenta</td></tr>
 * <tr><td>Position increment</td><td>1</td><td>0</td></tr>
 * </table>
 * <p>
 * Given that position(magenta) = 0 + position(red), they are at the same position, so anything
 * working with analyzers will return the exact same result if you replace "magenta" with "red"
 * in the input. However, multi-word synonyms are more tricky.
 * Let's say that you want to build
 * a TokenStream where "IBM" is a synonym of "International Business Machines". Position increments
 * are not enough anymore:
 * </p>
 * <table summary="position increments where international is zero">
 * <tr><td>Term</td><td>IBM</td><td>International</td><td>Business</td><td>Machines</td></tr>
 * <tr><td>Position increment</td><td>1</td><td>0</td><td>1</td><td>1</td></tr>
 * </table>
 * <p>
 * The problem with this token stream is that "IBM" is at the same position as "International"
 * although it is a synonym of "International Business Machines" as a whole. Setting
 * the position increment of "Business" and "Machines" to 0 wouldn't help as it would mean
 * that "International" is a synonym of "Business". The only way to solve this issue is to
 * make "IBM" span across 3 positions; this is where position lengths come to the rescue.
 * </p>
 * <table summary="position lengths where IBM is three">
 * <tr><td>Term</td><td>IBM</td><td>International</td><td>Business</td><td>Machines</td></tr>
 * <tr><td>Position increment</td><td>1</td><td>0</td><td>1</td><td>1</td></tr>
 * <tr><td>Position length</td><td>3</td><td>1</td><td>1</td><td>1</td></tr>
 * </table>
 * <p>
 * This new attribute makes clear that "IBM" and "International Business Machines" start and end
 * at the same positions.
 * </p>
 * <a name="corrupt"></a>
 * <h3>How not to write corrupt token streams</h3>
 * <p>
 * There are a few rules to observe when writing custom Tokenizers and TokenFilters:
 * </p>
 * <ul>
 *   <li>The first position increment must be &gt; 0.</li>
 *   <li>Positions must not go backward.</li>
 *   <li>Tokens that have the same start position must have the same start offset.</li>
 *   <li>Tokens that have the same end position (taking into account the
 *       position length) must have the same end offset.</li>
 *   <li>Tokenizers must call {@link
 *       org.apache.lucene.util.AttributeSource#clearAttributes()} in
 *       incrementToken().</li>
 *   <li>Tokenizers must override {@link
 *       org.apache.lucene.analysis.TokenStream#end()}, and pass the final
 *       offset (the total number of input characters processed) to both
 *       parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.</li>
 * </ul>
 * <p>
 * Although these rules might seem easy to follow, problems can quickly happen when chaining
 * badly implemented filters that play with positions and offsets, such as synonym or n-grams
 * filters. Here are good practices for writing correct filters:
 * </p>
 * <ul>
 *   <li>Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.</li>
 *   <li>Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.</li>
 *   <li>When they add tokens, token filters should call {@link org.apache.lucene.util.AttributeSource#clearAttributes()} first.</li>
 *   <li>When they remove tokens, token filters should increment the position increment of the following token.</li>
 *   <li>Token filters should preserve position lengths.</li>
 * </ul>
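 * <p>
 * To make these practices concrete, here is a minimal sketch of a token filter
 * that injects a single-token synonym at the same position as the original
 * token. <code>lookupSynonym()</code> is a hypothetical helper (a real filter
 * would consult a synonym map), and restoring the captured state plays the
 * role of clearing the attributes before the injected token is populated:
 * </p>
 * <pre class="prettyprint">
 * public final class SimpleSynonymFilter extends TokenFilter {
 *   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 *   private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
 *   private AttributeSource.State pendingState; // captured attributes of the original token
 *   private String pendingSynonym;              // synonym still to be emitted, or null
 *
 *   public SimpleSynonymFilter(TokenStream input) {
 *     super(input);
 *   }
 *
 *   {@literal @Override}
 *   public boolean incrementToken() throws IOException {
 *     if (pendingSynonym != null) {
 *       restoreState(pendingState);         // same offsets and position length as the original token
 *       termAtt.setEmpty().append(pendingSynonym);
 *       posIncrAtt.setPositionIncrement(0); // injected token goes to the same position
 *       pendingSynonym = null;
 *       return true;
 *     }
 *     if (!input.incrementToken()) {
 *       return false;
 *     }
 *     pendingSynonym = lookupSynonym(termAtt.toString());
 *     if (pendingSynonym != null) {
 *       pendingState = captureState();
 *     }
 *     return true;
 *   }
 *
 *   // hypothetical lookup, stubbed for the sake of the example
 *   private String lookupSynonym(String term) {
 *     return "fast".equals(term) ? "quick" : null;
 *   }
 *
 *   {@literal @Override}
 *   public void reset() throws IOException {
 *     super.reset();
 *     pendingState = null;
 *     pendingSynonym = null;
 *   }
 * }
 * </pre>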
 * <h2>TokenStream API</h2>
 * <p>
 * "Flexible Indexing" summarizes the effort of making the Lucene indexer
 * pluggable and extensible for custom index formats. A fully customizable
 * indexer means that users will be able to store custom data structures on
 * disk. Therefore the analysis API must transport custom types of
 * data from the documents to the indexer. (It also supports communication
 * amongst the analysis components.)
 * </p>
 * <h3>Attribute and AttributeSource</h3>
 * <p>
 * Classes {@link org.apache.lucene.util.Attribute} and
 * {@link org.apache.lucene.util.AttributeSource} serve as the basis upon which
 * the analysis elements of "Flexible Indexing" are implemented. An Attribute
 * holds a particular piece of information about a text token. For example,
 * {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}
 * contains the term text of a token, and
 * {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} contains
 * the start and end character offsets of a token. An AttributeSource is a
 * collection of Attributes with a restriction: there may be only one instance
 * of each attribute type. TokenStream now extends AttributeSource, which means
 * that one can add Attributes to a TokenStream. Since TokenFilter extends
 * TokenStream, all filters are also AttributeSources.
 * </p>
 * <p>
 * Lucene provides eight Attributes out of the box:
 * </p>
 * <table rules="all" frame="box" cellpadding="3" summary="common bundled attributes">
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}</td>
 *     <td>
 *       The term text of a token. Implements {@link java.lang.CharSequence}
 *       (providing methods length() and charAt(), and allowing e.g. for direct
 *       use with regular expression {@link java.util.regex.Matcher}s) and
 *       {@link java.lang.Appendable} (allowing the term text to be appended to.)
 *     </td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}</td>
 *     <td>The start and end offset of a token in characters.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}</td>
 *     <td>See above for detailed information about position increment.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute}</td>
 *     <td>The number of positions occupied by a token.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}</td>
 *     <td>The payload that a Token can optionally have.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}</td>
 *     <td>The type of the token. Default is 'word'.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}</td>
 *     <td>Optional flags a token can have.</td>
 *   </tr>
 *   <tr>
 *     <td>{@link org.apache.lucene.analysis.tokenattributes.KeywordAttribute}</td>
 *     <td>
 *       Keyword-aware TokenStreams/-Filters skip modification of tokens that
 *       return true from this attribute's isKeyword() method.
 *     </td>
 *   </tr>
 * </table>
 * <h3>More Requirements for Analysis Component Classes</h3>
 * Due to the historical development of the API, there are some perhaps
 * less than obvious requirements for implementing analysis component
 * classes.
 * <h4 id="analysis-lifetime">Token Stream Lifetime</h4>
 * The code fragment of the <a href="#analysis-workflow">analysis workflow
 * protocol</a> above shows a token stream being obtained, used, and then
 * left for garbage collection. However, that does not mean that the components of
 * that token stream will, in fact, be discarded. The default is just the
 * opposite. {@link org.apache.lucene.analysis.Analyzer} applies a reuse
 * strategy to the tokenizer and the token filters. It will reuse
 * them. For each new input, it calls {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)}
 * to set the input. Your components must be prepared for this scenario,
 * as described below.
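 * <p>
 * As an illustration, here is a minimal sketch of a reuse-safe tokenizer (an
 * illustrative class, not part of Lucene) that emits runs of letters as tokens.
 * All per-stream state lives in fields that <code>reset()</code> re-initializes,
 * so the same instance can tokenize one input after another; the individual
 * requirements it follows are spelled out in the subsections below:
 * </p>
 * <pre class="prettyprint">
 * public final class LetterRunTokenizer extends Tokenizer {
 *   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 *   private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
 *   private int pos = 0; // characters consumed from the current input
 *
 *   {@literal @Override}
 *   public boolean incrementToken() throws IOException {
 *     clearAttributes();
 *     int c;
 *     while ((c = input.read()) != -1 &amp;&amp; !Character.isLetter(c)) {
 *       pos++; // skip and count non-letter characters
 *     }
 *     if (c == -1) {
 *       return false;
 *     }
 *     int start = pos;
 *     do {
 *       termAtt.append((char) c);
 *       pos++;
 *     } while ((c = input.read()) != -1 &amp;&amp; Character.isLetter(c));
 *     if (c != -1) {
 *       pos++; // count the non-letter character that ended the token
 *     }
 *     offsetAtt.setOffset(correctOffset(start), correctOffset(start + termAtt.length()));
 *     return true;
 *   }
 *
 *   {@literal @Override}
 *   public void end() throws IOException {
 *     super.end(); // required; also sets PositionIncrement to 0
 *     int finalOffset = correctOffset(pos);
 *     offsetAtt.setOffset(finalOffset, finalOffset); // final offset in both start and end
 *   }
 *
 *   {@literal @Override}
 *   public void reset() throws IOException {
 *     super.reset(); // required: prepares the reader passed in via setReader()
 *     pos = 0;       // re-initialize all per-stream state
 *   }
 * }
 * </pre>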
 * <h4>Tokenizer</h4>
 * <ul>
 *   <li>
 *   You should create your tokenizer class by extending {@link org.apache.lucene.analysis.Tokenizer}.
 *   </li>
 *   <li>
 *   Your tokenizer <strong>must</strong> override {@link org.apache.lucene.analysis.TokenStream#end()}.
 *   Your implementation <strong>must</strong> call
 *   <code>super.end()</code>. It must set a correct final offset into
 *   the offset attribute, and finish up any other attributes to reflect
 *   the end of the stream.
 *   </li>
 *   <li>
 *   If your tokenizer overrides {@link org.apache.lucene.analysis.TokenStream#reset()}
 *   or {@link org.apache.lucene.analysis.TokenStream#close()}, it
 *   <strong>must</strong> call the corresponding superclass method.
 *   </li>
 * </ul>
 * <h4>Token Filter</h4>
 * You should create your token filter class by extending {@link org.apache.lucene.analysis.TokenFilter}.
 * If your token filter overrides {@link org.apache.lucene.analysis.TokenStream#reset()},
 * {@link org.apache.lucene.analysis.TokenStream#end()}
 * or {@link org.apache.lucene.analysis.TokenStream#close()}, it
 * <strong>must</strong> call the corresponding superclass method.
 * <h4>Creating delegates</h4>
 * Forwarding classes (those which extend {@link org.apache.lucene.analysis.Tokenizer} but delegate
 * selected logic to another tokenizer) must also set the reader to the delegate in the overridden
 * {@link org.apache.lucene.analysis.Tokenizer#reset()} method, e.g.:
 * <pre class="prettyprint">
 * public class ForwardingTokenizer extends Tokenizer {
 *   private Tokenizer delegate;
 *   ...
 *   {@literal @Override}
 *   public void reset() throws IOException {
 *     super.reset();
 *     delegate.setReader(this.input);
 *     delegate.reset();
 *   }
 * }
 * </pre>
 * <h3>Testing Your Analysis Component</h3>
 * <p>
 * The lucene-test-framework component defines
 * <a href="{@docRoot}/../test-framework/org/apache/lucene/analysis/BaseTokenStreamTestCase.html">BaseTokenStreamTestCase</a>. By extending
 * this class, you can create JUnit tests that validate that your
 * Analyzer and/or analysis components correctly implement the
 * protocol. The checkRandomData methods of that class are particularly effective in flushing out errors.
 * </p>
 * <h3>Using the TokenStream API</h3>
 * There are a few important things to know in order to use the TokenStream API efficiently; they are summarized here. You may want
 * to walk through the example below first and come back to this section afterwards.
 * <ol>
 *   <li>
 *   Please keep in mind that an AttributeSource can only have one instance of a particular Attribute. Furthermore, if
 *   a chain of a TokenStream and multiple TokenFilters is used, then all TokenFilters in that chain share the Attributes
 *   with the TokenStream.
 *   </li>
 *   <li>
 *   Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update
 *   the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the
 *   Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream
 *   was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in
 *   the Attribute instances.
 *   </li>
 *   <li>
 *   For performance reasons a TokenStream/-Filter should add/get Attributes during instantiation, i.e., create an attribute in the
 *   constructor and store references to it in an instance variable. Using an instance variable instead of calling addAttribute()/getAttribute()
 *   in incrementToken() will avoid attribute lookups for every token in the document.
 *   </li>
 *   <li>
 *   All methods in AttributeSource are idempotent, which means calling them multiple times always yields the same
 *   result. This is especially important to know for addAttribute(). The method takes the <b>type</b> (<code>Class</code>)
 *   of an Attribute as an argument and returns an <b>instance</b>. If an Attribute of the same type was previously added,
 *   then the already existing instance is returned; otherwise a new instance is created and returned. Therefore TokenStreams/-Filters
 *   can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should
 *   normally call addAttribute() instead of getAttribute(), because it would not fail if the TokenStream does not have this
 *   Attribute (getAttribute() would throw an IllegalArgumentException if the Attribute is missing). More advanced code
 *   could simply check with hasAttribute() whether a TokenStream has a given Attribute, and may conditionally leave out processing for
 *   extra performance.
 *   </li>
 * </ol>
 * <h3>Example</h3>
 * <p>
 * In this example we will create a WhitespaceTokenizer and use a LengthFilter to suppress all words that have
 * two or fewer characters. The LengthFilter is part of the Lucene core and its implementation will be explained
 * here to illustrate the usage of the TokenStream API.
 * </p>
 * <p>
 * Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain, the
 * PartOfSpeechTaggingFilter, which utilizes the new custom attribute.
 * </p>
 * <h4>Whitespace tokenization</h4>
 * <pre class="prettyprint">
 * public class MyAnalyzer extends Analyzer {
 *
 *   private Version matchVersion;
 *
 *   public MyAnalyzer(Version matchVersion) {
 *     this.matchVersion = matchVersion;
 *   }
 *
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion));
 *   }
 *
 *   public static void main(String[] args) throws IOException {
 *     // text to tokenize
 *     final String text = "This is a demo of the new TokenStream API";
 *
 *     Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
 *     MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
 *     TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
 *
 *     // get the CharTermAttribute from the TokenStream
 *     CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
 *
 *     try {
 *       stream.reset();
 *
 *       // print all tokens until stream is exhausted
 *       while (stream.incrementToken()) {
 *         System.out.println(termAtt.toString());
 *       }
 *
 *       stream.end();
 *     } finally {
 *       stream.close();
 *     }
 *   }
 * }
 * </pre>
 * In this simple example, plain whitespace tokenization is performed. In main() a loop consumes the stream and
 * prints the term text of the tokens by accessing the CharTermAttribute that the WhitespaceTokenizer provides.
 * Here is the output:
 * <pre>
 * This
 * is
 * a
 * demo
 * of
 * the
 * new
 * TokenStream
 * API
 * </pre>
 * <h4>Adding a LengthFilter</h4>
 * We want to suppress all tokens that have 2 or fewer characters. We can do that
 * easily by adding a LengthFilter to the chain.
 * Only the
 * <code>createComponents()</code> method in our analyzer needs to be changed:
 * <pre class="prettyprint">
 * {@literal @Override}
 * protected TokenStreamComponents createComponents(String fieldName) {
 *   final Tokenizer source = new WhitespaceTokenizer(matchVersion);
 *   TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
 *   return new TokenStreamComponents(source, result);
 * }
 * </pre>
 * Note how now only words with 3 or more characters are contained in the output:
 * <pre>
 * This
 * demo
 * the
 * new
 * TokenStream
 * API
 * </pre>
 * Now let's take a look at how the LengthFilter is implemented:
 * <pre class="prettyprint">
 * public final class LengthFilter extends FilteringTokenFilter {
 *
 *   private final int min;
 *   private final int max;
 *
 *   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 *
 *   // Create a new LengthFilter. This will filter out tokens whose
 *   // CharTermAttribute is either too short (&lt; min) or too long (&gt; max).
 *   //   version: the Lucene match version
 *   //   in:      the TokenStream to consume
 *   //   min:     the minimum length
 *   //   max:     the maximum length
 *   public LengthFilter(Version version, TokenStream in, int min, int max) {
 *     super(version, in);
 *     this.min = min;
 *     this.max = max;
 *   }
 *
 *   {@literal @Override}
 *   public boolean accept() {
 *     final int len = termAtt.length();
 *     return (len &gt;= min &amp;&amp; len &lt;= max);
 *   }
 *
 * }
 * </pre>
 * <p>
 * In LengthFilter, the CharTermAttribute is added and stored in the instance
 * variable <code>termAtt</code>. Remember that there can only be a single
 * instance of CharTermAttribute in the chain, so in our example the
 * <code>addAttribute()</code> call in LengthFilter returns the
 * CharTermAttribute that the WhitespaceTokenizer already added.
 * </p>
 * <p>
 * The tokens are retrieved from the input stream in FilteringTokenFilter's
 * <code>incrementToken()</code> method (see below), which calls LengthFilter's
 * <code>accept()</code> method. By looking at the term text in the
 * CharTermAttribute, the length of the term can be determined and tokens that
 * are either too short or too long are skipped. Note how
 * <code>accept()</code> can efficiently access the instance variable; no
 * attribute lookup is necessary. The same is true for the consumer, which can
 * simply use local references to the Attributes.
 * </p>
 * <p>
 * LengthFilter extends FilteringTokenFilter:
 * </p>
 *
 * <pre class="prettyprint">
 * public abstract class FilteringTokenFilter extends TokenFilter {
 *
 *   private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
 *
 *   // Create a new FilteringTokenFilter.
 *   //   in: the TokenStream to consume
 *   public FilteringTokenFilter(Version version, TokenStream in) {
 *     super(in);
 *   }
 *
 *   // Override this method and return whether the current input token should be returned by incrementToken.
 *   protected abstract boolean accept() throws IOException;
 *
 *   {@literal @Override}
 *   public final boolean incrementToken() throws IOException {
 *     int skippedPositions = 0;
 *     while (input.incrementToken()) {
 *       if (accept()) {
 *         if (skippedPositions != 0) {
 *           posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
 *         }
 *         return true;
 *       }
 *       skippedPositions += posIncrAtt.getPositionIncrement();
 *     }
 *     // reached EOS -- return false
 *     return false;
 *   }
 *
 *   {@literal @Override}
 *   public void reset() throws IOException {
 *     super.reset();
 *   }
 *
 * }
 * </pre>
 *
 * <h4>Adding a custom Attribute</h4>
 * Now we're going to implement our own custom Attribute for part-of-speech tagging and, consequently, call it
 * <code>PartOfSpeechAttribute</code>. First we need to define the interface of the new Attribute:
 * <pre class="prettyprint">
 * public interface PartOfSpeechAttribute extends Attribute {
 *   public static enum PartOfSpeech {
 *     Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
 *   }
 *
 *   public void setPartOfSpeech(PartOfSpeech pos);
 *
 *   public PartOfSpeech getPartOfSpeech();
 * }
 * </pre>
 * <p>
 * Now we also need to write the implementing class. The name of that class is important here: By default, Lucene
 * checks if there is a class with the name of the Attribute with the suffix 'Impl'. In this example, we would
 * consequently call the implementing class <code>PartOfSpeechAttributeImpl</code>.
 * </p>
 * <p>
 * This is the default behavior. However, there is also an expert API that allows changing these naming conventions:
 * {@link org.apache.lucene.util.AttributeFactory}. The factory accepts an Attribute interface as argument
 * and returns an actual instance. You can implement your own factory if you need to change the default behavior.
 * </p>
 * <p>
 * Now here is the actual class that implements our new Attribute. Notice that the class has to extend
 * {@link org.apache.lucene.util.AttributeImpl}:
 * </p>
 * <pre class="prettyprint">
 * public final class PartOfSpeechAttributeImpl extends AttributeImpl
 *     implements PartOfSpeechAttribute {
 *
 *   private PartOfSpeech pos = PartOfSpeech.Unknown;
 *
 *   public void setPartOfSpeech(PartOfSpeech pos) {
 *     this.pos = pos;
 *   }
 *
 *   public PartOfSpeech getPartOfSpeech() {
 *     return pos;
 *   }
 *
 *   {@literal @Override}
 *   public void clear() {
 *     pos = PartOfSpeech.Unknown;
 *   }
 *
 *   {@literal @Override}
 *   public void copyTo(AttributeImpl target) {
 *     ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
 *   }
 * }
 * </pre>
 * <p>
 * This simple Attribute implementation has only a single variable that
 * stores the part of speech of a token. It extends the
 * <code>AttributeImpl</code> class and therefore implements its abstract methods
 * <code>clear()</code> and <code>copyTo()</code>. Now we need a TokenFilter that
 * can set this new PartOfSpeechAttribute for each token. In this example we
 * show a very naive filter that tags every word with a leading upper-case letter
 * as a 'Noun' and all other words as 'Unknown'.
 * </p>
 * <pre class="prettyprint">
 * public static class PartOfSpeechTaggingFilter extends TokenFilter {
 *   PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
 *   CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 *
 *   protected PartOfSpeechTaggingFilter(TokenStream input) {
 *     super(input);
 *   }
 *
 *   {@literal @Override}
 *   public boolean incrementToken() throws IOException {
 *     if (!input.incrementToken()) { return false; }
 *     posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
 *     return true;
 *   }
 *
 *   // determine the part of speech for the given term
 *   protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
 *     // naive implementation that tags every uppercased word as noun
 *     if (length &gt; 0 &amp;&amp; Character.isUpperCase(term[0])) {
 *       return PartOfSpeech.Noun;
 *     }
 *     return PartOfSpeech.Unknown;
 *   }
 * }
 * </pre>
 * <p>
 * Just like the LengthFilter, this new filter stores references to the
 * attributes it needs in instance variables. Notice how you only need to pass
 * in the interface of the new Attribute and instantiating the correct class
 * is automatically taken care of.
 * </p>
 * <p>Now we need to add the filter to the chain in MyAnalyzer:</p>
 * <pre class="prettyprint">
 * {@literal @Override}
 * protected TokenStreamComponents createComponents(String fieldName) {
 *   final Tokenizer source = new WhitespaceTokenizer(matchVersion);
 *   TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
 *   result = new PartOfSpeechTaggingFilter(result);
 *   return new TokenStreamComponents(source, result);
 * }
 * </pre>
 * Now let's look at the output:
 * <pre>
 * This
 * demo
 * the
 * new
 * TokenStream
 * API
 * </pre>
 * Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not
 * affect any existing consumers, simply because they don't know about the new Attribute. Now let's change the consumer
 * to make use of the new PartOfSpeechAttribute and print it out:
 * <pre class="prettyprint">
 * public static void main(String[] args) throws IOException {
 *   // text to tokenize
 *   final String text = "This is a demo of the new TokenStream API";
 *
 *   Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
 *   MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
 *   TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
 *
 *   // get the CharTermAttribute from the TokenStream
 *   CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
 *
 *   // get the PartOfSpeechAttribute from the TokenStream
 *   PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);
 *
 *   try {
 *     stream.reset();
 *
 *     // print all tokens until stream is exhausted
 *     while (stream.incrementToken()) {
 *       System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
 *     }
 *
 *     stream.end();
 *   } finally {
 *     stream.close();
 *   }
 * }
 * </pre>
 * The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in
 * the while loop that consumes the stream. Here is the new output:
 * <pre>
 * This: Noun
 * demo: Unknown
 * the: Unknown
 * new: Unknown
 * TokenStream: Noun
 * API: Noun
 * </pre>
 * Each word is now followed by its assigned PartOfSpeech tag. Of course, this is naive
 * part-of-speech tagging. The word 'This' should not even be tagged as a noun; it is only capitalized because it
 * is the first word of a sentence. Actually, this is a good opportunity for an exercise.
 * To practice the usage of the new
 * API, the reader could now write an Attribute and a TokenFilter that specify for each word whether it was the first token
 * of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words
 * as nouns if they are not the first word of a sentence (we know, this is still not correct behavior, but hey, it's a good exercise).
 * As a small hint, this is how the new Attribute class could begin:
 * <pre class="prettyprint">
 * public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
 *     implements FirstTokenOfSentenceAttribute {
 *
 *   private boolean firstToken;
 *
 *   public void setFirstToken(boolean firstToken) {
 *     this.firstToken = firstToken;
 *   }
 *
 *   public boolean getFirstToken() {
 *     return firstToken;
 *   }
 *
 *   {@literal @Override}
 *   public void clear() {
 *     firstToken = false;
 *   }
 *
 *   ...
 * </pre>
 * <h4>Adding a CharFilter chain</h4>
 * Analyzers take Java {@link java.io.Reader}s as input. Of course you can wrap your Readers with {@link java.io.FilterReader}s
 * to manipulate content, but this would have the big disadvantage that character offsets might be inconsistent with your original
 * text.
 * <p>
 * {@link org.apache.lucene.analysis.CharFilter} is designed to allow you to pre-process input like a FilterReader would, but also
 * preserve the original offsets associated with those characters. This way mechanisms like highlighting still work correctly.
 * CharFilters can be chained.
 * </p>
 * <p>
 * Example:
 * </p>
 * <pre class="prettyprint">
 * public class MyAnalyzer extends Analyzer {
 *
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     return new TokenStreamComponents(new MyTokenizer());
 *   }
 *
 *   {@literal @Override}
 *   protected Reader initReader(String fieldName, Reader reader) {
 *     // wrap the Reader in a CharFilter chain.
 *     return new SecondCharFilter(new FirstCharFilter(reader));
 *   }
 * }
 * </pre>
 */
package org.apache.lucene.analysis;