package-info.java example

/**
 * Copyright 2014 National University of Ireland, Galway.
 *
 * This file is part of the SIREn project. Project and contact information:
 *
 *  https://github.com/rdelbru/SIREn
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *  http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/**
 * Analyzer for indexing JSON content.
 *
 * <h2>Introduction</h2>
 *
 * This package extends Lucene's analysis API to provide support for
 * parsing and indexing JSON content. For an introduction to Lucene's analysis
 * API, see the {@link org.apache.lucene.analysis} package documentation.
 *
 *
 * <h2>Overview of the API</h2>
 *
 * This package contains concrete components
 * ({@link org.apache.lucene.util.Attribute}s,
 * {@link org.apache.lucene.analysis.Tokenizer}s and
 * {@link org.apache.lucene.analysis.TokenFilter}s) for analyzing JSON
 * content.
 * <p>
 * It also provides a pre-built JSON analyzer,
 * {@link org.sindice.siren.analysis.JsonAnalyzer}, that you can use to get
 * started quickly.
 * <p>
 * In addition, it contains a number of
 * {@link org.sindice.siren.analysis.NumericAnalyzer}s that provide support
 * for numeric datatypes.
 * <p>
 * SIREn's analysis API is divided into several packages:
 * <ul>
 * <li><b>{@link org.sindice.siren.analysis.attributes}</b> contains a number of
 * {@link org.apache.lucene.util.Attribute}s that are used to add metadata
 * to a stream of tokens.
 * <li><b>{@link org.sindice.siren.analysis.filter}</b> contains a number of
 * {@link org.apache.lucene.analysis.TokenFilter}s that alter incoming tokens.
 * </ul>
 *
 * <h2>JSON Analyzer</h2>
 *
 * The {@link org.sindice.siren.analysis.JsonTokenizer JSON tokenizer} parses
 * the JSON data and converts it into an abstract tree model. The conversion
 * is performed in streaming mode, while the data is being parsed.
 *
 * <p>
 *
 * The tokenizer traverses the JSON tree depth-first. During the traversal,
 * it increments the Dewey code (i.e., the node label) whenever an object, an
 * array, a field or a value is encountered. The tokenizer attaches the
 * current node label to every token it generates, using the
 * {@link org.sindice.siren.analysis.attributes.NodeAttribute}.
 *
 * <p>
 *
 * The tokenizer also attaches datatype metadata to every token it generates,
 * using the {@link org.sindice.siren.analysis.attributes.DatatypeAttribute}.
 * A datatype specifies the type of the data a node contains. By default, the
 * tokenizer distinguishes five datatypes in the JSON syntax:
 *
 * <ul>
 * <li> {@link org.sindice.siren.util.XSDDatatype#XSD_STRING}
 * <li> {@link org.sindice.siren.util.XSDDatatype#XSD_LONG}
 * <li> {@link org.sindice.siren.util.XSDDatatype#XSD_DOUBLE}
 * <li> {@link org.sindice.siren.util.XSDDatatype#XSD_BOOLEAN}
 * <li> {@link org.sindice.siren.util.JSONDatatype#JSON_FIELD}
 * </ul>
 *
 * The datatype metadata determines how the content of a node is analyzed.
 * This analysis is performed by the
 * {@link org.sindice.siren.analysis.filter.DatatypeAnalyzerFilter}. The
 * analyzer used for each datatype can be configured with the method
 * {@link org.sindice.siren.analysis.JsonAnalyzer#registerDatatype(char[], org.apache.lucene.analysis.Analyzer)}.
 *
 * <p>
 * 
 * Custom datatypes can also be used through a dedicated datatype JSON object.
 * The schema of the datatype JSON object is the following:
 * <pre>
 * {
 *   "_datatype_" : &lt;LABEL&gt;,
 *   "_value_" : &lt;VALUE&gt;
 * }
 * </pre>
 * <code>&lt;LABEL&gt;</code> is a string that names the datatype to be
 * applied to the associated value.
 * <code>&lt;VALUE&gt;</code> is the associated value and is a string.
 * The datatype JSON object is a way of passing custom datatypes to SIREn.
 * It has no influence on the label of the value node.
 * For example, the label (i.e., <code>0.0</code>) of the value <code>b</code> below:
 * <pre>
 * {
 *   "a" : "b"
 * }
 * </pre>
 * is the same as the label of the value <code>b</code> with a custom datatype:
 * <pre>
 * {
 *   "a" : {
 *     "_datatype_" : "my datatype",
 *     "_value_" : "b"
 *   }
 * }
 * </pre>
 * 
 * <p>
 *
 * SIREn uses Lucene's
 * {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute payload}
 * interface to encode information such as the node label and
 * the position of the token. This payload is then decoded by the
 * {@link org.sindice.siren.index index API} and re-encoded into the node-based
 * inverted index data structure.
 *
 */
package org.sindice.siren.analysis;
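
The node labelling described in the package documentation above can be illustrated with a small standalone sketch. It does not use SIREn or Lucene: the class `DeweyLabelSketch`, its methods, and the output format are hypothetical, and the JSON input is modelled with plain `java.util` maps rather than parsed text. The documentation only confirms that the value `b` in `{"a" : "b"}` has the label `0.0` and that a datatype JSON object leaves this label unchanged. How fields and nested objects are numbered here is an assumption chosen to match that example, and arrays are left out. The datatype names are plain placeholder strings, not the actual SIREn constants.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; this class is not part of SIREn.
public class DeweyLabelSketch {

  // Walks the root object depth-first and returns one line per token,
  // formatted as "label text [datatype]".
  public static List<String> label(Map<String, Object> root) {
    List<String> tokens = new ArrayList<>();
    walkObject(root, "", tokens);
    return tokens;
  }

  private static void walkObject(Map<String, Object> object, String prefix,
                                 List<String> tokens) {
    int index = 0;
    for (Map.Entry<String, Object> field : object.entrySet()) {
      // Assumption: the i-th field of an object is the i-th child of that object.
      String fieldLabel = prefix.isEmpty() ? Integer.toString(index) : prefix + "." + index;
      tokens.add(fieldLabel + " " + field.getKey() + " [JSON_FIELD]");
      // Assumption: the value of a field is the first child (index 0) of the field.
      walkValue(field.getValue(), fieldLabel + ".0", tokens);
      index++;
    }
  }

  @SuppressWarnings("unchecked")
  private static void walkValue(Object value, String label, List<String> tokens) {
    if (value instanceof Map) {
      Map<String, Object> map = (Map<String, Object>) value;
      boolean isDatatypeObject = map.size() == 2
          && map.containsKey("_datatype_") && map.containsKey("_value_");
      if (isDatatypeObject) {
        // The wrapper only carries the datatype; the value keeps the label
        // it would have had as a plain value.
        tokens.add(label + " " + map.get("_value_") + " [" + map.get("_datatype_") + "]");
      } else {
        // A nested object emits no token itself; its fields become its children.
        walkObject(map, label, tokens);
      }
    } else if (value instanceof Integer || value instanceof Long) {
      tokens.add(label + " " + value + " [XSD_LONG]");
    } else if (value instanceof Float || value instanceof Double) {
      tokens.add(label + " " + value + " [XSD_DOUBLE]");
    } else if (value instanceof Boolean) {
      tokens.add(label + " " + value + " [XSD_BOOLEAN]");
    } else {
      tokens.add(label + " " + value + " [XSD_STRING]");
    }
  }

  public static void main(String[] args) {
    // { "a" : "b" }
    Map<String, Object> plain = new LinkedHashMap<>();
    plain.put("a", "b");
    System.out.println(label(plain)); // [0 a [JSON_FIELD], 0.0 b [XSD_STRING]]

    // { "a" : { "_datatype_" : "my datatype", "_value_" : "b" } }
    Map<String, Object> wrapper = new LinkedHashMap<>();
    wrapper.put("_datatype_", "my datatype");
    wrapper.put("_value_", "b");
    Map<String, Object> typed = new LinkedHashMap<>();
    typed.put("a", wrapper);
    System.out.println(label(typed)); // [0 a [JSON_FIELD], 0.0 b [my datatype]]
  }
}
```

In both cases the value `b` ends up with the label `0.0`; only the datatype attached to it differs, which is the behaviour the documentation describes for the datatype JSON object.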