/**
* Copyright 2014 National University of Ireland, Galway.
*
* This file is part of the SIREn project. Project and contact information:
*
* https://github.com/rdelbru/SIREn
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Analyzer for indexing JSON content.
*
* <h2>Introduction</h2>
*
* This package extends Lucene's analysis API to provide support for
* parsing and indexing JSON content. For an introduction to Lucene's analysis
* API, see the {@link org.apache.lucene.analysis} package documentation.
*
*
* <h2>Overview of the API</h2>
*
* This package contains concrete components
* ({@link org.apache.lucene.util.Attribute}s,
* {@link org.apache.lucene.analysis.Tokenizer}s and
* {@link org.apache.lucene.analysis.TokenFilter}s) for analyzing
* JSON content.
* <p>
* It also provides a pre-built JSON analyzer
* {@link org.sindice.siren.analysis.JsonAnalyzer} that you can use to get
* started quickly.
* <p>
* It also contains a number of
* {@link org.sindice.siren.analysis.NumericAnalyzer}s that support
* numeric datatypes.
* <p>
* SIREn's analysis API is divided into several packages:
* <ul>
* <li><b>{@link org.sindice.siren.analysis.attributes}</b> contains a number of
* {@link org.apache.lucene.util.Attribute}s that are used to add metadata
* to a stream of tokens.
* <li><b>{@link org.sindice.siren.analysis.filter}</b> contains a number of
* {@link org.apache.lucene.analysis.TokenFilter}s that alter incoming tokens.
* </ul>
*
* <h2>JSON Analyzer</h2>
*
* The {@link org.sindice.siren.analysis.JsonTokenizer JSON tokenizer} parses
* the JSON data and converts it into an abstract tree model. The conversion
* is performed in streaming mode, while the data is being parsed.
*
* <p>
*
* The tokenizer traverses the JSON tree depth-first. During the traversal,
* it increments the Dewey code (i.e., the node label) whenever an object, an
* array, a field or a value is encountered. The tokenizer attaches the
* current node label to every token it generates using the
* {@link org.sindice.siren.analysis.attributes.NodeAttribute}.
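*
* <p>
*
* As a rough illustration of this labelling scheme, the sketch below walks an
* already-parsed JSON object depth-first and pairs every field and value with
* a Dewey label. It is not the code of the
* {@link org.sindice.siren.analysis.JsonTokenizer}: the class
* {@code DeweyLabelSketch} and its methods are hypothetical, arrays are left
* out, and the rule that the content of a field (including a nested object)
* is labelled as the first child of that field is an assumption made for the
* example.
*
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of SIREn.
final class DeweyLabelSketch {

  // Returns "label token" strings in depth-first order.
  static List<String> label(Map<String, Object> root) {
    List<String> out = new ArrayList<>();
    visitObject(root, "", out);
    return out;
  }

  private static void visitObject(Map<?, ?> object, String parent, List<String> out) {
    int index = 0;
    for (Map.Entry<?, ?> field : object.entrySet()) {
      // Each field is the next child of the enclosing node.
      String fieldLabel = child(parent, index++);
      out.add(fieldLabel + " " + field.getKey());
      // Assumption: the content of a field is labelled as its first child.
      String valueLabel = child(fieldLabel, 0);
      Object value = field.getValue();
      if (value instanceof Map) {
        visitObject((Map<?, ?>) value, valueLabel, out);
      } else {
        out.add(valueLabel + " " + value);
      }
    }
  }

  // Appends a child index to a parent label; the root has an empty label.
  private static String child(String parent, int index) {
    return parent.isEmpty() ? String.valueOf(index) : parent + "." + index;
  }
}
```
*
* Calling {@code DeweyLabelSketch.label} on the object {@code {"a" : "b"}}
* yields {@code 0 a} followed by {@code 0.0 b}.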
*
* <p>
*
* The tokenizer also attaches datatype metadata to every token it generates,
* using the {@link org.sindice.siren.analysis.attributes.DatatypeAttribute}.
* A datatype specifies the type of the data a node contains. By default, the
* tokenizer distinguishes five datatypes in the JSON syntax:
*
* <ul>
* <li> {@link org.sindice.siren.util.XSDDatatype#XSD_STRING}
* <li> {@link org.sindice.siren.util.XSDDatatype#XSD_LONG}
* <li> {@link org.sindice.siren.util.XSDDatatype#XSD_DOUBLE}
* <li> {@link org.sindice.siren.util.XSDDatatype#XSD_BOOLEAN}
* <li> {@link org.sindice.siren.util.JSONDatatype#JSON_FIELD}
* </ul>
*
* The datatype metadata is used to perform an appropriate analysis of the
* content of a node. Such analysis is performed by the
* {@link org.sindice.siren.analysis.filter.DatatypeAnalyzerFilter}. The
* analysis of each datatype can be configured freely by the user using the
* method
* {@link org.sindice.siren.analysis.JsonAnalyzer#registerDatatype(char[], org.apache.lucene.analysis.Analyzer)}.
*
* <p>
*
* Custom datatypes can also be specified with a dedicated datatype JSON
* object. The schema of the datatype JSON object is the following:
* <pre>
* {
* "_datatype_" : &lt;LABEL&gt;,
* "_value_" : &lt;VALUE&gt;
* }
* </pre>
* <code>&lt;LABEL&gt;</code> is a string that names the datatype to be
* applied to the associated value.
* <code>&lt;VALUE&gt;</code> is the associated value and is a string.
* The datatype JSON object is a way to pass custom datatypes to SIREn.
* It does not affect the label of the value node.
* For example, the label (i.e., <code>0.0</code>) of the value <code>b</code> below:
* <pre>
* {
* "a" : "b"
* }
* </pre>
* is the same as the label of the value <code>b</code> with a custom datatype:
* <pre>
* {
* "a" : {
* "_datatype_" : "my datatype",
* "_value_" : "b"
* }
* }
* </pre>
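*
* <p>
*
* The following sketch illustrates how such a datatype JSON object can be
* reduced to a single (datatype, value) pair, which is why it adds no extra
* node to the tree. It is a simplified illustration and not SIREn's parsing
* code: the class {@code DatatypeObjectSketch} is hypothetical, the input is
* assumed to be already parsed into Java objects, and the default datatype
* names {@code string}, {@code long}, {@code double} and {@code boolean}
* are placeholders for the constants listed above.
*
```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of SIREn.
final class DatatypeObjectSketch {

  static final String DATATYPE_KEY = "_datatype_";
  static final String VALUE_KEY = "_value_";

  // Returns {datatype label, lexical value} for the content of a field.
  static String[] resolve(Object content) {
    if (content instanceof Map) {
      Map<?, ?> object = (Map<?, ?>) content;
      Object label = object.get(DATATYPE_KEY);
      Object value = object.get(VALUE_KEY);
      // A datatype object holds exactly the two reserved keys, both strings.
      if (object.size() != 2 || !(label instanceof String) || !(value instanceof String)) {
        throw new IllegalArgumentException("Not a datatype JSON object: " + object);
      }
      return new String[] { (String) label, (String) value };
    }
    return new String[] { defaultDatatype(content), String.valueOf(content) };
  }

  // Placeholder names (assumption); they stand in for the default datatypes.
  private static String defaultDatatype(Object content) {
    if (content instanceof String) return "string";
    if (content instanceof Long || content instanceof Integer) return "long";
    if (content instanceof Double || content instanceof Float) return "double";
    if (content instanceof Boolean) return "boolean";
    throw new IllegalArgumentException("Unsupported content: " + content);
  }
}
```
*
* In this sketch, the plain value {@code "b"} and the datatype object shown
* above both resolve to the same lexical value {@code b}; only the datatype
* label differs.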
*
* <p>
*
* SIREn uses Lucene's
* {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute payload}
* interface to encode information such as the node label and the position of
* the token. This payload is then decoded by the
* {@link org.sindice.siren.index index API} and re-encoded into the node-based
* inverted index data structure.
*
*/
package org.sindice.siren.analysis;