/**
 * Copyright 2014 National University of Ireland, Galway.
 *
 * This file is part of the SIREn project. Project and contact information:
 *
 *  https://github.com/rdelbru/SIREn
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *  http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/**
 * Programmatic API to search node-based inverted indexes.
 *
 * <h2>Introduction</h2>
 *
 * This package contains the API for building queries to search JSON data
 * over node-based inverted indexes. For an introduction to Lucene's
 * search API, see the {@link org.apache.lucene.search} package documentation.
 *
 * <h2>Search Basics</h2>
 *
 * In contrast to Lucene's {@link org.apache.lucene.search.Query} API,
 * which provides complex querying capabilities to search for documents, SIREn
 * provides a {@link org.sindice.siren.search.node.NodeQuery} API that offers
 * complex querying capabilities to search for both nodes and documents. The
 * information retrieved consists not only of the matching documents, but also
 * of the matching nodes within these documents.
 *
 * <p>
 *
 * SIREn offers a wide variety of
 * {@link org.sindice.siren.search.node.NodeQuery} implementations. Most of them
 * are similar to those provided by Lucene's
 * {@link org.apache.lucene.search.Query} API. For example, while Lucene
 * provides a {@link org.apache.lucene.search.TermQuery} implementation
 * to search documents that contain a specific term, SIREn provides a {@link
 * org.sindice.siren.search.node.NodeTermQuery} implementation to search nodes
 * and documents that contain a specific term.
 *
 * <h3>Level and Range Constraints</h3>
 *
 * The {@link org.sindice.siren.search.node.NodeQuery} class provides methods
 * to set constraints on the nodes matched by the query. There are two types of
 * constraints:
 * <ul>
 *   <li> Level constraint: this constraint filters out all nodes that do
 *   not belong to the specified level of the tree.
 *   <li> Interval constraint: this constraint filters out all nodes whose
 *   Dewey code vector ends with an integer that is not contained in the
 *   specified interval.
 * </ul>
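 *
 * <p>
 * As a minimal sketch of how such constraints might be applied, assuming
 * that they are set through <tt>setLevelConstraint(int)</tt> and
 * <tt>setNodeConstraint(int, int)</tt> (method names and signatures are
 * assumptions here and should be checked against the
 * {@link org.sindice.siren.search.node.NodeQuery} class documentation; the
 * field name, term and bounds are illustrative):
 * <pre>
 *      NodeTermQuery tq = new NodeTermQuery(new Term("json-field", "term"));
 *      // Assumed setter: keep only the nodes found at level 2 of the tree
 *      tq.setLevelConstraint(2);
 *      // Assumed setter: keep only the nodes whose last Dewey code integer
 *      // falls in the interval bounded by 0 and 3
 *      tq.setNodeConstraint(0, 3);
 * </pre>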
 *
 * <h2>Query Classes</h2>
 *
 * <h3>{@link org.sindice.siren.search.node.NodeTermQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodeTermQuery} matches all the
 * nodes that contain the specified {@link org.apache.lucene.index.Term},
 * which is a word that occurs in a certain
 * {@link org.apache.lucene.document.Field} containing JSON data.
 * <p>
 * Constructing a {@link org.sindice.siren.search.node.NodeTermQuery} is as
 * simple as:
 * <pre>
 *      NodeTermQuery tq = new NodeTermQuery(new Term("json-field", "term"));
 * </pre>
 *
 * In this example, the {@link org.sindice.siren.search.node.NodeQuery}
 * identifies all {@link org.apache.lucene.document.Document}s that have the
 * {@link org.apache.lucene.document.Field} named <tt>"json-field"</tt>
 * where a node contains the word <tt>"term"</tt>.
 *
 * <h3>{@link org.sindice.siren.search.node.NodePhraseQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodePhraseQuery} matches all the nodes
 * containing the specified phrase. A phrase is defined as a sequence of
 * {@link org.apache.lucene.index.Term}s.
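 *
 * <p>
 * As a minimal sketch, assuming that terms are appended in phrase order with
 * an <tt>add(Term)</tt> method mirroring Lucene's
 * {@link org.apache.lucene.search.PhraseQuery} (an assumption to verify
 * against the class documentation; the field name and terms are
 * illustrative):
 * <pre>
 *      NodePhraseQuery pq = new NodePhraseQuery();
 *      // Assumed method: each call appends the next term of the phrase
 *      pq.add(new Term("json-field", "inverted"));
 *      pq.add(new Term("json-field", "index"));
 * </pre>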
 *
 * <h3>{@link org.sindice.siren.search.node.NodeBooleanQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodeBooleanQuery} matches all the
 * nodes containing the specified boolean combination of queries.
 * A {@link org.sindice.siren.search.node.NodeBooleanQuery} contains multiple
 * {@link org.sindice.siren.search.node.NodeBooleanClause}s, where each clause
 * contains a sub-query
 * (a {@link org.sindice.siren.search.node.NodePrimitiveQuery} instance) and an
 * operator (from {@link org.sindice.siren.search.node.NodeBooleanClause.Occur})
 * describing how that sub-query is combined with the other clauses. The
 * semantics of {@link org.sindice.siren.search.node.NodeBooleanClause.Occur}
 * are identical to those of {@link org.apache.lucene.search.BooleanClause.Occur}.
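 *
 * <p>
 * As a minimal sketch, assuming an <tt>add(NodeQuery, Occur)</tt> method that
 * mirrors Lucene's <tt>BooleanQuery.add</tt> (an assumption to verify against
 * the class documentation; the field name and terms are illustrative):
 * <pre>
 *      NodeBooleanQuery bq = new NodeBooleanQuery();
 *      // Assumed method: the node must contain "siren" ...
 *      bq.add(new NodeTermQuery(new Term("json-field", "siren")), Occur.MUST);
 *      // ... and must not contain "lucene"
 *      bq.add(new NodeTermQuery(new Term("json-field", "lucene")), Occur.MUST_NOT);
 * </pre>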
 *
 * <h3>{@link org.sindice.siren.search.node.NodeTermRangeQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodeTermRangeQuery} matches all
 * nodes containing a term that occurs in the inclusive or exclusive range of a
 * lower {@link org.apache.lucene.index.Term Term} and an upper
 * {@link org.apache.lucene.index.Term Term} according to
 * {@link org.apache.lucene.index.TermsEnum#getComparator TermsEnum.getComparator()}.
 * It is not intended for numerical ranges; use
 * {@link org.sindice.siren.search.node.NodeNumericRangeQuery} instead.
 *
 * <h3>{@link org.sindice.siren.search.node.NodeNumericRangeQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodeNumericRangeQuery} matches all
 * nodes containing a value that occurs in a numeric range. For
 * NodeNumericRangeQuery to work, you must index the values using datatypes
 * that are configured with the appropriate numeric analyzers
 * ({@link org.sindice.siren.analysis.NumericAnalyzer}).
 *
 * <h3>{@link org.sindice.siren.search.node.NodePrefixQuery},
 *     {@link org.sindice.siren.search.node.NodeWildcardQuery},
 *     {@link org.sindice.siren.search.node.NodeRegexpQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodePrefixQuery} matches all nodes
 * containing terms that begin with the specified string. A
 * {@link org.sindice.siren.search.node.NodeWildcardQuery} generalizes this
 * by allowing for the use of <tt>+</tt> (matches 1 or more characters),
 * <tt>*</tt> (matches 0 or more characters) and
 * <tt>?</tt> (matches exactly one character) wildcards. Note that the
 * {@link org.sindice.siren.search.node.NodeWildcardQuery} can be quite slow. Also
 * note that a {@link org.sindice.siren.search.node.NodeWildcardQuery} pattern
 * should not start with <tt>+</tt>, <tt>*</tt> or <tt>?</tt>, as leading
 * wildcards are extremely slow.
 * Some QueryParsers may not allow this by default, but provide a
 * <code>setAllowLeadingWildcard</code> method to remove that protection.
 * The {@link org.sindice.siren.search.node.NodeRegexpQuery} is even more
 * general than NodeWildcardQuery, matching all nodes with terms that match a
 * regular expression pattern.
 *
 * <h3>{@link org.sindice.siren.search.node.NodeFuzzyQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.NodeFuzzyQuery} matches nodes that
 * contain terms similar to the specified term. Similarity is determined using
 * <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein (edit)
 * distance</a>.
 *
 * <h3>{@link org.sindice.siren.search.node.TwigQuery}</h3>
 *
 * A {@link org.sindice.siren.search.node.TwigQuery} combines
 * {@link org.sindice.siren.search.node.NodeQuery}s through Parent-Child or
 * Ancestor-Descendant relations. It is the basic building block for
 * tree-shaped queries.
 *
 * <p>
 *
 * A {@link org.sindice.siren.search.node.TwigQuery} is composed of a root and
 * of one or more children or descendants:
 * <ul>
 *  <li> The root is a {@link org.sindice.siren.search.node.NodeQuery} instance.
 *       An empty root is considered a wildcard node query and will match all
 *       nodes. We call "root nodes" the set of nodes that are retrieved by the
 *       root query.
 *  <li> A descendant is a {@link org.sindice.siren.search.node.NodeQuery}
 *       associated with an operator (from
 *       {@link org.sindice.siren.search.node.NodeBooleanClause.Occur}). A
 *       descendant query will match all the nodes for which there exists a
 *       path to a root node. A descendant is associated with a node level,
 *       which corresponds to the relative distance (in terms of levels) from
 *       the root.
 *  <li> A child is a descendant that is exactly one level below the root,
 *       that is, at the root level plus one.
 * </ul>
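 *
 * <p>
 * A minimal sketch of such a composition, assuming that the root is set with
 * an <tt>addRoot(NodeQuery)</tt> method and that descendants are added with an
 * <tt>addDescendant(int, NodeQuery, Occur)</tt> method whose first argument is
 * the relative node level (both methods are assumptions to verify against the
 * class documentation; the field name and terms are illustrative):
 * <pre>
 *      TwigQuery tw = new TwigQuery();
 *      // Assumed method: the root matches the nodes containing "person"
 *      tw.addRoot(new NodeTermQuery(new Term("json-field", "person")));
 *      // A child must match exactly one level away from the root
 *      tw.addChild(new NodeTermQuery(new Term("json-field", "name")), Occur.MUST);
 *      // Assumed method: an optional descendant two levels away from the root
 *      tw.addDescendant(2, new NodeTermQuery(new Term("json-field", "city")), Occur.SHOULD);
 * </pre>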
 *
 * <p>
 *
 * A twig query is always associated with a level. If no level is specified,
 * the level defaults to 1. When a twig query is used as a child or
 * descendant of another twig query, its level is automatically updated
 * according to the level of the parent twig query. For example, consider
 * the following instructions:
 * <pre>
 *      TwigQuery tw1 = new TwigQuery();
 *      TwigQuery tw2 = new TwigQuery();
 *      tw1.addChild(tw2, Occur.MUST);
 * </pre>
 *
 * In this example, the first twig query <tt>tw1</tt> is defined at the default
 * level 1. The second twig query <tt>tw2</tt>, after the call to
 * {@link org.sindice.siren.search.node.TwigQuery#addChild(NodeQuery, org.sindice.siren.search.node.NodeBooleanClause.Occur)},
 * will have its level updated to 2 since it is now a child of a twig query at
 * level 1.
 *
 * <h2>The Scorer Class</h2>
 *
 * The {@link org.sindice.siren.search.node.NodeScorer} abstract class provides
 * common scoring functionality for all the node scorer implementations, which
 * are at the heart of the SIREn scoring process.
 *
 * <p>
 *
 * The implementation of the query processing framework follows a node-at-a-time
 * approach, where the query operators (i.e., the
 * {@link org.sindice.siren.search.node.NodeScorer}s) process one node at a
 * time. The query processing framework has been designed for high-efficiency
 * processing:
 * <ol>
 *   <li> All the query operators leverage as much as possible the lazy-loading
 *   feature of the
 *   {@link org.sindice.siren.index.codecs.siren10.Siren10PostingsReader}. For
 *   example, there is no concept of next matching document (i.e.,
 *   {@link org.apache.lucene.search.Scorer#nextDoc()}) in the
 *   {@link NodeScorer} API, but instead the concept of next candidate
 *   document (i.e.,
 *   {@link org.sindice.siren.search.node.NodeScorer#nextCandidateDocument()}).
 *   This enables {@link org.sindice.siren.search.node.NodeConjunctionScorer} to
 *   efficiently iterate over the document identifiers without having to
 *   decode the node labels until a potential candidate is found.
 *   <li> The node label array (i.e., {@link org.apache.lucene.util.IntsRef})
 *   being processed is the same in all the query operators, which means that
 *   the same array is reused across operators and no new arrays are created
 *   during query processing.
 *   <li> The node label array is itself a slice of the array of the
 *   uncompressed node block. The node label array is created by sliding a
 *   window (i.e., {@link org.apache.lucene.util.IntsRef}) over the array of the
 *   uncompressed node block.
 * </ol>
 *
 */
package org.sindice.siren.search.node;