/**
* Copyright 2014 National University of Ireland, Galway.
*
* This file is part of the SIREn project. Project and contact information:
*
* https://github.com/rdelbru/SIREn
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Implementation of the encoding and decoding of the block-based
* postings format for SIREn 1.0.
*
* <h2>Introduction</h2>
*
* This package contains the implementation of the encoding and decoding of the
* SIREn 1.0 block-based postings format. This postings format is compatible
* with the Lucene 4.0 codec. For an introduction to Lucene 4.0 codec API, see
* the {@link org.apache.lucene.codecs.lucene40} package documentation.
*
* <h2>SIREn 1.0 Postings Format</h2>
*
* The SIREn 1.0 postings format is organised around four files:
* <ul>
* <li> The .doc file contains the document identifiers and node frequencies;
* <li> The .nod file contains the node labels and the term frequencies;
* <li> The .pos file contains the term positions;
* <li> The .skp file contains the skip data.
* </ul>
*
* The SIREn 1.0 postings format is divided into multiple blocks. The default
* block size is defined by
* {@link org.sindice.siren.index.codecs.siren10.Siren10PostingsFormat#DEFAULT_POSTINGS_BLOCK_SIZE}.
*
* <h3>Documents and Node Frequencies</h3>
*
* The .doc file contains a list of document identifiers for each
* term along with the node frequency of the term in that document. The
* document identifiers are ordered by increasing number.
*
* <p>
*
* The file contains one postings lists for each term. The term is implicit
* and the position of the start of the postings list for a term is provided
* by the term dictionary.
*
* <p>
*
* One postings list is organised by block. Each block has a maximum size which
* determines the maximum number of document identifiers it can contain. It is
* possible that one block contain less document identifiers than its maximum
* size. For example, if a postings list contain only one document identifier,
* the size of the block will be one.
*
* The block format follows the schema:
* <pre>
* Block = Header, CompressedDoc, CompressedNodeFreq
* Header = BlockSize,
* CompressedDocSize, CompressedNodeFreqSize,
* FirstDocId, LastDocId,
* NodeBlockPointer, PosBlockPointer
* CompressedDoc = [DeltaDocId]
* CompressedNodeFreq = [NodeFreq]
* </pre>
*
* <b>BlockSize</b> records the size of the block, i.e., the number of document
* identifiers.
* <p>
* <b>CompressedDocLength</b> records the size (in bytes) of the compressed byte
* array CompressedDoc.
* <p>
* <b>CompressedNodeFreqLength</b> records the size (in bytes) of the compressed
* byte array CompressedNodeFreq.
* <p>
* <b>FirstDocId</b> and <b>LastDocId</b> records the first and last document identifiers
* of the block. This information is used by the skip list algorithm. The
* LastDocId is encoded as delta between the FirstDocId.
* <p>
* <b>NodeBlockPointer</b> records the pointer of the node block in the .nod file that
* is associated to this block.
* <p>
* <b>PosBlockPointer</b> records the pointer of the position block in the .pos file
* that is associated to this block.
* <p>
* <b>CompressedDoc</b> is the compressed list of document identifiers. This list is
* compressed using the AFOR algorithm. The document identifiers are encoded
* as delta. The first document of this list is always 0 as it is encoded as
* delta with FirstDocId.
* <p>
* <b>CompressedNodeFreq</b> is the compressed list of node frequencies. This list is
* compressed using the AFOR algorithm. There is one node frequency per document
* identifier. The node frequency is encoded with a decrement of 1 to optimise
* AFOR compression: it gives a higher chance to get smaller bit frames.
*
* <h3>Node Labels and Term Frequencies</h3>
*
* The .nod file contains a list of node labels for each
* document along with the length of each node labels, and the frequency of the
* term for each node. The node labels are ordered by increasing dewey codes.
*
* <p>
*
* The file is organised by block. Each block is synchronised with a .doc block,
* i.e., a single block contains all the node labels for the complete set of
* document identifiers contained in the .doc block. A block has a variable size
* which is determined by the number of node labels associated with the
* documents. Synchronising blocks across files simplifies encoding and decoding
* instructions and improves the performance.
*
* <p>
*
* The block format follows the schema:
* <pre>
* Block = Header, CompressedNodeLength, CompressedNode, CompressedTermFreq
* Header = NodeLengthBlockSize, NodeBlockSize, TermFreqBlockSize,
* CompressedNodeLengthSize, CompressedNodeSize, CompressedTermFreqSize
* CompressedNodeLength = [NodeLength]
* CompressedNode = [DeltaNode]
* CompressedTermFreq = [TermFreq]
* </pre>
*
* <b>NodeLengthBlockSize</b> records the size of the block of the node lengths,
* i.e., the number of node lengths.
* <p>
* <b>NodeBlockSize</b> records the size of the node block, i.e., the number of
* integers composing the node labels.
* <p>
* <b>TermFreqBlockSize</b> records the size of the term frequency block, i.e.,
* the number of term frequencies.
* <p>
* <b>CompressedNodeLengthSize</b> records the size (in bytes) of the compressed
* byte array CompressedNodeLength.
* <p>
* <b>CompressedNodeSize</b> records the size (in bytes) of the compressed
* byte array CompressedNode.
* <p>
* <b>CompressedTermFreqSize</b> records the size (in bytes) of the compressed
* byte array CompressedTermFreq.
* <p>
* <b>CompressedNodeLength</b> is the compressed list of node lengths. Since
* each node label can have a different length, the node length records the
* number of integers that composes a node label. This list
* is compressed using the AFOR algorithm. The node frequency is encoded with a
* decrement of 1 to optimise AFOR compression: it gives a higher chance to get
* smaller bit frames.
* <p>
* <b>CompressedNode</b> is the compressed list of node labels. This list is
* compressed using the AFOR algorithm. The node labels relative to a document
* are encoded as delta.
* <p>
* <b>CompressedTermFreq</b> is the compressed list of term frequencies. This list is
* compressed using the AFOR algorithm. There is one term frequency per node
* label. The node frequency is encoded with a decrement of 1 to optimise
* AFOR compression: it gives a higher chance to get smaller bit frames.
*
* <h3>Term Positions</h3>
*
* The .pos file contains the list of term positions within nodes. The term
* positions are ordered by increasing number.
*
* <p>
*
* The file is organised by block. Each block is synchronised with a .nod block,
* i.e., a single block contains all the term positions for the complete set of
* node labels contained in the .nod block. A block has a variable size
* which is determined by the number of term positions associated with the
* nodes. Synchronising blocks across files simplifies encoding and decoding
* instructions and improves the performance.
*
* <p>
*
* The block format follows the schema:
* <pre>
* Block = Header, CompressedTermPos
* Header = TermPosBlockSize,
* CompressedTermPosSize
* CompressedTermPos = [DeltaTermPos]
* </pre>
*
* <b>TermPosBlockSize</b> records the size of the term position block, i.e.,
* the number of term positions.
* <p>
* <b>CompressedTermPosSize</b> records the size (in bytes) of the compressed
* byte array CompressedTermPos.
* <p>
* <b>CompressedTermPos</b> is the compressed list of term positions. This list
* is compressed using the AFOR algorithm. There is one or more term positions
* per node. The number of term positions per node is provided by the .nod block
* with the term frequency information. The term positions relative to a node
* are encoded as delta.
*
* <h3>Skip Lists</h3>
*
* The .skp file contains the skip data. The structure of the skip table
* is quite similar to Lucene40PostingsFormat. However, the skip data is defined
* around the concept of block instead of document. The skip interval defines
* the number of blocks between each skip data. The default skip interval is
* defined by
* {@link org.sindice.siren.index.codecs.siren10.Siren10PostingsFormat#DEFAULT_POSTINGS_BLOCK_SIZE}.
* Each skip entry points to the beginning of one block.
*
* <p>
*
* In contrast to the Lucene skip lists, part of the skip data is inlined within
* the .doc file. The pointers to the .nod block and .pos block associated to
* the .doc block are encoded in the header of the .doc file by NodeBlockPointer
* and PosBlockPointer.
*
* <p>
*
* For more information about our skip table approach over blocks, please refer
* to the publication <a href="http://dx.doi.org/10.1007/978-3-642-20161-5_55">
* SkipBlock: Self-indexing for Block-Based Inverted List</a>.
*
* <h2>Interaction with the Postings List</h2>
*
* The reading of the SIREn 1.0 postings format relies on the lazy-loading
* approach. Any information, e.g., term positions, term frequencies,
* node labels and node frequencies, that are not requested explicitly are (1)
* never decoded, and (2) not read from disk whenever possible.
*
*/
package org.sindice.siren.index.codecs.siren10;