/**
 * Copyright 2014 National University of Ireland, Galway.
 *
 * This file is part of the SIREn project. Project and contact information:
 *
 *  https://github.com/rdelbru/SIREn
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/**
 * Implementation of the encoding and decoding of the block-based
 * postings format for SIREn 1.0.
 *
 * <h2>Introduction</h2>
 *
 * This package contains the implementation of the encoding and decoding of the
 * SIREn 1.0 block-based postings format. This postings format is compatible
 * with the Lucene 4.0 codec. For an introduction to the Lucene 4.0 codec API,
 * see the {@link org.apache.lucene.codecs.lucene40} package documentation.
 *
 * <h2>SIREn 1.0 Postings Format</h2>
 *
 * The SIREn 1.0 postings format is organised around four files:
 * <ul>
 * <li> The .doc file contains the document identifiers and node frequencies;
 * <li> The .nod file contains the node labels and the term frequencies;
 * <li> The .pos file contains the term positions;
 * <li> The .skp file contains the skip data.
 * </ul>
 *
 * The SIREn 1.0 postings format is divided into multiple blocks. The default
 * block size is defined by
 * {@link org.sindice.siren.index.codecs.siren10.Siren10PostingsFormat#DEFAULT_POSTINGS_BLOCK_SIZE}.
 *
 * <h3>Documents and Node Frequencies</h3>
 *
 * The .doc file contains a list of document identifiers for each
 * term along with the node frequency of the term in that document. The
 * document identifiers are sorted in increasing order.
 *
 * <p>
 *
 * The file contains one postings list for each term. The term is implicit
 * and the position of the start of the postings list for a term is provided
 * by the term dictionary.
 *
 * <p>
 *
 * Each postings list is organised by block. Each block has a maximum size which
 * determines the maximum number of document identifiers it can contain. A block
 * may contain fewer document identifiers than its maximum size. For example,
 * if a postings list contains only one document identifier, the size of the
 * block will be one.
 *
 * <p>
 *
 * The block format follows the schema:
 * <pre>
 * Block              = Header, CompressedDoc, CompressedNodeFreq
 * Header             = BlockSize,
 *                      CompressedDocSize, CompressedNodeFreqSize,
 *                      FirstDocId, LastDocId,
 *                      NodeBlockPointer, PosBlockPointer
 * CompressedDoc      = [DeltaDocId]
 * CompressedNodeFreq = [NodeFreq]
 * </pre>
 *
 * <b>BlockSize</b> records the size of the block, i.e., the number of document
 * identifiers.
 * <p>
 * <b>CompressedDocSize</b> records the size (in bytes) of the compressed byte
 * array CompressedDoc.
 * <p>
 * <b>CompressedNodeFreqSize</b> records the size (in bytes) of the compressed
 * byte array CompressedNodeFreq.
 * <p>
 * <b>FirstDocId</b> and <b>LastDocId</b> record the first and last document
 * identifiers of the block. This information is used by the skip list
 * algorithm. LastDocId is encoded as a delta relative to FirstDocId.
 * <p>
 * <b>NodeBlockPointer</b> records the pointer to the node block in the .nod
 * file that is associated with this block.
 * <p>
 * <b>PosBlockPointer</b> records the pointer to the position block in the .pos
 * file that is associated with this block.
 * <p>
 * <b>CompressedDoc</b> is the compressed list of document identifiers. This
 * list is compressed using the AFOR algorithm. The document identifiers are
 * encoded as deltas. The first entry of this list is always 0, because it is
 * encoded as a delta relative to FirstDocId.
 * <p>
 * <b>CompressedNodeFreq</b> is the compressed list of node frequencies.
 * This list is compressed using the AFOR algorithm. There is one node frequency
 * per document identifier. The node frequency is encoded with a decrement of 1
 * to optimise AFOR compression: smaller values increase the chance of
 * obtaining smaller bit frames.
 *
 * <h3>Node Labels and Term Frequencies</h3>
 *
 * The .nod file contains a list of node labels for each
 * document along with the length of each node label, and the frequency of the
 * term for each node. The node labels are ordered by increasing Dewey codes.
 *
 * <p>
 *
 * The file is organised by block. Each block is synchronised with a .doc block,
 * i.e., a single block contains all the node labels for the complete set of
 * document identifiers contained in the .doc block. A block has a variable size
 * which is determined by the number of node labels associated with the
 * documents. Synchronising blocks across files simplifies encoding and decoding
 * and improves performance.
 *
 * <p>
 *
 * The block format follows the schema:
 * <pre>
 * Block                = Header, CompressedNodeLength, CompressedNode, CompressedTermFreq
 * Header               = NodeLengthBlockSize, NodeBlockSize, TermFreqBlockSize,
 *                        CompressedNodeLengthSize, CompressedNodeSize, CompressedTermFreqSize
 * CompressedNodeLength = [NodeLength]
 * CompressedNode       = [DeltaNode]
 * CompressedTermFreq   = [TermFreq]
 * </pre>
 *
 * <b>NodeLengthBlockSize</b> records the size of the block of the node lengths,
 * i.e., the number of node lengths.
 * <p>
 * <b>NodeBlockSize</b> records the size of the node block, i.e., the number of
 * integers composing the node labels.
 * <p>
 * <b>TermFreqBlockSize</b> records the size of the term frequency block, i.e.,
 * the number of term frequencies.
 * <p>
 * <b>CompressedNodeLengthSize</b> records the size (in bytes) of the compressed
 * byte array CompressedNodeLength.
 * <p>
 * <b>CompressedNodeSize</b> records the size (in bytes) of the compressed
 * byte array CompressedNode.
 * <p>
 * <b>CompressedTermFreqSize</b> records the size (in bytes) of the compressed
 * byte array CompressedTermFreq.
 * <p>
 * <b>CompressedNodeLength</b> is the compressed list of node lengths. Since
 * each node label can have a different length, the node length records the
 * number of integers that compose a node label. This list
 * is compressed using the AFOR algorithm. The node length is encoded with a
 * decrement of 1 to optimise AFOR compression: smaller values increase the
 * chance of obtaining smaller bit frames.
 * <p>
 * <b>CompressedNode</b> is the compressed list of node labels. This list is
 * compressed using the AFOR algorithm. The node labels relative to a document
 * are encoded as deltas.
 * <p>
 * <b>CompressedTermFreq</b> is the compressed list of term frequencies. This
 * list is compressed using the AFOR algorithm. There is one term frequency per
 * node label. The term frequency is encoded with a decrement of 1 to optimise
 * AFOR compression: smaller values increase the chance of obtaining smaller
 * bit frames.
 *
 * <h3>Term Positions</h3>
 *
 * The .pos file contains the list of term positions within nodes. The term
 * positions are sorted in increasing order.
 *
 * <p>
 *
 * The file is organised by block. Each block is synchronised with a .nod block,
 * i.e., a single block contains all the term positions for the complete set of
 * node labels contained in the .nod block. A block has a variable size
 * which is determined by the number of term positions associated with the
 * nodes. Synchronising blocks across files simplifies encoding and decoding
 * and improves performance.
 *
 * <p>
 *
 * The block format follows the schema:
 * <pre>
 * Block             = Header, CompressedTermPos
 * Header            = TermPosBlockSize,
 *                     CompressedTermPosSize
 * CompressedTermPos = [DeltaTermPos]
 * </pre>
 *
 * <b>TermPosBlockSize</b> records the size of the term position block, i.e.,
 * the number of term positions.
 * <p>
 * <b>CompressedTermPosSize</b> records the size (in bytes) of the compressed
 * byte array CompressedTermPos.
 * <p>
 * <b>CompressedTermPos</b> is the compressed list of term positions. This list
 * is compressed using the AFOR algorithm. There are one or more term positions
 * per node. The number of term positions per node is provided by the .nod block
 * with the term frequency information. The term positions relative to a node
 * are encoded as deltas.
 *
 * <h3>Skip Lists</h3>
 *
 * The .skp file contains the skip data. The structure of the skip table
 * is similar to that of Lucene40PostingsFormat. However, the skip data is
 * defined around the concept of block instead of document. The skip interval
 * defines the number of blocks between two consecutive skip entries. The
 * default skip interval is defined by
 * {@link org.sindice.siren.index.codecs.siren10.Siren10PostingsFormat#DEFAULT_POSTINGS_BLOCK_SIZE}.
 * Each skip entry points to the beginning of one block.
 *
 * <p>
 *
 * In contrast to the Lucene skip lists, part of the skip data is inlined within
 * the .doc file. The pointers to the .nod block and .pos block associated with
 * a .doc block are encoded in the header of that .doc block by NodeBlockPointer
 * and PosBlockPointer.
 *
 * <p>
 *
 * For more information about our skip table approach over blocks, please refer
 * to the publication <a href="http://dx.doi.org/10.1007/978-3-642-20161-5_55">
 * SkipBlock: Self-indexing for Block-Based Inverted List</a>.
 *
 * <h2>Interaction with the Postings List</h2>
 *
 * Reading the SIREn 1.0 postings format relies on lazy loading. Any
 * information (e.g., term positions, term frequencies, node labels and node
 * frequencies) that is not explicitly requested is (1) never decoded, and
 * (2) not read from disk whenever possible.
 *
 */
package org.sindice.siren.index.codecs.siren10;