/**
 * Copyright 2014 National University of Ireland, Galway.
 *
 * This file is part of the SIREn project. Project and contact information:
 *
 *  https://github.com/rdelbru/SIREn
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/**
 * Abstraction over the encoding and decoding of the block-based posting
 * format.
 *
 * <h2>Introduction</h2>
 *
 * This package contains the abstract API for encoding
 * ({@link org.sindice.siren.index.codecs.block.BlockIndexOutput}) and decoding
 * ({@link org.sindice.siren.index.codecs.block.BlockIndexInput}) the
 * block-based posting format. It also includes algorithms for compressing
 * and decompressing blocks of bytes.
 *
 * <h2>Block-Based Posting Format</h2>
 *
 * The block-based posting format encodes a posting list as a sequence of
 * blocks. A block is composed of a header, i.e., metadata, and some content,
 * i.e., a byte array. While the content of a block can be anything, it
 * usually contains a sequence of integers. In certain cases it can be composed
 * of multiple blocks of integers, for example to create interleaved blocks.
 * The size of a block can be either variable or fixed.
 *
 * <h3>Block Compression</h3>
 *
 * A {@link org.sindice.siren.index.codecs.block.BlockCompressor} compresses a
 * list of integers into a byte array in one batch.
 * A {@link org.sindice.siren.index.codecs.block.BlockIndexOutput} must ensure
 * that the given byte array is large enough to hold the compressed data.
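 *
 * <p>
 * The batch contract above can be illustrated with a minimal sketch, assuming
 * a plain 7-bit variable-byte encoding of 32-bit integers. The class and
 * method names below are hypothetical and are not part of the SIREn API; the
 * only point is that the caller allocates the output array for the worst case
 * (at most 5 bytes per value in this encoding) before compressing a batch.
 *

```java
// Hypothetical sketch; names are illustrative and not the SIREn API.
final class VIntBlockSketch {

  // Worst case for one 32-bit value in 7-bit variable-byte encoding: 5 bytes.
  static int worstCaseSize(int valueCount) {
    return 5 * valueCount;
  }

  // Compresses the first `count` values into `out` in one batch and returns
  // the number of bytes written. The caller must have sized `out` beforehand.
  static int compress(int[] values, int count, byte[] out) {
    int pos = 0;
    for (int i = 0; i < count; i++) {
      int v = values[i];
      // Emit 7 bits at a time, low bits first; the high bit marks "more follows".
      while ((v & ~0x7F) != 0) {
        out[pos++] = (byte) ((v & 0x7F) | 0x80);
        v >>>= 7;
      }
      out[pos++] = (byte) v;
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] values = {1, 127, 128, 300, 1 << 20};
    // The caller, not the compressor, guarantees the buffer is large enough.
    byte[] buffer = new byte[worstCaseSize(values.length)];
    int written = compress(values, values.length, buffer);
    System.out.println(written + " of " + buffer.length + " bytes used");
  }
}
```

 * In this sketch small values take one byte each, so the worst-case buffer is
 * only partly filled.
 *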
 * The method {@link org.sindice.siren.index.codecs.block.BlockCompressor#maxCompressedSize(int)}
 * can be used to estimate the maximum size of a compressed block of values.
 *
 * <p>
 *
 * A {@link org.sindice.siren.index.codecs.block.BlockDecompressor} decompresses
 * a compressed byte array into a list of integers in one batch.
 * A {@link org.sindice.siren.index.codecs.block.BlockIndexInput} must ensure
 * that the given integer array is large enough to hold the uncompressed data.
 *
 * <p>
 *
 * Two block compression algorithms are implemented:
 * <ul>
 * <li> Variable Integer: encodes integers using the variable integer encoding
 * technique. It is very simple and reasonably efficient, but its compression
 * ratio is not very good, especially for node labels and term positions.
 * <li> Adaptive Frame Of Reference: encodes frames of integers using
 * highly-optimised routines. Its implementation is relatively complex but it
 * provides the best balance between compression ratio, compression speed and
 * decompression speed.
 * </ul>
 *
 * <h3>Concurrent Access</h3>
 *
 * During the creation of a new index segment, terms are processed sequentially.
 * This ensures that:
 * <ul>
 * <li> there is no concurrent access to the same
 * {@link org.sindice.siren.index.codecs.block.BlockIndexOutput} instance; and
 * <li> there is no concurrent encoding of multiple blocks.
 * </ul>
 *
 * During query processing, multiple terms are processed in parallel. The same
 * {@link org.sindice.siren.index.codecs.block.BlockIndexInput} will be used
 * to decode multiple postings lists. Safe concurrent access to the index files
 * is ensured only if a different
 * {@link org.sindice.siren.index.codecs.block.BlockIndexInput.BlockReader}
 * is used for each postings list.
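 *
 * <p>
 * A minimal sketch of the one-reader-per-postings-list rule, assuming a
 * hypothetical input backed by a {@code java.nio.ByteBuffer} rather than the
 * actual SIREn or Lucene classes: every reader works on its own duplicate of
 * the shared bytes, so advancing one reader never moves the read position of
 * another.
 *

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a shared block input; not the SIREn or Lucene API.
final class BlockReaderSketch {

  // Read-only view of the index bytes shared by all readers.
  private final ByteBuffer shared;

  BlockReaderSketch(byte[] data) {
    this.shared = ByteBuffer.wrap(data).asReadOnlyBuffer();
  }

  // Each call returns a reader over a duplicate: same bytes, own position.
  Reader newReader(int offset) {
    ByteBuffer view = shared.duplicate();
    view.position(offset);
    return new Reader(view);
  }

  static final class Reader {
    private final ByteBuffer in;

    Reader(ByteBuffer in) {
      this.in = in;
    }

    // Reads the next unsigned byte and advances only this reader.
    int next() {
      return in.get() & 0xFF;
    }
  }

  public static void main(String[] args) {
    BlockReaderSketch input =
        new BlockReaderSketch(new byte[] {10, 11, 12, 20, 21, 22});
    Reader a = input.newReader(0); // first postings list
    Reader b = input.newReader(3); // second postings list
    // Interleaved reads do not disturb each other.
    System.out.println(a.next() + " " + b.next() + " " + a.next() + " " + b.next());
  }
}
```

 * Sharing a single position between the two postings lists would instead make
 * each read depend on the other reader's progress.
 *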
 * The method
 * {@link org.sindice.siren.index.codecs.block.BlockIndexInput#getBlockReader()}
 * provides a
 * {@link org.sindice.siren.index.codecs.block.BlockIndexInput.BlockReader}
 * which contains a clone of the underlying
 * {@link org.apache.lucene.store.IndexInput}.
 *
 */
package org.sindice.siren.index.codecs.block;