/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.beam.sdk.io.range;

/**
 * A {@code RangeTracker} is a thread-safe helper object for implementing dynamic work rebalancing
 * in position-based {@link org.apache.beam.sdk.io.BoundedSource.BoundedReader}
 * subclasses.
 *
 * <h3>Usage of the RangeTracker class hierarchy</h3>
 * The abstract {@code RangeTracker} interface should not be used per se; all users should use its
 * implementations directly. We declare it here to centralize the documentation, because all
 * implementations have roughly the same interface and the same properties. Currently we provide
 * one implementation: {@link OffsetRangeTracker}.
 *
 * <h3>Position-based sources</h3>
 * A position-based source is one where the source can be described by a range of positions of
 * an ordered type and the records returned by the reader can be described by positions of the
 * same type.
 *
 * <p>In case a record occupies a range of positions in the source, the most important thing about
 * the record is the position where it starts.
 *
 * <p>Defining the semantics of positions for a source is entirely up to the source class.
 * However, the chosen definitions have to obey certain properties in order to make it possible to
 * correctly split the source into parts, including dynamic splitting. Two main aspects need to be
 * defined:
 * <ul>
 *   <li>How to assign starting positions to records.
 *   <li>Which records should be read by a source with a range {@code [A, B)}.
 * </ul>
 * Moreover, reading a range must be <i>efficient</i>, i.e., the performance of reading a range
 * should not significantly depend on the location of the range. For example, reading the range
 * {@code [A, B)} should not require reading all data before {@code A}.
 *
 * <p>The sections below explain exactly what properties these definitions must satisfy, and
 * how to use a {@code RangeTracker} with a properly defined source.
 *
 * <h3>Properties of position-based sources</h3>
 * The main requirement for position-based sources is <i>associativity</i>: reading records from
 * {@code [A, B)} and records from {@code [B, C)} should give the same records as reading from
 * {@code [A, C)}, where {@code A <= B <= C}. This property ensures that no matter how a range
 * of positions is split into arbitrarily many sub-ranges, the total set of records described by
 * them stays the same.
 *
 * <p>The other important property is how the source's range relates to positions of records in
 * the source. In many sources each record can be identified by a unique starting position.
 * In this case:
 * <ul>
 *   <li>All records returned by a source {@code [A, B)} must have starting positions
 *   in this range.
 *   <li>All but the last record should end within this range. The last record may or may not
 *   extend past the end of the range.
 *   <li>Records should not overlap.
 * </ul>
 * Such sources should define "read {@code [A, B)}" as "read from the first record starting at or
 * after A, up to but not including the first record starting at or after B".
 *
 * <p>Some examples of such sources include reading lines or CSV from a text file, reading keys and
 * values from a BigTable, etc.
 *
 * <p>The concept of <i>split points</i> extends these definitions to sources where some records
 * cannot be identified by a unique starting position.
 *
 * <p>In all cases, all records returned by a source {@code [A, B)} must <i>start</i> at or after
 * {@code A}.
 *
 * <h3>Split points</h3>
 *
 * <p>Some sources may have records that are not directly addressable. For example, imagine a file
 * format consisting of a sequence of compressed blocks. Each block can be assigned an offset, but
 * records within the block cannot be directly addressed without decompressing the block. Let us
 * refer to this hypothetical format as <i>CBF (Compressed Blocks Format)</i>.
 *
 * <p>Many such formats can still satisfy the associativity property. For example, in CBF, reading
 * {@code [A, B)} can mean "read all the records in all blocks whose starting offset is in
 * {@code [A, B)}".
 *
 * <p>To support such complex formats, we introduce the notion of <i>split points</i>. We say that
 * a record is a split point if there exists a position {@code A} such that the record is the first
 * one to be returned when reading the range {@code [A, infinity)}. In CBF, the only split points
 * would be the first records in each block.
 *
 * <p>Split points allow us to define the meaning of a record's position and a source's range
 * in all cases:
 * <ul>
 *   <li>For a record that is at a split point, its position is defined to be the largest
 *   {@code A} such that reading a source with the range {@code [A, infinity)} returns this record;
 *   <li>Positions of other records are only required to be non-decreasing;
 *   <li>Reading the source {@code [A, B)} must return records starting from the first split point
 *   at or after {@code A}, up to but not including the first split point at or after {@code B}.
 *   In particular, this means that the first record returned by a source MUST always be
 *   a split point.
 *   <li>Positions of split points must be unique.
 * </ul>
 * As a result, for any decomposition of the full range of the source into position ranges, the
 * total set of records will be the full set of records in the source, and each record
 * will be read exactly once.
 *
 * <h3>Consumed positions</h3>
 * As the source is being read, and records read from it are being passed to the downstream
 * transforms in the pipeline, we say that positions in the source are being <i>consumed</i>.
 * When a reader has read a record (or promised to a caller that a record will be returned),
 * positions up to and including the record's start position are considered <i>consumed</i>.
 *
 * <p>Dynamic splitting can happen only at <i>unconsumed</i> positions. If the reader just
 * returned a record at offset 42 in a file, dynamic splitting can happen only at offset 43 or
 * beyond. Otherwise that record could be read twice: once by the current reader and once by the
 * reader of the residual range, which would start at or before offset 42.
 *
 * <h3>Example</h3>
 * The following example uses an {@link OffsetRangeTracker} to support dynamically splitting
 * a source with integer positions (offsets).
 * <pre> {@code
 *   class MyReader extends BoundedReader<Foo> {
 *     private MySource currentSource;
 *     private final OffsetRangeTracker tracker;
 *     ...
 *     MyReader(MySource source) {
 *       this.currentSource = source;
 *       // One tracker per reader, covering the offset range of the source being read.
 *       this.tracker = new OffsetRangeTracker(source.getStartOffset(), source.getEndOffset());
 *     }
 *     ...
 *     boolean start() {
 *       ... (general logic for locating the first record) ...
 *       if (!tracker.tryReturnRecordAt(true, recordStartOffset)) return false;
 *       ... (any logic that depends on the record being returned, e.g. counting returned records)
 *       return true;
 *     }
 *     boolean advance() {
 *       ... (general logic for locating the next record) ...
 *       if (!tracker.tryReturnRecordAt(isAtSplitPoint, recordStartOffset)) return false;
 *       ... (any logic that depends on the record being returned, e.g. counting returned records)
 *       return true;
 *     }
 *
 *     double getFractionConsumed() {
 *       return tracker.getFractionConsumed();
 *     }
 *   }
 * } </pre>
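 *
 * <p>The example above covers only the reading side. As a minimal sketch of the splitting side,
 * a method like the following could be added to {@code MyReader}. The method name
 * {@code splitAtOffset} and the {@code MySource(start, end)} constructor are hypothetical and
 * are not part of this interface.
 * <pre> {@code
 *     // Hypothetical: called by the runner, possibly from a thread other than the reading one.
 *     // Synchronized so that concurrent split requests see a consistent stop offset.
 *     synchronized MySource splitAtOffset(long splitOffset) {
 *       long oldStopOffset = tracker.getStopPosition();
 *       if (!tracker.trySplitAtPosition(splitOffset)) {
 *         // Rejected, e.g. the offset was already consumed or reading has not started yet.
 *         return null;
 *       }
 *       // The tracker now covers only the primary range [start, splitOffset).
 *       currentSource = new MySource(tracker.getStartPosition(), splitOffset);
 *       // The caller receives the residual range [splitOffset, oldStopOffset).
 *       return new MySource(splitOffset, oldStopOffset);
 *     }
 * } </pre>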
 *
 * <h3>Usage with different models of iteration</h3>
 * When using this class to protect a
 * {@link org.apache.beam.sdk.io.BoundedSource.BoundedReader}, follow the pattern
 * described above.
 *
 * <p>When using this class to protect iteration in the {@code hasNext()/next()}
 * model, consider the record consumed when {@code hasNext()} is about to return true, rather than
 * when {@code next()} is called, because {@code hasNext()} returning true is promising the caller
 * that {@code next()} will have an element to return - so {@link #trySplitAtPosition} must not
 * split the range in a way that would make the record promised by {@code hasNext()} belong to
 * a different range.
 *
 * <p>Also note that implementations of {@code hasNext()} need to ensure that they call
 * {@link #tryReturnRecordAt} only once per record, even if {@code hasNext()} is called
 * repeatedly, due to the requirement on uniqueness of split point positions.
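 *
 * <p>As a minimal sketch of this pattern, the hypothetical iterator below wraps an iterator of
 * (offset, record) pairs in which every record is a split point and no record is {@code null}.
 * The class and field names are illustrative assumptions, not part of this API.
 * <pre> {@code
 *   class TrackedIterator implements Iterator<String> {
 *     private final Iterator<Map.Entry<Long, String>> records;  // Hypothetical record source.
 *     private final RangeTracker<Long> tracker;
 *     private String promised;  // Record promised by hasNext() but not yet returned by next().
 *     private boolean done;
 *
 *     TrackedIterator(Iterator<Map.Entry<Long, String>> records, RangeTracker<Long> tracker) {
 *       this.records = records;
 *       this.tracker = tracker;
 *     }
 *
 *     public boolean hasNext() {
 *       if (promised != null) {
 *         return true;  // Already claimed: do not call the tracker again for this record.
 *       }
 *       if (done || !records.hasNext()) {
 *         done = true;
 *         return false;
 *       }
 *       Map.Entry<Long, String> candidate = records.next();
 *       if (!tracker.tryReturnRecordAt(true, candidate.getKey())) {
 *         done = true;  // The record starts outside the current range.
 *         return false;
 *       }
 *       promised = candidate.getValue();  // Consumed from the moment hasNext() returns true.
 *       return true;
 *     }
 *
 *     public String next() {
 *       if (!hasNext()) {
 *         throw new NoSuchElementException();
 *       }
 *       String result = promised;
 *       promised = null;
 *       return result;
 *     }
 *   }
 * } </pre>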
 *
 * @param <PositionT> Type of positions used by the source to define ranges and identify records.
 */
public interface RangeTracker<PositionT> {
  /**
   * Returns the starting position of the current range, inclusive.
   */
  PositionT getStartPosition();

  /**
   * Returns the ending position of the current range, exclusive.
   */
  PositionT getStopPosition();

  /**
   * Atomically determines whether a record at the given position can be returned and updates
   * internal state. In particular:
   * <ul>
   *   <li>If {@code isAtSplitPoint} is {@code true}, and {@code recordStart} is outside the current
   *   range, returns {@code false};
   *   <li>Otherwise, updates the last-consumed position to {@code recordStart} and returns
   *   {@code true}.
   * </ul>
   *
   * <p>This method MUST be called on all split point records. It may be called on every record.
   */
  boolean tryReturnRecordAt(boolean isAtSplitPoint, PositionT recordStart);

  /**
   * Atomically splits the current range [{@link #getStartPosition}, {@link #getStopPosition})
   * into a "primary" part [{@link #getStartPosition}, {@code splitPosition})
   * and a "residual" part [{@code splitPosition}, {@link #getStopPosition}), assuming the current
   * last-consumed position is within [{@link #getStartPosition}, {@code splitPosition})
   * (i.e., {@code splitPosition} has not been consumed yet).
   *
   * <p>On success, updates the current range to be the primary and returns {@code true}. This
   * means that all further calls on the current object will interpret their arguments relative
   * to the primary range.
   *
   * <p>If the split position has already been consumed, or if no {@link #tryReturnRecordAt} call
   * was made yet, returns {@code false}. The second condition is to prevent dynamic splitting
   * during reader start-up.
   */
  boolean trySplitAtPosition(PositionT splitPosition);

  /**
   * Returns the approximate fraction of positions in the source that have been consumed by
   * successful {@link #tryReturnRecordAt} calls, or 0.0 if no such calls have happened.
   */
  double getFractionConsumed();
}