Chunk.java example

Explorer
h2o-3-master
package water.fvec;

import water.*;
import water.parser.BufferedString;

import java.util.UUID;

/** A compression scheme, over a chunk of data - a single array of bytes.
 *  Chunks are mapped many-to-1 to a {@link Vec}.  The <em>actual</em> vector
 *  header info is in the Vec - which contains info to find all the bytes of
 *  the distributed vector.  Subclasses of this abstract class implement
 *  (possibly empty) compression schemes.
 *
 *  <p>Chunks are collections of elements, and support an array-like API.
 *  Chunks are subsets of a Vec; while the elements in a Vec are numbered
 *  starting at 0, any given Chunk has some (probably non-zero) starting row,
 *  and a length which is smaller than the whole Vec.  Chunks are limited to a
 *  single Java byte array in a single JVM heap, and only an int's worth of
 *  elements.  Chunks support both the notions of a global row-number and a
 *  chunk-local numbering.  The global row-number calls are variants of {@code
 *  at_abs} and {@code set_abs}.  If the row is outside the current Chunk's
 *  range, the data will be loaded by fetching from the correct Chunk.  This
 *  probably involves some network traffic, and if all rows are loaded then the
 *  entire dataset will be pulled locally (possibly triggering an OutOfMemory).
 *
 *  <p>The chunk-local numbering supports the common {@code for} loop iterator
 *  pattern, using {@code at*} and {@code set} calls, and is faster than the
 *  global row-numbering for tight loops (because it skips some range checks):
 *  <pre>{@code
 *  for (int row = 0; row < chunk._len; row++)
 *    ...chunk.atd(row)...
 *  }</pre>
 *
 *  <p>The array-like API allows loading and storing elements in and out of
 *  Chunks.  When loading, values are decompressed.  When storing, an attempt
 *  to compress back into the actual underlying Chunk subclass is made; if this
 *  fails the Chunk is "inflated" into a {@link NewChunk}, and the store
 *  completed there.  Later the NewChunk will be compressed (probably into a
 *  different underlying Chunk subclass) and put back in the K/V store under
 *  the same Key - effectively replacing the original Chunk; this is done when
 *  {@link #close} is called, and is taken care of by the standard {@link
 *  MRTask} calls.
 *
 *  <p>Chunk updates are not multi-thread safe; the caller must do correct
 *  synchronization.  This is already handled by the Map/Reduce {MRTask)
 *  framework.  Chunk updates are not visible cross-cluster until the {@link
 *  #close} is made; again this is handled by MRTask directly.
 *
 *  <p>In addition to normal load and store operations, Chunks support the
 *  notion a missing element via the {@link #isNA} call, and a "next non-zero"
 *  notion for rapidly iterating over sparse data.
 *
 *  <p><b>Data Types</b>
 *
 *  <p>Chunks hold Java primitive values, timestamps, UUIDs, or Strings.  All
 *  the Chunks in a Vec hold the same type.  Most of the types are compressed.
 *  Integer types (boolean, byte, short, int, long) are always lossless.  Float
 *  and Double types might lose 1 or 2 ulps in the compression.  Time data is
 *  held as milliseconds since the Unix Epoch.  UUIDs are held as 128-bit
 *  integers (a pair of Java longs).  Strings are compressed in various obvious
 *  ways.  Sparse data is held... sparsely; e.g. loading data in SVMLight
 *  format will not "blow up" the in-memory representation. Categoricals/factors
 *  are held as small integers, with a shared String lookup table on the side.
 *
 *  <p>Chunks support the notion of <em>missing</em> data.  Missing float and
 *  double data is always treated as a NaN, both if read or written.  There is
 *  no equivalent of NaN for integer data; reading a missing integer value is a
 *  coding error and will be flagged.  If you are working with integer data
 *  with missing elements, you must first check for a missing value before
 *  loading it:
 *  <pre>{@code
 *  if( !chk.isNA(row) ) ...chk.at8(row)....
 *  }</pre>
 *
 *  <p>The same holds true for the other non-real types (timestamps, UUIDs,
 *  Strings, or categoricals): they must be checked for missing before being
 *  used.
 *
 *  <p><b>Performance Concerns</b>
 *
 *  <p>The standard {@code for} loop mentioned above is the fastest way to
 *  access data; definitely faster (and less error prone) than iterating over
 *  global row numbers.  Iterating over a single Chunk is nearly always
 *  memory-bandwidth bound.  Often code will iterate over a number of Chunks
 *  aligned together (the common use-case of looking a whole rows of a
 *  dataset).  Again, typically such a code pattern is memory-bandwidth bound
 *  although the X86 will stop being able to prefetch well beyond 100 or 200
 *  Chunks.
 *
 *  <p>Note that Chunk alignment is guaranteed within all the Vecs of a Frame:
 *  Same numbered Chunks of <em>different</em> Vecs will have the same global
 *  row numbering and the same length, enabling a particularly simple and
 *  efficient way to iterate over all rows.
 *
 *  <p>This example computes the Euclidean distance between all the columns and
 *  a given point, and stores the squared distance back in the last column.
 *  Note that due "NaN poisoning" if any row element is missing, the entire
 *  distance calculated will be NaN.
 *  <pre>{@code
 *  final double[] _point;                             // The given point
 *  public void map( Chunk[] chks ) {                  // Map over a set of same-numbered Chunks
 *   for( int row=0; row < chks[0]._len; row++ ) {    // For all rows
 *     double dist=0;                                 // Squared distance
 *     for( int col=0; col < chks.length-1; col++ ) { // For all cols, except the last output col
 *       double d = chks[col].atd(row) - _point[col]; // Distance along this dimension
 *       dist += d*d;                                 // Sum-squared-distance
 *     }
 *     chks[chks.length-1].set( row, dist );          // Store back the distance in the last col
 *   }
 *  }}</pre>
 */

public abstract class Chunk extends Iced<Chunk> implements Vec.Holder {

  public Chunk() {}
  private Chunk(byte [] bytes) {_mem = bytes;initFromBytes();}





  /** Global starting row for this local Chunk; a read-only field. */
  transient long _start = -1;
  /** Global starting row for this local Chunk */
  public final long start() { return _start; }
  /** Global index of this chunk filled during chunk load */
  transient int _cidx = -1;

  /** Number of rows in this Chunk; publicly a read-only field.  Odd API
   *  design choice: public, not-final, read-only, NO-ACCESSOR.
   *
   *  <p>NO-ACCESSOR: This is a high-performance field, and must have a known
   *  zero-cost cost-model; accessors hide that cost model, and make it
   *  not-obvious that a loop will be properly optimized or not.
   *
   *  <p>not-final: set in various deserializers.
   *  <p>Proper usage: read the field, probably in a hot loop.
   *  <pre>
   *  for( int row=0; row < chunk._len; row++ )
   *    ...chunk.atd(row)...
   *  </pre>
   **/
  public transient int _len;
  /** Internal set of _len.  Used by lots of subclasses.  Not a publically visible API. */
  int set_len(int len) { return _len = len; }
  /** Read-only length of chunk (number of rows). */
  public int len() { return _len; }

  /** Normally==null, changed if chunk is written to.  Not a publically readable or writable field. */
  private transient Chunk _chk2;
  /** Exposed for internal testing only.  Not a publically visible API. */
  public Chunk chk2() { return _chk2; }

  /** Owning Vec; a read-only field */
  transient Vec _vec;
  /** Owning Vec */
  public Vec vec() { return _vec; }

  /** Set the owning Vec */
  public void setVec(Vec vec) { _vec = vec; }

  /** Set the start */
  public void setStart(long start) { _start = start; }
  /** The Big Data.  Frequently set in the subclasses, but not otherwise a publically writable field. */
  byte[] _mem;
  /** Short-cut to the embedded big-data memory.  Generally not useful for
   *  public consumption, since the data remains compressed and holding on to a
   *  pointer to this array defeats the user-mode spill-to-disk. */
  public byte[] getBytes() { return _mem; }

  public void setBytes(byte[] mem) { _mem = mem; }



  final long at8_abs(long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return at8((int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** Load a {@code double} value using absolute row numbers.  Returns
   *  Double.NaN if value is missing.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects).
   *
   *  <p>Slightly slower than {@link #atd} since it range-checks within a chunk.
   *  @return double value at the given row, or NaN if the value is missing */
  final double at_abs(long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return atd((int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** Missing value status.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects).
   *
   *  <p>Slightly slower than {@link #isNA} since it range-checks within a chunk.
   *  @return true if the value is missing */
  final boolean isNA_abs(long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return isNA((int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** Low half of a 128-bit UUID, or throws if the value is missing.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects).
   *
   *  <p>Slightly slower than {@link #at16l} since it range-checks within a chunk.
   *  @return Low half of a 128-bit UUID, or throws if the value is missing.  */
  final long at16l_abs(long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return at16l((int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** High half of a 128-bit UUID, or throws if the value is missing.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects).
   *
   *  <p>Slightly slower than {@link #at16h} since it range-checks within a chunk.
   *  @return High half of a 128-bit UUID, or throws if the value is missing.  */
  final long at16h_abs(long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return at16h((int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** String value using absolute row numbers, or null if missing.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects).
   *
   *  <p>Slightly slower than {@link #atStr} since it range-checks within a chunk.
   *  @return String value using absolute row numbers, or null if missing. */
  final BufferedString atStr_abs(BufferedString bStr, long i) {
    long x = i - (_start>0 ? _start : 0);
    if( 0 <= x && x < _len) return atStr(bStr, (int) x);
    throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
  }

  /** Load a {@code double} value using chunk-relative row numbers.  Returns Double.NaN
   *  if value is missing.
   *  @return double value at the given row, or NaN if the value is missing */
  public final double atd(int i) { return _chk2 == null ? atd_impl(i) : _chk2. atd_impl(i); }

  /** Load a {@code long} value using chunk-relative row numbers.  Floating
   *  point values are silently rounded to a long.  Throws if the value is
   *  missing.
   *  @return long value at the given row, or throw if the value is missing */
  public final long at8(int i) { return _chk2 == null ? at8_impl(i) : _chk2. at8_impl(i); }

  /** Missing value status using chunk-relative row numbers.
   *
   *  @return true if the value is missing */
  public final boolean isNA(int i) { return _chk2 == null ?isNA_impl(i) : _chk2.isNA_impl(i); }

  /** Low half of a 128-bit UUID, or throws if the value is missing.
   *
   *  @return Low half of a 128-bit UUID, or throws if the value is missing.  */
  public final long at16l(int i) { return _chk2 == null ? at16l_impl(i) : _chk2.at16l_impl(i); }

  /** High half of a 128-bit UUID, or throws if the value is missing.
   *
   *  @return High half of a 128-bit UUID, or throws if the value is missing.  */
  public final long at16h(int i) { return _chk2 == null ? at16h_impl(i) : _chk2.at16h_impl(i); }

  /** String value using chunk-relative row numbers, or null if missing.
   *
   *  @return String value or null if missing. */
  public final BufferedString atStr(BufferedString bStr, int i) { return _chk2 == null ? atStr_impl(bStr, i) : _chk2.atStr_impl(bStr, i); }
  
  public String stringAt(int i) {
    return atStr(new BufferedString(), i).toString();
  }


  /** Write a {@code long} using absolute row numbers.  There is no way to
   *  write a missing value with this call.  Under rare circumstances this can
   *  throw: if the long does not fit in a double (value is larger magnitude
   *  than 2^52), AND float values are stored in Vector.  In this case, there
   *  is no common compatible data representation.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects). */
  final void set_abs(long i, long l) { long x = i-_start; if (0 <= x && x < _len) set((int) x, l); else _vec.set(i,l); }

  /** Write a {@code double} using absolute row numbers; NaN will be treated as
   *  a missing value.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects). */
  final void set_abs(long i, double d) { long x = i-_start; if (0 <= x && x < _len) set((int) x, d); else _vec.set(i,d); }

  /** Write a {@code float} using absolute row numbers; NaN will be treated as
   *  a missing value.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects). */
  final void set_abs( long i, float  f) { long x = i-_start; if (0 <= x && x < _len) set((int) x, f); else _vec.set(i,f); }

  /** Set the element as missing, using absolute row numbers.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects). */
  final void setNA_abs(long i) { long x = i-_start; if (0 <= x && x < _len) setNA((int) x); else _vec.setNA(i); }

  /** Set a {@code String}, using absolute row numbers.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *
   *  <p>This version uses absolute element numbers, but must convert them to
   *  chunk-relative indices - requiring a load from an aliasing local var,
   *  leading to lower quality JIT'd code (similar issue to using iterator
   *  objects). */
  public final void set_abs(long i, String str) { long x = i-_start; if (0 <= x && x < _len) set((int) x, str); else _vec.set(i,str); }

  public final void set_abs(long i, UUID uuid) { long x = i-_start; if (0 <= x && x < _len) set((int) x, uuid); else _vec.set(i,uuid); }

  public boolean hasFloat(){return true;}
  public boolean hasNA(){return true;}

  /** Replace all rows with this new chunk */
  public void replaceAll( Chunk replacement ) {
    assert _len == replacement._len;
    _vec.preWriting();          // One-shot writing-init
    _chk2 = replacement;
    assert _chk2._chk2 == null; // Replacement has NOT been written into
  }

  public Chunk deepCopy() {
    Chunk c2 = clone();
    c2._vec=null;
    c2._start=-1;
    c2._cidx=-1;
    c2._mem = _mem.clone();
    c2.initFromBytes();
    assert len() == c2._len;
    return c2;
  }

  private void setWrite() {
    if( _chk2 != null ) return; // Already setWrite
    assert !(this instanceof NewChunk) : "Cannot direct-write into a NewChunk, only append";
    setWrite(clone());
  }

  private void setWrite(Chunk ck) {
    assert(_chk2==null);
    _vec.preWriting();          // One-shot writing-init
    _chk2 = ck;
    assert _chk2._chk2 == null; // Clone has NOT been written into
  }

  /** Write a {@code long} with check-relative indexing.  There is no way to
   *  write a missing value with this call.  Under rare circumstances this can
   *  throw: if the long does not fit in a double (value is larger magnitude
   *  than 2^52), AND float values are stored in Vector.  In this case, there
   *  is no common compatible data representation.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *  @return the set value
   */
  public final long set(int idx, long l) {
    setWrite();
    if( _chk2.set_impl(idx,l) ) return l;
    (_chk2 = inflate()).set_impl(idx,l);
    return l;
  }

  public final double [] set(double [] d){
    assert d.length == _len && _chk2 == null;
    setWrite(new NewChunk(this,d));
    return d;
  }
  /** Write a {@code double} with check-relative indexing.  NaN will be treated
   *  as a missing value.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *  @return the set value
   */
  public final double set(int idx, double d) {
    setWrite();
    if( _chk2.set_impl(idx,d) ) return d;
    (_chk2 = inflate()).set_impl(idx,d);
    return d;
  }

  /** Write a {@code float} with check-relative indexing.  NaN will be treated
   *  as a missing value.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *  @return the set value
   */
  public final float set(int idx, float f) {
    setWrite();
    if( _chk2.set_impl(idx,f) ) return f;
    (_chk2 = inflate()).set_impl(idx,f);
    return f;
  }

  /** Set a value as missing.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *  @return the set value
   */
  public final boolean setNA(int idx) {
    setWrite();
    if( _chk2.setNA_impl(idx) ) return true;
    (_chk2 = inflate()).setNA_impl(idx);
    return true;
  }

  /** Write a {@code String} with check-relative indexing.  {@code null} will
   *  be treated as a missing value.
   *
   *  <p>As with all the {@code set} calls, if the value written does not fit
   *  in the current compression scheme, the Chunk will be inflated into a
   *  NewChunk and the value written there.  Later, the NewChunk will be
   *  compressed (after a {@link #close} call) and written back to the DKV.
   *  i.e., there is some interesting cost if Chunk compression-types need to
   *  change.
   *  @return the set value
   */
  public final String set(int idx, String str) {
    setWrite();
    if( _chk2.set_impl(idx,str) ) return str;
    (_chk2 = inflate()).set_impl(idx,str);
    return str;
  }

  public final UUID set(int idx, UUID uuid) {
    setWrite();
    long lo = uuid.getLeastSignificantBits();
    long hi = uuid.getMostSignificantBits();

    if( _chk2.set_impl(idx, lo, hi) ) return uuid;
    _chk2 = inflate();
    _chk2.set_impl(idx,lo, hi);
    return uuid;
  }

  private Object setUnknown(int idx) {
    setNA(idx);
    return null;
  }

  /**
   * @param idx index of the value in Chunk
   * @param x new value to set
   * @return x on success, or null if something went wrong
   */
  public final Object setAny(int idx, Object x) {
    return x instanceof String         ? set(idx, (String) x) :
           x instanceof Double         ? set(idx, (Double)x) :
           x instanceof Float          ? set(idx, (Float)x) :
           x instanceof Long           ? set(idx, (Long)x) :
           x instanceof Integer        ? set(idx, ((Integer)x).longValue()) :
           x instanceof UUID           ? set(idx, (UUID) x) :
           x instanceof java.util.Date ? set(idx, ((java.util.Date) x).getTime()) :
           /* otherwise */               setUnknown(idx);
      }

  /** After writing we must call close() to register the bulk changes.  If a
   *  NewChunk was needed, it will be compressed into some other kind of Chunk.
   *  The resulting Chunk (either a modified self, or a compressed NewChunk)
   *  will be written to the DKV.  Only after that {@code DKV.put} completes
   *  will all readers of this Chunk witness the changes.
   *  @return the passed-in {@link Futures}, for flow-coding.
   */
  public Futures close( int cidx, Futures fs ) {
    if( this  instanceof NewChunk ) _chk2 = this;
    if( _chk2 == null ) return fs;          // No change?
    if( _chk2 instanceof NewChunk ) _chk2 = ((NewChunk)_chk2).new_close();

    DKV.put(_vec.chunkKey(cidx),_chk2,fs,true); // Write updated chunk back into K/V
    return fs;
  }

  /** @return Chunk index */
  public int cidx() {
    assert _cidx != -1 : "Chunk idx was not properly loaded!";
    return _cidx;
  }

  public final Chunk setVolatile(double [] ds) {
    Chunk res;
    Value v = new Value(_vec.chunkKey(_cidx), res = new C8DVolatileChunk(ds),ds.length*8,Value.ICE);
    DKV.put(v._key,v);
    return res;
  }

  public final Chunk setVolatile(int[] vals) {
    Chunk res;
    Value v = new Value(_vec.chunkKey(_cidx), res = new C4VolatileChunk(vals),vals.length*4,Value.ICE);
    DKV.put(v._key,v);
    return res;
  }

  public boolean isVolatile() {return false;}

  static class WrongType extends IllegalArgumentException {
    private final Class<?> expected;
    private final Class<?> actual;

    public WrongType(Class<?> expected, Class<?> actual) {
      super("Expected: " + expected + ", actual: " + actual);
      this.expected = expected;
      this.actual = actual;
    }
  }

  static WrongType wrongType(Class<?> expected, Class<?> actual) { return new WrongType(expected, actual); }

  /** Chunk-specific readers.  Not a public API */
  abstract double   atd_impl(int idx);
  abstract long     at8_impl(int idx);
  abstract boolean isNA_impl(int idx);
  long at16l_impl(int idx) { throw wrongType(UUID.class, Object.class); }
  long at16h_impl(int idx) { throw wrongType(UUID.class, Object.class); }
  BufferedString atStr_impl(BufferedString bStr, int idx) { throw new IllegalArgumentException("Not a String"); }

  /** Chunk-specific writer.  Returns false if the value does not fit in the
   *  current compression scheme.  */
  abstract boolean set_impl  (int idx, long l );
  abstract boolean set_impl  (int idx, double d );
  abstract boolean set_impl  (int idx, float f );
  abstract boolean setNA_impl(int idx);
  boolean set_impl (int idx, String str) { return false; }
  boolean set_impl(int i, long lo, long hi) { return false; }
  //Zero sparse methods:

  /** Sparse Chunks have a significant number of zeros, and support for
   *  skipping over large runs of zeros in a row.
   *  @return true if this Chunk is sparse.  */
  public boolean isSparseZero() {return false;}

  /** Sparse Chunks have a significant number of zeros, and support for
   *  skipping over large runs of zeros in a row.
   *  @return At least as large as the count of non-zeros, but may be significantly smaller than the {@link #_len} */
  public int sparseLenZero() {return _len;}

  public int nextNZ(int rid){ return rid + 1;}

  /**
   *  Get indeces of non-zero values stored in this chunk
   *  @return array of chunk-relative indices of values stored in this chunk. */
  public int nonzeros(int [] res) {
    int k = 0;
    for( int i = 0; i < _len; ++i)
      if(atd(i) != 0)
        res[k++] = i;
    return k;
  }

  //NA sparse methods:
  
  /** Sparse Chunks have a significant number of NAs, and support for
   *  skipping over large runs of NAs in a row.
   *  @return true if this Chunk is sparseNA.  */
  public boolean isSparseNA() {return false;}

  /** Sparse Chunks have a significant number of NAs, and support for
   *  skipping over large runs of NAs in a row.
   *  @return At least as large as the count of non-NAs, but may be significantly smaller than the {@link #_len} */
  public int sparseLenNA() {return _len;}

  // Next non-NA. Analogous to nextNZ()
  public int nextNNA(int rid){ return rid + 1;}

  /** Get chunk-relative indices of values (nonnas for nasparse, all for dense)
   *  stored in this chunk.  For dense chunks, this will contain indices of all
   *  the rows in this chunk.
   *  @return array of chunk-relative indices of values stored in this chunk. */
  public int nonnas(int [] res) {
    for( int i = 0; i < _len; ++i) res[i] = i;
    return _len;
  }

  /** Report the Chunk min-value (excluding NAs), or NaN if unknown.  Actual
   *  min can be higher than reported.  Used to short-cut RollupStats for
   *  constant and boolean chunks. */
  double min() { return Double.NaN; }
  /** Report the Chunk max-value (excluding NAs), or NaN if unknown.  Actual
   *  max can be lower than reported.  Used to short-cut RollupStats for
   *  constant and boolean chunks. */
  double max() { return Double.NaN; }


  public final NewChunk inflate(){ return extractRows(new NewChunk(this), 0,_len);}


  /** Return the next Chunk, or null if at end.  Mostly useful for parsers or
   *  optimized stencil calculations that want to "roll off the end" of a
   *  Chunk, but in a highly optimized way. */
  public Chunk nextChunk( ) { return _vec.nextChunk(this); }

  /** @return String version of a Chunk, class name and range*/
  @Override public String toString() { return getClass().getSimpleName() + "[" + _start + ".." + (_start + _len - 1) + "]"; }

  /** In memory size in bytes of the compressed Chunk plus embedded array. */
  public long byteSize() {
    long s= _mem == null ? 0 : _mem.length;
    s += (2+5)*8 + 12; // 2 hdr words, 5 other words, @8bytes each, plus mem array hdr
    if( _chk2 != null ) s += _chk2.byteSize();
    return s;
  }

  /** Custom serializers implemented by Chunk subclasses: the _mem field
   *  contains ALL the fields already. */
  public final  AutoBuffer write_impl(AutoBuffer bb) {return bb.putA1(_mem);}

  @Override
  public byte [] asBytes(){return _mem;}

  @Override
  public final Chunk reloadFromBytes(byte [] ary){
    _mem = ary;
    initFromBytes();
    return this;
  }

  protected abstract void initFromBytes();
  public final Chunk read_impl(AutoBuffer ab){
    _mem = ab.getA1();
    initFromBytes();
    return this;
  }

//  /** Custom deserializers, implemented by Chunk subclasses: the _mem field
//   *  contains ALL the fields already.  Init _start to -1, so we know we have
//   *  not filled in other fields.  Leave _vec and _chk2 null, leave _len
//   *  unknown. */
//  abstract public Chunk read_impl( AutoBuffer ab );

  // -----------------
  // Support for fixed-width format printing
//  private String pformat () { return pformat0(); }
//  private int pformat__len { return pformat_len0(); }

  /** Fixed-width format printing support.  Filled in by the subclasses. */
  public byte precision() { return -1; } // Digits after the decimal, or -1 for "all"

//  protected String pformat0() {
//    long min = (long)_vec.min();
//    if( min < 0 ) return "% "+pformat_len0()+"d";
//    return "%"+pformat_len0()+"d";
//  }
//  protected int pformat_len0() {
//    int len=0;
//    long min = (long)_vec.min();
//    if( min < 0 ) len++;
//    long max = Math.max(Math.abs(min),Math.abs((long)_vec.max()));
//    throw H2O.unimpl();
//    //for( int i=1; i<DParseTask.powers10i.length; i++ )
//    //  if( max < DParseTask.powers10i[i] )
//    //    return i+len;
//    //return 20;
//  }
//  protected int pformat_len0( double scale, int lg ) {
//    double dx = Math.log10(scale);
//    int x = (int)dx;
//    throw H2O.unimpl();
//    //if( DParseTask.pow10i(x) != scale ) throw H2O.unimpl();
//    //int w=1/*blank/sign*/+lg/*compression limits digits*/+1/*dot*/+1/*e*/+1/*neg exp*/+2/*digits of exp*/;
//    //return w;
//  }

  /** Used by the parser to help report various internal bugs.  Not intended for public use. */
  public final void reportBrokenCategorical(int i, int j, long l, int[] cmap, int levels) {
    StringBuilder sb = new StringBuilder("Categorical renumber task, column # " + i + ": Found OOB index " + l + " (expected 0 - " + cmap.length + ", global domain has " + levels + " levels) pulled from " + getClass().getSimpleName() +  "\n");
    int k = 0;
    for(; k < Math.min(5,_len); ++k)
      sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
    k = Math.max(k,j-2);
    sb.append("...\n");
    for(; k < Math.min(_len,j+2); ++k)
      sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
    sb.append("...\n");
    k = Math.max(k,_len-5);
    for(; k < _len; ++k)
      sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
    throw new RuntimeException(sb.toString());
  }

  public abstract <T extends ChunkVisitor> T processRows(T v, int from, int to);
  public abstract <T extends ChunkVisitor> T processRows(T v, int [] ids);

  // convenience methods wrapping around visitor interface
  public NewChunk extractRows(NewChunk nc, int from, int to){
    return processRows(new ChunkVisitor.NewChunkVisitor(nc),from,to)._nc;
  }
  public NewChunk extractRows(NewChunk nc, int[] rows){
    return processRows(new ChunkVisitor.NewChunkVisitor(nc),rows)._nc;
  }
  public NewChunk extractRows(NewChunk nc, int row){
    return processRows(new ChunkVisitor.NewChunkVisitor(nc),row,row+1)._nc;
  }

  /**
   * Dense bulk interface, fetch values from the given range
   * @param vals
   * @param from
   * @param to
   */
  public double [] getDoubles(double[] vals, int from, int to){ return getDoubles(vals,from,to, Double.NaN);}
  public double [] getDoubles(double [] vals, int from, int to, double NA){
    return processRows(new ChunkVisitor.DoubleAryVisitor(vals,NA),from,to).vals;
  }
  public int [] getIntegers(int [] vals, int from, int to, int NA){
    return processRows(new ChunkVisitor.IntAryVisitor(vals,NA),from,to).vals;
  }
  /**
   * Dense bulk interface, fetch values from the given ids
   * @param vals
   * @param ids
   */
  public double[] getDoubles(double [] vals, int [] ids){
    return processRows(new ChunkVisitor.DoubleAryVisitor(vals),ids).vals;
  }
  /**
   * Sparse bulk interface, stream through the compressed values and extract them into dense double array.
   * @param vals holds extracted values, length must be >= this.sparseLen()
   * @param ids holds extracted chunk-relative row ids, length must be >= this.sparseLen()
   * @return number of extracted (non-zero) elements, equal to sparseLen()
   */
  public int getSparseDoubles(double[] vals, int[] ids){return getSparseDoubles(vals,ids,Double.NaN);}
  public int getSparseDoubles(double [] vals, int [] ids, double NA) {
    return processRows(new ChunkVisitor.SparseDoubleAryVisitor(vals,ids,isSparseNA(),NA),0,_len).sparseLen();
  }
}