package water.fvec;
import water.*;
import water.parser.BufferedString;
import java.util.UUID;
/** A compression scheme, over a chunk of data - a single array of bytes.
* Chunks are mapped many-to-1 to a {@link Vec}. The <em>actual</em> vector
* header info is in the Vec - which contains info to find all the bytes of
* the distributed vector. Subclasses of this abstract class implement
* (possibly empty) compression schemes.
*
* <p>Chunks are collections of elements, and support an array-like API.
* Chunks are subsets of a Vec; while the elements in a Vec are numbered
* starting at 0, any given Chunk has some (probably non-zero) starting row,
* and a length which is smaller than the whole Vec. Chunks are limited to a
* single Java byte array in a single JVM heap, and only an int's worth of
* elements. Chunks support both the notions of a global row-number and a
* chunk-local numbering. The global row-number calls are variants of {@code
* at_abs} and {@code set_abs}. If the row is outside the current Chunk's
* range, the data will be loaded by fetching from the correct Chunk. This
* probably involves some network traffic, and if all rows are loaded then the
* entire dataset will be pulled locally (possibly triggering an OutOfMemory).
*
* <p>The chunk-local numbering supports the common {@code for} loop iterator
* pattern, using {@code at*} and {@code set} calls, and is faster than the
* global row-numbering for tight loops (because it skips some range checks):
* <pre>{@code
* for (int row = 0; row < chunk._len; row++)
* ...chunk.atd(row)...
* }</pre>
*
* <p>The array-like API allows loading and storing elements in and out of
* Chunks. When loading, values are decompressed. When storing, an attempt
* to compress back into the actual underlying Chunk subclass is made; if this
* fails the Chunk is "inflated" into a {@link NewChunk}, and the store
* completed there. Later the NewChunk will be compressed (probably into a
* different underlying Chunk subclass) and put back in the K/V store under
* the same Key - effectively replacing the original Chunk; this is done when
* {@link #close} is called, and is taken care of by the standard {@link
* MRTask} calls.
*
* <p>Chunk updates are not multi-thread safe; the caller must do correct
* synchronization. This is already handled by the Map/Reduce {MRTask)
* framework. Chunk updates are not visible cross-cluster until the {@link
* #close} is made; again this is handled by MRTask directly.
*
* <p>In addition to normal load and store operations, Chunks support the
* notion a missing element via the {@link #isNA} call, and a "next non-zero"
* notion for rapidly iterating over sparse data.
*
* <p><b>Data Types</b>
*
* <p>Chunks hold Java primitive values, timestamps, UUIDs, or Strings. All
* the Chunks in a Vec hold the same type. Most of the types are compressed.
* Integer types (boolean, byte, short, int, long) are always lossless. Float
* and Double types might lose 1 or 2 ulps in the compression. Time data is
* held as milliseconds since the Unix Epoch. UUIDs are held as 128-bit
* integers (a pair of Java longs). Strings are compressed in various obvious
* ways. Sparse data is held... sparsely; e.g. loading data in SVMLight
* format will not "blow up" the in-memory representation. Categoricals/factors
* are held as small integers, with a shared String lookup table on the side.
*
* <p>Chunks support the notion of <em>missing</em> data. Missing float and
* double data is always treated as a NaN, both if read or written. There is
* no equivalent of NaN for integer data; reading a missing integer value is a
* coding error and will be flagged. If you are working with integer data
* with missing elements, you must first check for a missing value before
* loading it:
* <pre>{@code
* if( !chk.isNA(row) ) ...chk.at8(row)....
* }</pre>
*
* <p>The same holds true for the other non-real types (timestamps, UUIDs,
* Strings, or categoricals): they must be checked for missing before being
* used.
*
* <p><b>Performance Concerns</b>
*
* <p>The standard {@code for} loop mentioned above is the fastest way to
* access data; definitely faster (and less error prone) than iterating over
* global row numbers. Iterating over a single Chunk is nearly always
* memory-bandwidth bound. Often code will iterate over a number of Chunks
* aligned together (the common use-case of looking a whole rows of a
* dataset). Again, typically such a code pattern is memory-bandwidth bound
* although the X86 will stop being able to prefetch well beyond 100 or 200
* Chunks.
*
* <p>Note that Chunk alignment is guaranteed within all the Vecs of a Frame:
* Same numbered Chunks of <em>different</em> Vecs will have the same global
* row numbering and the same length, enabling a particularly simple and
* efficient way to iterate over all rows.
*
* <p>This example computes the Euclidean distance between all the columns and
* a given point, and stores the squared distance back in the last column.
* Note that due "NaN poisoning" if any row element is missing, the entire
* distance calculated will be NaN.
* <pre>{@code
* final double[] _point; // The given point
* public void map( Chunk[] chks ) { // Map over a set of same-numbered Chunks
* for( int row=0; row < chks[0]._len; row++ ) { // For all rows
* double dist=0; // Squared distance
* for( int col=0; col < chks.length-1; col++ ) { // For all cols, except the last output col
* double d = chks[col].atd(row) - _point[col]; // Distance along this dimension
* dist += d*d; // Sum-squared-distance
* }
* chks[chks.length-1].set( row, dist ); // Store back the distance in the last col
* }
* }}</pre>
*/
public abstract class Chunk extends Iced<Chunk> implements Vec.Holder {
public Chunk() {}
private Chunk(byte [] bytes) {_mem = bytes;initFromBytes();}
/** Global starting row for this local Chunk; a read-only field. */
transient long _start = -1;
/** Global starting row for this local Chunk */
public final long start() { return _start; }
/** Global index of this chunk filled during chunk load */
transient int _cidx = -1;
/** Number of rows in this Chunk; publicly a read-only field. Odd API
* design choice: public, not-final, read-only, NO-ACCESSOR.
*
* <p>NO-ACCESSOR: This is a high-performance field, and must have a known
* zero-cost cost-model; accessors hide that cost model, and make it
* not-obvious that a loop will be properly optimized or not.
*
* <p>not-final: set in various deserializers.
* <p>Proper usage: read the field, probably in a hot loop.
* <pre>
* for( int row=0; row < chunk._len; row++ )
* ...chunk.atd(row)...
* </pre>
**/
public transient int _len;
/** Internal set of _len. Used by lots of subclasses. Not a publically visible API. */
int set_len(int len) { return _len = len; }
/** Read-only length of chunk (number of rows). */
public int len() { return _len; }
/** Normally==null, changed if chunk is written to. Not a publically readable or writable field. */
private transient Chunk _chk2;
/** Exposed for internal testing only. Not a publically visible API. */
public Chunk chk2() { return _chk2; }
/** Owning Vec; a read-only field */
transient Vec _vec;
/** Owning Vec */
public Vec vec() { return _vec; }
/** Set the owning Vec */
public void setVec(Vec vec) { _vec = vec; }
/** Set the start */
public void setStart(long start) { _start = start; }
/** The Big Data. Frequently set in the subclasses, but not otherwise a publically writable field. */
byte[] _mem;
/** Short-cut to the embedded big-data memory. Generally not useful for
* public consumption, since the data remains compressed and holding on to a
* pointer to this array defeats the user-mode spill-to-disk. */
public byte[] getBytes() { return _mem; }
public void setBytes(byte[] mem) { _mem = mem; }
final long at8_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at8((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Load a {@code double} value using absolute row numbers. Returns
* Double.NaN if value is missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #atd} since it range-checks within a chunk.
* @return double value at the given row, or NaN if the value is missing */
final double at_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return atd((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Missing value status.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #isNA} since it range-checks within a chunk.
* @return true if the value is missing */
final boolean isNA_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return isNA((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Low half of a 128-bit UUID, or throws if the value is missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #at16l} since it range-checks within a chunk.
* @return Low half of a 128-bit UUID, or throws if the value is missing. */
final long at16l_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at16l((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** High half of a 128-bit UUID, or throws if the value is missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #at16h} since it range-checks within a chunk.
* @return High half of a 128-bit UUID, or throws if the value is missing. */
final long at16h_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at16h((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** String value using absolute row numbers, or null if missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #atStr} since it range-checks within a chunk.
* @return String value using absolute row numbers, or null if missing. */
final BufferedString atStr_abs(BufferedString bStr, long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return atStr(bStr, (int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Load a {@code double} value using chunk-relative row numbers. Returns Double.NaN
* if value is missing.
* @return double value at the given row, or NaN if the value is missing */
public final double atd(int i) { return _chk2 == null ? atd_impl(i) : _chk2. atd_impl(i); }
/** Load a {@code long} value using chunk-relative row numbers. Floating
* point values are silently rounded to a long. Throws if the value is
* missing.
* @return long value at the given row, or throw if the value is missing */
public final long at8(int i) { return _chk2 == null ? at8_impl(i) : _chk2. at8_impl(i); }
/** Missing value status using chunk-relative row numbers.
*
* @return true if the value is missing */
public final boolean isNA(int i) { return _chk2 == null ?isNA_impl(i) : _chk2.isNA_impl(i); }
/** Low half of a 128-bit UUID, or throws if the value is missing.
*
* @return Low half of a 128-bit UUID, or throws if the value is missing. */
public final long at16l(int i) { return _chk2 == null ? at16l_impl(i) : _chk2.at16l_impl(i); }
/** High half of a 128-bit UUID, or throws if the value is missing.
*
* @return High half of a 128-bit UUID, or throws if the value is missing. */
public final long at16h(int i) { return _chk2 == null ? at16h_impl(i) : _chk2.at16h_impl(i); }
/** String value using chunk-relative row numbers, or null if missing.
*
* @return String value or null if missing. */
public final BufferedString atStr(BufferedString bStr, int i) { return _chk2 == null ? atStr_impl(bStr, i) : _chk2.atStr_impl(bStr, i); }
public String stringAt(int i) {
return atStr(new BufferedString(), i).toString();
}
/** Write a {@code long} using absolute row numbers. There is no way to
* write a missing value with this call. Under rare circumstances this can
* throw: if the long does not fit in a double (value is larger magnitude
* than 2^52), AND float values are stored in Vector. In this case, there
* is no common compatible data representation.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs(long i, long l) { long x = i-_start; if (0 <= x && x < _len) set((int) x, l); else _vec.set(i,l); }
/** Write a {@code double} using absolute row numbers; NaN will be treated as
* a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs(long i, double d) { long x = i-_start; if (0 <= x && x < _len) set((int) x, d); else _vec.set(i,d); }
/** Write a {@code float} using absolute row numbers; NaN will be treated as
* a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs( long i, float f) { long x = i-_start; if (0 <= x && x < _len) set((int) x, f); else _vec.set(i,f); }
/** Set the element as missing, using absolute row numbers.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void setNA_abs(long i) { long x = i-_start; if (0 <= x && x < _len) setNA((int) x); else _vec.setNA(i); }
/** Set a {@code String}, using absolute row numbers.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
public final void set_abs(long i, String str) { long x = i-_start; if (0 <= x && x < _len) set((int) x, str); else _vec.set(i,str); }
public final void set_abs(long i, UUID uuid) { long x = i-_start; if (0 <= x && x < _len) set((int) x, uuid); else _vec.set(i,uuid); }
public boolean hasFloat(){return true;}
public boolean hasNA(){return true;}
/** Replace all rows with this new chunk */
public void replaceAll( Chunk replacement ) {
assert _len == replacement._len;
_vec.preWriting(); // One-shot writing-init
_chk2 = replacement;
assert _chk2._chk2 == null; // Replacement has NOT been written into
}
public Chunk deepCopy() {
Chunk c2 = clone();
c2._vec=null;
c2._start=-1;
c2._cidx=-1;
c2._mem = _mem.clone();
c2.initFromBytes();
assert len() == c2._len;
return c2;
}
private void setWrite() {
if( _chk2 != null ) return; // Already setWrite
assert !(this instanceof NewChunk) : "Cannot direct-write into a NewChunk, only append";
setWrite(clone());
}
private void setWrite(Chunk ck) {
assert(_chk2==null);
_vec.preWriting(); // One-shot writing-init
_chk2 = ck;
assert _chk2._chk2 == null; // Clone has NOT been written into
}
/** Write a {@code long} with check-relative indexing. There is no way to
* write a missing value with this call. Under rare circumstances this can
* throw: if the long does not fit in a double (value is larger magnitude
* than 2^52), AND float values are stored in Vector. In this case, there
* is no common compatible data representation.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final long set(int idx, long l) {
setWrite();
if( _chk2.set_impl(idx,l) ) return l;
(_chk2 = inflate()).set_impl(idx,l);
return l;
}
public final double [] set(double [] d){
assert d.length == _len && _chk2 == null;
setWrite(new NewChunk(this,d));
return d;
}
/** Write a {@code double} with check-relative indexing. NaN will be treated
* as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final double set(int idx, double d) {
setWrite();
if( _chk2.set_impl(idx,d) ) return d;
(_chk2 = inflate()).set_impl(idx,d);
return d;
}
/** Write a {@code float} with check-relative indexing. NaN will be treated
* as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final float set(int idx, float f) {
setWrite();
if( _chk2.set_impl(idx,f) ) return f;
(_chk2 = inflate()).set_impl(idx,f);
return f;
}
/** Set a value as missing.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final boolean setNA(int idx) {
setWrite();
if( _chk2.setNA_impl(idx) ) return true;
(_chk2 = inflate()).setNA_impl(idx);
return true;
}
/** Write a {@code String} with check-relative indexing. {@code null} will
* be treated as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV.
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final String set(int idx, String str) {
setWrite();
if( _chk2.set_impl(idx,str) ) return str;
(_chk2 = inflate()).set_impl(idx,str);
return str;
}
public final UUID set(int idx, UUID uuid) {
setWrite();
long lo = uuid.getLeastSignificantBits();
long hi = uuid.getMostSignificantBits();
if( _chk2.set_impl(idx, lo, hi) ) return uuid;
_chk2 = inflate();
_chk2.set_impl(idx,lo, hi);
return uuid;
}
private Object setUnknown(int idx) {
setNA(idx);
return null;
}
/**
* @param idx index of the value in Chunk
* @param x new value to set
* @return x on success, or null if something went wrong
*/
public final Object setAny(int idx, Object x) {
return x instanceof String ? set(idx, (String) x) :
x instanceof Double ? set(idx, (Double)x) :
x instanceof Float ? set(idx, (Float)x) :
x instanceof Long ? set(idx, (Long)x) :
x instanceof Integer ? set(idx, ((Integer)x).longValue()) :
x instanceof UUID ? set(idx, (UUID) x) :
x instanceof java.util.Date ? set(idx, ((java.util.Date) x).getTime()) :
/* otherwise */ setUnknown(idx);
}
/** After writing we must call close() to register the bulk changes. If a
* NewChunk was needed, it will be compressed into some other kind of Chunk.
* The resulting Chunk (either a modified self, or a compressed NewChunk)
* will be written to the DKV. Only after that {@code DKV.put} completes
* will all readers of this Chunk witness the changes.
* @return the passed-in {@link Futures}, for flow-coding.
*/
public Futures close( int cidx, Futures fs ) {
if( this instanceof NewChunk ) _chk2 = this;
if( _chk2 == null ) return fs; // No change?
if( _chk2 instanceof NewChunk ) _chk2 = ((NewChunk)_chk2).new_close();
DKV.put(_vec.chunkKey(cidx),_chk2,fs,true); // Write updated chunk back into K/V
return fs;
}
/** @return Chunk index */
public int cidx() {
assert _cidx != -1 : "Chunk idx was not properly loaded!";
return _cidx;
}
public final Chunk setVolatile(double [] ds) {
Chunk res;
Value v = new Value(_vec.chunkKey(_cidx), res = new C8DVolatileChunk(ds),ds.length*8,Value.ICE);
DKV.put(v._key,v);
return res;
}
public final Chunk setVolatile(int[] vals) {
Chunk res;
Value v = new Value(_vec.chunkKey(_cidx), res = new C4VolatileChunk(vals),vals.length*4,Value.ICE);
DKV.put(v._key,v);
return res;
}
public boolean isVolatile() {return false;}
static class WrongType extends IllegalArgumentException {
private final Class<?> expected;
private final Class<?> actual;
public WrongType(Class<?> expected, Class<?> actual) {
super("Expected: " + expected + ", actual: " + actual);
this.expected = expected;
this.actual = actual;
}
}
static WrongType wrongType(Class<?> expected, Class<?> actual) { return new WrongType(expected, actual); }
/** Chunk-specific readers. Not a public API */
abstract double atd_impl(int idx);
abstract long at8_impl(int idx);
abstract boolean isNA_impl(int idx);
long at16l_impl(int idx) { throw wrongType(UUID.class, Object.class); }
long at16h_impl(int idx) { throw wrongType(UUID.class, Object.class); }
BufferedString atStr_impl(BufferedString bStr, int idx) { throw new IllegalArgumentException("Not a String"); }
/** Chunk-specific writer. Returns false if the value does not fit in the
* current compression scheme. */
abstract boolean set_impl (int idx, long l );
abstract boolean set_impl (int idx, double d );
abstract boolean set_impl (int idx, float f );
abstract boolean setNA_impl(int idx);
boolean set_impl (int idx, String str) { return false; }
boolean set_impl(int i, long lo, long hi) { return false; }
//Zero sparse methods:
/** Sparse Chunks have a significant number of zeros, and support for
* skipping over large runs of zeros in a row.
* @return true if this Chunk is sparse. */
public boolean isSparseZero() {return false;}
/** Sparse Chunks have a significant number of zeros, and support for
* skipping over large runs of zeros in a row.
* @return At least as large as the count of non-zeros, but may be significantly smaller than the {@link #_len} */
public int sparseLenZero() {return _len;}
public int nextNZ(int rid){ return rid + 1;}
/**
* Get indeces of non-zero values stored in this chunk
* @return array of chunk-relative indices of values stored in this chunk. */
public int nonzeros(int [] res) {
int k = 0;
for( int i = 0; i < _len; ++i)
if(atd(i) != 0)
res[k++] = i;
return k;
}
//NA sparse methods:
/** Sparse Chunks have a significant number of NAs, and support for
* skipping over large runs of NAs in a row.
* @return true if this Chunk is sparseNA. */
public boolean isSparseNA() {return false;}
/** Sparse Chunks have a significant number of NAs, and support for
* skipping over large runs of NAs in a row.
* @return At least as large as the count of non-NAs, but may be significantly smaller than the {@link #_len} */
public int sparseLenNA() {return _len;}
// Next non-NA. Analogous to nextNZ()
public int nextNNA(int rid){ return rid + 1;}
/** Get chunk-relative indices of values (nonnas for nasparse, all for dense)
* stored in this chunk. For dense chunks, this will contain indices of all
* the rows in this chunk.
* @return array of chunk-relative indices of values stored in this chunk. */
public int nonnas(int [] res) {
for( int i = 0; i < _len; ++i) res[i] = i;
return _len;
}
/** Report the Chunk min-value (excluding NAs), or NaN if unknown. Actual
* min can be higher than reported. Used to short-cut RollupStats for
* constant and boolean chunks. */
double min() { return Double.NaN; }
/** Report the Chunk max-value (excluding NAs), or NaN if unknown. Actual
* max can be lower than reported. Used to short-cut RollupStats for
* constant and boolean chunks. */
double max() { return Double.NaN; }
public final NewChunk inflate(){ return extractRows(new NewChunk(this), 0,_len);}
/** Return the next Chunk, or null if at end. Mostly useful for parsers or
* optimized stencil calculations that want to "roll off the end" of a
* Chunk, but in a highly optimized way. */
public Chunk nextChunk( ) { return _vec.nextChunk(this); }
/** @return String version of a Chunk, class name and range*/
@Override public String toString() { return getClass().getSimpleName() + "[" + _start + ".." + (_start + _len - 1) + "]"; }
/** In memory size in bytes of the compressed Chunk plus embedded array. */
public long byteSize() {
long s= _mem == null ? 0 : _mem.length;
s += (2+5)*8 + 12; // 2 hdr words, 5 other words, @8bytes each, plus mem array hdr
if( _chk2 != null ) s += _chk2.byteSize();
return s;
}
/** Custom serializers implemented by Chunk subclasses: the _mem field
* contains ALL the fields already. */
public final AutoBuffer write_impl(AutoBuffer bb) {return bb.putA1(_mem);}
@Override
public byte [] asBytes(){return _mem;}
@Override
public final Chunk reloadFromBytes(byte [] ary){
_mem = ary;
initFromBytes();
return this;
}
protected abstract void initFromBytes();
public final Chunk read_impl(AutoBuffer ab){
_mem = ab.getA1();
initFromBytes();
return this;
}
// /** Custom deserializers, implemented by Chunk subclasses: the _mem field
// * contains ALL the fields already. Init _start to -1, so we know we have
// * not filled in other fields. Leave _vec and _chk2 null, leave _len
// * unknown. */
// abstract public Chunk read_impl( AutoBuffer ab );
// -----------------
// Support for fixed-width format printing
// private String pformat () { return pformat0(); }
// private int pformat__len { return pformat_len0(); }
/** Fixed-width format printing support. Filled in by the subclasses. */
public byte precision() { return -1; } // Digits after the decimal, or -1 for "all"
// protected String pformat0() {
// long min = (long)_vec.min();
// if( min < 0 ) return "% "+pformat_len0()+"d";
// return "%"+pformat_len0()+"d";
// }
// protected int pformat_len0() {
// int len=0;
// long min = (long)_vec.min();
// if( min < 0 ) len++;
// long max = Math.max(Math.abs(min),Math.abs((long)_vec.max()));
// throw H2O.unimpl();
// //for( int i=1; i<DParseTask.powers10i.length; i++ )
// // if( max < DParseTask.powers10i[i] )
// // return i+len;
// //return 20;
// }
// protected int pformat_len0( double scale, int lg ) {
// double dx = Math.log10(scale);
// int x = (int)dx;
// throw H2O.unimpl();
// //if( DParseTask.pow10i(x) != scale ) throw H2O.unimpl();
// //int w=1/*blank/sign*/+lg/*compression limits digits*/+1/*dot*/+1/*e*/+1/*neg exp*/+2/*digits of exp*/;
// //return w;
// }
/** Used by the parser to help report various internal bugs. Not intended for public use. */
public final void reportBrokenCategorical(int i, int j, long l, int[] cmap, int levels) {
StringBuilder sb = new StringBuilder("Categorical renumber task, column # " + i + ": Found OOB index " + l + " (expected 0 - " + cmap.length + ", global domain has " + levels + " levels) pulled from " + getClass().getSimpleName() + "\n");
int k = 0;
for(; k < Math.min(5,_len); ++k)
sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
k = Math.max(k,j-2);
sb.append("...\n");
for(; k < Math.min(_len,j+2); ++k)
sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
sb.append("...\n");
k = Math.max(k,_len-5);
for(; k < _len; ++k)
sb.append("at8_abs[" + (k+_start) + "] = " + atd(k) + ", _chk2 = " + (_chk2 != null?_chk2.atd(k):"") + "\n");
throw new RuntimeException(sb.toString());
}
public abstract <T extends ChunkVisitor> T processRows(T v, int from, int to);
public abstract <T extends ChunkVisitor> T processRows(T v, int [] ids);
// convenience methods wrapping around visitor interface
public NewChunk extractRows(NewChunk nc, int from, int to){
return processRows(new ChunkVisitor.NewChunkVisitor(nc),from,to)._nc;
}
public NewChunk extractRows(NewChunk nc, int[] rows){
return processRows(new ChunkVisitor.NewChunkVisitor(nc),rows)._nc;
}
public NewChunk extractRows(NewChunk nc, int row){
return processRows(new ChunkVisitor.NewChunkVisitor(nc),row,row+1)._nc;
}
/**
* Dense bulk interface, fetch values from the given range
* @param vals
* @param from
* @param to
*/
public double [] getDoubles(double[] vals, int from, int to){ return getDoubles(vals,from,to, Double.NaN);}
public double [] getDoubles(double [] vals, int from, int to, double NA){
return processRows(new ChunkVisitor.DoubleAryVisitor(vals,NA),from,to).vals;
}
public int [] getIntegers(int [] vals, int from, int to, int NA){
return processRows(new ChunkVisitor.IntAryVisitor(vals,NA),from,to).vals;
}
/**
* Dense bulk interface, fetch values from the given ids
* @param vals
* @param ids
*/
public double[] getDoubles(double [] vals, int [] ids){
return processRows(new ChunkVisitor.DoubleAryVisitor(vals),ids).vals;
}
/**
* Sparse bulk interface, stream through the compressed values and extract them into dense double array.
* @param vals holds extracted values, length must be >= this.sparseLen()
* @param ids holds extracted chunk-relative row ids, length must be >= this.sparseLen()
* @return number of extracted (non-zero) elements, equal to sparseLen()
*/
public int getSparseDoubles(double[] vals, int[] ids){return getSparseDoubles(vals,ids,Double.NaN);}
public int getSparseDoubles(double [] vals, int [] ids, double NA) {
return processRows(new ChunkVisitor.SparseDoubleAryVisitor(vals,ids,isSparseNA(),NA),0,_len).sparseLen();
}
}