/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Provides a light-weight, simplified set of column readers and writers that
 * can be plugged into a variety of row-level readers and writers. The classes
 * and interfaces here form a framework for accessing rows and columns, but do
 * not provide the code to build accessors for a given row batch. This code is
 * meant to be generic, but the first (and, thus far, only) use is with the test
 * framework for the java-exec project. That one implementation is specific to
 * unit tests, but the accessor framework could easily be used for other
 * purposes as well.
 * <p>
 * Drill provides a set of column readers and writers. Compared to those, this
 * set:
 * <ul>
 * <li>Works with all Drill data types. The other set works only with repeated
 * and nullable types.</li>
 * <li>Is a generic interface. The other set is bound tightly to the
 * {@link ScanBatch} class.</li>
 * <li>Uses generic types such as <tt>getInt()</tt> for most numeric types. The
 * other set has accessors specific to each of the ~30 data types which Drill
 * supports.</li>
 * </ul>
 * The key difference is that this set is designed for developer ease-of-use, a
 * primary requirement for unit tests.
 * The other set is designed to be used in
 * machine-generated or write-once code and so can be much more complex.
 * <p>
 * That is, the accessors here are optimized for test code: they gain
 * convenience at the cost of a slight decrease in speed. The performance hit
 * comes from the extra level of indirection that hides the complex,
 * type-specific code otherwise required.
 * <p>
 * {@link ColumnReader} and {@link ColumnWriter} are the core abstractions: they
 * provide access to the myriad Drill column types via a simplified, uniform
 * API. {@link TupleReader} and {@link TupleWriter} provide a simplified API to
 * rows or maps (both of which are tuples in Drill).
 * {@link AccessorUtilities} provides a number of data conversion tools.
 * <p>
 * Overview of the code structure:
 * <dl>
 * <dt>TupleWriter, TupleReader</dt>
 * <dd>In relational terms, a tuple is an ordered collection of values, where
 * the meaning of the order is provided by a schema (usually a name/type pair
 * per column). Drill rows and maps are both tuples. The tuple classes provide
 * the means to work with a tuple: get the schema, and get a column by name or
 * by position. Drill code normally references columns by name, but doing so is
 * slower than access by position (index). To provide efficient code, the tuple
 * classes assume that the implementation imposes a column ordering which can
 * be exposed via the indexes.</dd>
 * <dt>ColumnAccessor</dt>
 * <dd>A generic base class for column readers and writers that provides the
 * column data type.</dd>
 * <dt>ColumnWriter, ColumnReader</dt>
 * <dd>A uniform interface implemented for each column type ("major type" in
 * Drill terminology). For scalar types, nullable (Drill optional) and
 * non-nullable (Drill required) fields use the same interface. Arrays (Drill
 * repeated) are special.
 * To handle the array aspect, even array fields use the
 * same interface, but the <tt>getArray</tt> method returns another layer of
 * accessor (writer or reader) specific to arrays.
 * <p>
 * Both the column reader and writer use a reduced set of data types to access
 * values. Drill provides about 38 different types, but they can be mapped to a
 * smaller set for programmatic access. For example, the signed byte, short,
 * and int values and the unsigned 8-bit and 16-bit values can all be mapped to
 * ints for get/set. The result is a much simpler set of get/set methods
 * compared to the underlying set of vector types.</dd>
 * <dt>ArrayWriter, ArrayReader</dt>
 * <dd>The interface for the array accessors as described above. Of particular
 * note is the difference in the form of the methods. The writer has only a
 * <tt>setInt()</tt> method, with no index. The methods assume write-only,
 * write-once semantics: each set adds a new value. The reader, by contrast,
 * has a <tt>getInt(int index)</tt> method: read access is random.</dd>
 * <dt>ScalarWriter</dt>
 * <dd>Because of the form of the array writer, both the array writer and
 * column writer have the same method signatures. To avoid repeating these
 * methods, they are factored out into the common <tt>ScalarWriter</tt>
 * interface.</dd>
 * <dt>ColumnAccessors (templates)</dt>
 * <dd>The FreeMarker-based template used to generate the actual accessor
 * implementations.</dd>
 * <dt>ColumnAccessors (accessors)</dt>
 * <dd>The generated accessors: one for each combination of write/read, data
 * (minor) type and cardinality (data model).</dd>
 * <dt>RowIndex</dt>
 * <dd>This nested class binds the accessor to the current row position for the
 * entire record batch. That is, you don't ask for the value of column a for row
 * 5, then the value of column b for row 5, and so on, as with the "raw"
 * vectors. Instead, the implementation sets the row position (with, say, an
 * iterator). Then, all columns implicitly return values for the current row.
 * <p>
 * Different implementations of the row index handle the case of no selection
 * vector, a selection vector 2, or a selection vector 4.</dd>
 * <dt>VectorAccessor</dt>
 * <dd>The readers can work with single batches or "hyper" batches. A hyper
 * batch occurs in operators such as sort, where an operator references a
 * collection of batches as if they were one huge batch. In this case, each
 * column consists of a "stack" of vectors. The vector accessor picks out one
 * vector from the stack for each row. Vector accessors are used only for hyper
 * batches; single batches work directly with the corresponding vector.
 * <p>
 * You can think of the (row index + vector accessor, column index) as forming a
 * coordinate pair. The row index provides the y index (vertical position along
 * the rows). The vector accessor maps the row position to a vector when needed.
 * The column index picks out the x coordinate (horizontal position along the
 * columns).</dd>
 * </dl>
 */
package org.apache.drill.exec.vector.accessor;