/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Defines a mock data source that generates dummy data for use in
 * tests. The data source operates in two modes:
 * <ul>
 * <li><b>Classic:</b> used in physical plans in many unit tests.
 * The plan specifies a set of columns; data is generated by the
 * vectors themselves based on two alternating values.</li>
 * <li><b>Enhanced:</b> available for use in newer unit tests.
 * Enhances the physical plan description to allow specifying a data
 * generator class (for various types, data formats, etc.). Also
 * provides a data storage engine framework to allow using mock
 * tables in SQL queries.</li>
 * </ul>
 * <h3>Classic Mode</h3>
 * Create a scan operator that looks like the following (from
 * <tt>/src/test/resources/functions/cast/two_way_implicit_cast.json</tt>,
 * used in {@link TestReverseImplicitCast}):
 * <pre><code>
 *    graph:[
 *      {
 *        {@literal @}id:1,
 *        pop:"mock-scan",
 *        url: "http://apache.org",
 *        entries:[
 *          {records: 1, types: [
 *            {name: "col1", type: "FLOAT4", mode: "REQUIRED"},
 *            {name: "col2", type: "FLOAT8", mode: "REQUIRED"}
 *          ]}
 *        ]
 *      }, ...
 * </code></pre>
 * Here:
 * <ul>
 * <li>The <tt>pop</tt> must be <tt>mock-scan</tt>.</li>
 * <li>The <tt>url</tt> is unused.</li>
 * <li>The <tt>entries</tt> section can have one or more entries. If
 * there is more than one entry, the storage engine will enable parallel
 * scans up to the number of entries, as though each entry were a
 * different file or group.</li>
 * <li>The entry <tt>name</tt> is arbitrary, though color names seem
 * to be the traditional names used in Drill tests.</li>
 * <li>The <tt>type</tt> is one of the supported Drill
 * {@link MinorType} names.</li>
 * <li>The <tt>mode</tt> is one of the supported Drill
 * {@link DataMode} names: usually <tt>OPTIONAL</tt> or <tt>REQUIRED</tt>.</li>
 * </ul>
 * <p>
 * Recent extensions include:
 * <ul>
 * <li><tt>repeat</tt> in either the "entry" or "record" elements allows
 * repeating entries (simulating multiple blocks or row groups) and
 * repeating fields (to easily create, say, a dozen fields of some
 * type).</li>
 * <li><tt>generator</tt> in a field definition lets you specify a
 * specific data generator (see below).</li>
 * <li><tt>properties</tt> in a field definition lets you pass
 * generator-specific values to the data generator (such as a minimum
 * and maximum value).</li>
 * </ul>
 *
 * <h3>Enhanced Mode</h3>
 * Enhanced mode builds on Classic mode to add additional capabilities.
 * Enhanced mode can be used either in a physical plan or in SQL. Data
 * is randomly generated over a wide range of values and can be
 * controlled by custom generator classes. When used
 * in a physical plan, the <tt>records</tt> section has additional
 * attributes as described in {@link MockTableDef.MockColumn}:
 * <ul>
 * <li>The <tt>generator</tt> lets you specify a class to generate the
 * sample data. The class name can be either fully qualified (with its
 * package path) or a bare class name. A bare class name is assumed to
 * reside in this package. For example, to generate
 * an ISO date into a string, use <tt>DateGen</tt>.
 * Additional generators
 * can (and should) be added as the need arises.</li>
 * <li>The <tt>repeat</tt> attribute lets you create a very wide row by
 * repeating a column the specified number of times. Actual column names
 * have a numeric suffix. For example, if the base name is "blue" and
 * is repeated twice, the actual columns are "blue1" and "blue2".</li>
 * </ul>
 * When used in SQL, use the <tt>mock</tt> namespace as follows:
 * <pre><code>
 * SELECT id_i, name_s50 FROM `mock`.`employee_500`;
 * </code></pre>
 * Both the column names and table names encode information that specifies
 * what data to generate.
 * <p>
 * Columns are of the form <tt><i>name</i>_<i>type</i><i>length</i>?</tt>.
 * <ul>
 * <li>The name is anything you want ("id" and "name" in the example).</li>
 * <li>The underscore is required to separate the type from the name.</li>
 * <li>The type is one of "i" (integer), "d" (double) or "s" (string).
 * Other types can be added as needed: n (decimal number), l (long), etc.</li>
 * <li>The length is optional and is used only for string (<tt>VARCHAR</tt>)
 * columns. The default string length is 10.</li>
 * <li>Columns do not yet support nulls. When they do, the encoding will
 * be "_n<i>percent</i>" where the percent specifies the percentage of rows
 * that should contain null values in this column.</li>
 * <li>The column is known to SQL by its full name, that is "id_i" or
 * "name_s50".</li>
 * </ul>
 * <p>
 * Tables are of the form <tt><i>name</i>_<i>rows</i><i>unit</i>?</tt> where:
 * <ul>
 * <li>The name is anything you want.
 * ("employee" in the example.)</li>
 * <li>The underscore is required to separate the row count from the name.</li>
 * <li>The row count specifies the number of rows to return.</li>
 * <li>The count unit is optional: omit it, or use K (multiply the row
 * count by 1000) or M (multiply the row count by one million), case
 * insensitive.</li>
 * <li>Another field (not yet implemented) might specify the split count.</li>
 * </ul>
 * <h3>Enhanced Mode with Definition File</h3>
 * You can reference a mock data definition file directly from SQL as follows:
 * <pre><code>SELECT * FROM `mock`.`your_defn_file.json`</code></pre>
 * <h3>Data Generators</h3>
 * The classic mode uses data generators built into each vector to generate
 * the sample data. These generators use a very simple black/white alternating
 * series of two values. Simple, but limited. The enhanced mode allows custom
 * data generators. Unfortunately, this requires a separate generator class for
 * each data type. As a result, we presently support just a few key data types.
 * On the other hand, the custom generators do allow tests to specify a custom
 * generator class to generate the kind of data needed for that test.
 * <p>
 * All data generators implement the {@link FieldGen} interface, and must have
 * a no-argument constructor to allow dynamic instantiation. The mock data
 * source either picks a default generator (if no <tt>generator</tt> is provided)
 * or uses the custom generator specified in <tt>generator</tt>. Generators
 * are independent (though one could, perhaps, write generators that correlate
 * field values).
 */
package org.apache.drill.exec.store.mock;
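
/*
 * Illustrative sketch only, not part of this package: one way to decode the
 * SQL naming conventions described above (tables as name_rows[unit], columns
 * as name_type[length]). The class MockNameSketch, its methods rowCount and
 * columnType, and the returned type labels are hypothetical names chosen for
 * this example; the actual parsing code in this package may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the documented naming rules.
final class MockNameSketch {

  // Table: <name>_<rows><unit>?, where unit is K or M (case insensitive).
  private static final Pattern TABLE =
      Pattern.compile("(\\w+)_(\\d+)([kKmM])?");

  // Column: <name>_<type><length>?, where type is i, d or s.
  private static final Pattern COLUMN =
      Pattern.compile("(\\w+)_([ids])(\\d+)?");

  private MockNameSketch() { }

  // Returns the number of rows encoded in a mock table name.
  static long rowCount(String tableName) {
    Matcher m = TABLE.matcher(tableName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a mock table name: " + tableName);
    }
    long count = Long.parseLong(m.group(2));
    String unit = m.group(3);
    if (unit == null) {
      return count;
    }
    return unit.equalsIgnoreCase("k") ? count * 1_000L : count * 1_000_000L;
  }

  // Returns an illustrative type label for a mock column name.
  static String columnType(String columnName) {
    Matcher m = COLUMN.matcher(columnName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a mock column name: " + columnName);
    }
    switch (m.group(2)) {
      case "i":
        return "INT";
      case "d":
        return "DOUBLE";
      default:
        // Strings default to a length of 10 when no length is given.
        String length = m.group(3) == null ? "10" : m.group(3);
        return "VARCHAR(" + length + ")";
    }
  }
}
```

 * With the example query above, rowCount("employee_500") yields 500 and
 * columnType("name_s50") yields "VARCHAR(50)". The length suffix is ignored
 * for non-string columns, matching the rule that it applies to strings only.
 */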