/*
* Copyright © 2015 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
/**
* <p>
* This package contains the DataCleansing Application that filters records that do not match a given schema.
*
* This DataCleansing Application consists of these programs and datasets:
* </p>
*
* <ol>
* <li>
* {@link co.cask.cdap.examples.datacleansing.DataCleansingService} that allows writing to the
* rawRecords PartitionedFileSets.
* </li>
*
* <li>
* A MapReduce named {@link co.cask.cdap.examples.datacleansing.DataCleansingMapReduce} that reads the
* files from a PartitionedFileSet, applies a filter to remove "unclean" records, based upon a particular
* schema, and outputs the records to an output PartitionedFileSet. Each time the job runs, it processes
* only the files of the newly created partitions.
* </li>
*
* <li>
* Three Datasets used by the MapReduce and Service:
* <ul>
* <li>A PartitionedFileSet named rawRecords which serves as the input data for DataCleansingMapReduce.</li>
* <li>A PartitionedFileSet named cleanRecords which serves as output for DataCleansingMapReduce.</li>
* <li>A KeyValueTable named consumingState which keeps track of the state of the DataCleansingMapReduce
* so that each time it is run, it only processes files of newly created Partitions.</li>
* </ul>
* </li>
* </ol>
*/
package co.cask.cdap.examples.datacleansing;