/*
* Copyright © 2015 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
/**
* This package contains the WikipediaPipeline Application that demonstrates a CDAP Workflow for processing and
* analyzing Wikipedia data.
* <p>
* The app contains a CDAP Workflow that runs in either online or offline mode.
* In the offline mode, it expects Wikipedia data to be available in a Stream.
* In the online mode, it attempts to download Wikipedia data for a provided set of page titles
* (formatted as the output of the Facebook Likes API). Once Wikipedia data is available, it runs a map-only job to
* filter bad records and normalize data formatted as text/wiki-text into text/plain.
* </p>
*
* <p>
* It then runs two analyses on the plain text data in a fork:
* </p>
*
* <ol>
* <li>
* {@link co.cask.cdap.examples.wikipedia.ScalaSparkLDA} runs topic modeling on Wikipedia data using Latent
* Dirichlet Allocation (LDA).
* </li>
* <li>
* {@link co.cask.cdap.examples.wikipedia.TopNMapReduce} produces the Top N terms in the supplied Wikipedia
* data.
* </li>
* </ol>
*
* <p>
* The output of the above analyses is stored in the following datasets:
* </p>
*
* <ul>
* <li>A Table named lda, which contains the output of the Spark LDA program.</li>
* <li>A KeyValueTable named topn, which contains the output of the TopNMapReduce program.</li>
* </ul>
*
* <p>
* One of the main purposes of this application is to demonstrate how the flow of a typical data pipeline can be
* controlled using Workflow Tokens.
* </p>
*/
package co.cask.cdap.examples.wikipedia;