/**
 * Patterns are blocks of text between two entities (or other significant annotations).
 *
 * They are used in the context of relation extraction to help learn the interaction words used to
 * determine whether a relationship may be present in a sentence.
 *
 * The process of converting patterns to interaction words is as follows:
 *
 * <ol>
 * <li>Extract patterns into a database
 * <li>Learn interaction words based on the saved patterns
 * <li>Generalise interaction words to cover similar concepts
 * <li>Save the interaction words, together with other information, into a database for use in
 * pipelines.
 * </ol>
 *
 * As an alternative to the first three steps, the user / corpus owner might already have an idea
 * of the interaction terms that they wish to extract. For example, in a genealogy application the
 * user might only want to extract interaction words which establish a family connection, such as
 * "brother", "sister", "uncle of". In that case the user may wish to manually craft the
 * interaction gazetteer and upload it via the last step.
 *
 * The interaction words can then be used in a pipeline as the basis for extracting relationships.
 *
 * <h3>Pattern extraction</h3>
 *
 * <ul>
 * <li>Determine where entities are in the document, using a standard entity extraction pipeline.
 * <li>Extract all the patterns between the entities.
 * <li>Save the patterns to the database.
 * </ul>
 *
 * You should ensure the output of the entity extraction is as good as possible: any entities not
 * found are candidates for pattern / interaction words!
 *
 * The PatternExtractor annotator ultimately has a simple operation. It extracts the words between
 * two entities within each sentence. These are the patterns. The patterns are filtered such that:
 * <ul>
 * <li>Patterns are not negative (that is, they do not contain "no", "not" or "neither"). Thus
 * "John does not live in London" is not a pattern.
 * <li>Any other entities are removed.
 * Thus Jane is removed when considering the pattern between John and London in the sentence
 * "John and Jane live in London". The pattern becomes "and live in".
 * <li>The words in the pattern are not stop words. Thus from "and live in" we have a pattern
 * based on "live". If we had the sentence "John frequently visits parts of London" then our
 * pattern, without stop words, would be "frequently visits parts" (though the stop words list may
 * be configured differently).
 * </ul>
 *
 * The patterns are saved as the Pattern type.
 *
 * Note that language features (sentence splitting, tokenisation, POS tagging) need to have been
 * generated before using the PatternExtractor. For example, use the OpenNlp annotator - this is
 * likely part of the pipeline anyhow.
 *
 * A MongoPatternSaver consumer is available in Baleen which will output the patterns into a Mongo
 * database. As usual the Mongo database is provided as a shared resource and the output collection
 * can be tailored.
 *
 * Thus the pipeline is:
 *
 * <pre>
 * annotators:
 * - # Standard Baleen entity pipeline
 * - patterns.PatternExtractor
 *
 * consumers:
 * - MongoPatternSaver
 * </pre>
 *
 * <h3>Identifying interaction words</h3>
 *
 * The process by which interaction words are identified is run as a Baleen job. It does not form
 * part of a pipeline because it requires the output of multiple documents to have been passed
 * through the pattern extraction pipeline discussed above.
 *
 * This is based on the algorithm within [UBRME]. Other implementations are possible, including,
 * for example, merely saving all verbs / nouns which occur more than a specific threshold number
 * of times in the patterns.
 *
 * We do not go into the details of the algorithm here, but it effectively looks to find clusters
 * of the extracted patterns. It then looks within those clusters for common words, which it saves
 * as interaction words.
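 * The simpler threshold-based alternative mentioned above can be sketched in a few lines. This is
 * a hypothetical illustration only, not the Baleen implementation, and the class and method names
 * are invented: it keeps any word that occurs in at least a threshold number of distinct patterns.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch only: NOT the Baleen implementation; names are invented.
// Keeps any word that occurs in at least `threshold` distinct patterns.
public class ThresholdInteractionWords {

    public static Set<String> interactionWords(List<List<String>> patterns, int threshold) {
        // Count each word at most once per pattern.
        Map<String, Long> counts = patterns.stream()
                .flatMap(pattern -> pattern.stream().distinct())
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Keep only the words meeting the threshold, in a stable sorted order.
        return counts.entrySet().stream()
                .filter(entry -> entry.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(TreeSet::new));
    }
}
```

 * A real implementation would additionally restrict the candidates to particular parts of speech
 * (verbs / nouns), which this sketch omits.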
 *
 * <pre>
 * mongo:
 *   db: baleen
 *   host: localhost
 *
 * job:
 *   tasks:
 *   - class: interactions.IdentifyInteractions
 *     filename: output/interactions.csv
 * </pre>
 *
 * The interactions.csv file will be written. It should be reviewed by a subject matter expert,
 * with rows removed where relations are invalid, or added (with different type constraints).
 *
 * <h3>Enhancing interactions</h3>
 *
 * The interaction words output from the above are those seen in the documents. Optionally, we can
 * enhance and generalise these with another job.
 *
 * The job will read the interactions.csv and add alternative words. For example the verb "report"
 * might have alternatives "communicate, broadcast".
 *
 * <pre>
 * job:
 *   tasks:
 *   - class: interactions.EnhanceInteractions
 *     input: output/interactions.csv
 *     output: output/interactions-enhanced.csv
 * </pre>
 *
 * Again this is an opportunity to review the CSV file in order to ensure that the alternatives
 * are correct.
 *
 * <h3>Upload interactions</h3>
 *
 * Finally we output the interactions (manually created, basic, or enhanced) to Mongo. This reads
 * the CSV file and saves the data to Mongo. It is saved into two collections: the first,
 * interactions (which is in the Baleen Mongo gazetteer format), and the second, relationTypes
 * (which includes information about interaction word types, which are used by relation
 * constraints).
 *
 * <pre>
 * mongo:
 *   db: baleen
 *   host: localhost
 *
 * job:
 *   tasks:
 *   - class: interactions.UploadInteractionsToMongo
 *     input: output/interactions-enhanced.csv
 * </pre>
 */
//Dstl (c) Crown Copyright 2017
package uk.gov.dstl.baleen.annotators.patterns;
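The pattern filtering described under "Pattern extraction" can be sketched as follows. This is an
illustrative sketch only, not the Baleen PatternExtractor, and all names here are invented: given
the tokens of a sentence, the (exclusive) end index of the first entity and the start index of the
second, it extracts the words between them, discarding negative patterns entirely and dropping
other entities and stop words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch only: NOT the Baleen PatternExtractor; names are invented.
public class PatternSketch {

    private static final Set<String> NEGATIVES = Set.of("no", "not", "neither");

    public static Optional<List<String>> pattern(List<String> tokens,
                                                 int firstEntityEnd,
                                                 int secondEntityStart,
                                                 Set<Integer> otherEntityTokenIndexes,
                                                 Set<String> stopWords) {
        List<String> words = new ArrayList<>();
        for (int i = firstEntityEnd; i < secondEntityStart; i++) {
            String word = tokens.get(i).toLowerCase(Locale.ROOT);
            if (NEGATIVES.contains(word)) {
                // Negative patterns are discarded entirely.
                return Optional.empty();
            }
            if (otherEntityTokenIndexes.contains(i) || stopWords.contains(word)) {
                // Other entities and stop words are removed from the pattern.
                continue;
            }
            words.add(word);
        }
        return Optional.of(words);
    }
}
```

For "John and Jane live in London" with John and London as the entity pair, Jane marked as another
entity and "and" / "in" as stop words, the sketch yields the pattern ["live"], mirroring the
worked example in the documentation above.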