/**
* Patterns are blocks of text between two entities (or other significant annotations).
*
* They are used in the context of relation extraction to help learn the interaction words used to
* determine if a relationship may be present in the sentence.
*
* The process for converting patterns to interaction words is as follows:
*
* <ol>
* <li>Extract patterns into a database
* <li>Learn interaction words based on the saved patterns
* <li>Generalise interaction words to cover similar concepts
* <li>Save the interaction words together with other information into a database for use in
* pipelines.
* </ol>
*
* As an alternative to the first three steps, the user / corpus owner might already have an idea
* of the interaction terms that they wish to extract. For example, in a genealogy application the
* user might only want to extract interaction words which establish a family connection, such as
* "brother", "sister", "uncle of". In that case the user may wish to manually craft the
* interaction gazetteer and upload it via the last step.
*
* The interaction words can then be used in a pipeline, and as the basis to extract relationships.
*
* <h3>Pattern extraction</h3>
*
* <ul>
* <li>Determine where entities are in the document, using a standard entity extraction pipeline.
* <li>Extract all the patterns between the entities
* <li>Save the patterns to the database.
* </ul>
*
* You must ensure the output of the entity extraction is as good as possible. Any entities that
* are not found are treated as ordinary words and so become candidates for pattern / interaction
* words!
*
* The PatternExtractor annotator ultimately has a simple operation. For each sentence, it extracts
* the words between each pair of entities. These are the patterns. The patterns are filtered such
* that:
* <ul>
* <li>Patterns are not negative (that is, they do not contain words such as "no", "not",
* "neither"). Thus "John does not live in London" does not produce a pattern.
* <li>Any other entities are removed. Thus Jane is removed when considering the pattern between
* John and London in the sentence "John and Jane live in London". The pattern becomes "and live in".
* <li>The words in the pattern are not stop words. Thus from "and live in" we have a pattern
* based on "live". If we had the sentence "John frequently visits parts of London" then our
* pattern, without stop words, would be "frequently visits parts" (though the stop words list may
* be configured differently).
* </ul>
*
* The patterns are saved as the Pattern type.
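*
* As a rough illustration of the filtering rules above, the following minimal sketch applies them
* to the words found between two entities. It is not the PatternExtractor implementation: the
* class name PatternFilterSketch, the filter method and both word lists are hypothetical, and the
* real stop word list is configurable.
*
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the pattern filtering rules; not Baleen's PatternExtractor.
class PatternFilterSketch {

  // Assumed word lists, for illustration only.
  static final Set<String> NEGATIONS = new HashSet<>(Arrays.asList("no", "not", "neither"));
  static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("and", "in", "of", "the", "does"));

  // tokens: the words between the two entities in one sentence.
  // otherEntities: the words covered by any other entity in that span.
  // Returns the filtered pattern, or empty if it is negated or no words remain.
  static Optional<String> filter(List<String> tokens, Set<String> otherEntities) {
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      String word = token.toLowerCase();
      if (NEGATIONS.contains(word)) {
        return Optional.empty(); // negative patterns are discarded entirely
      }
      if (otherEntities.contains(token)) {
        continue; // other entities are removed from the pattern
      }
      if (STOP_WORDS.contains(word)) {
        continue; // stop words are removed from the pattern
      }
      kept.add(word);
    }
    return kept.isEmpty() ? Optional.empty() : Optional.of(String.join(" ", kept));
  }

  public static void main(String[] args) {
    // Words between "John" and "London" in "John and Jane live in London".
    System.out.println(
        filter(Arrays.asList("and", "Jane", "live", "in"), Collections.singleton("Jane")));
  }
}
```
*
* With these assumed lists, the words between John and London in "John and Jane live in London"
* reduce to "live", while "John does not live in London" yields no pattern at all.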
*
* Note that language features (sentence splitting, tokenisation, POS) need to have been performed
* before using the PatternExtractor. For example, use the OpenNlp annotator, which is likely part
* of the pipeline anyhow.
*
* A MongoPatternSaver consumer is available in Baleen which will output the patterns into a Mongo
* database. As usual the Mongo database is provided as a shared resource and the output collection
* can be tailored.
*
* Thus the pipeline is:
*
* <pre>
*
* annotators:
* - # Standard Baleen entity pipeline
* - patterns.PatternExtractor
*
* consumers:
* - MongoPatternSaver
* </pre>
*
*
* <h3>Identifying interaction words</h3>
*
* The process by which interaction words are identified is run as a Baleen job. It does not form
* part of a pipeline because it requires the patterns from multiple documents, which must already
* have been passed through the pattern extraction pipeline discussed above.
*
* This is based on the algorithm within [UBRME]. Other implementations are possible, for example
* simply saving all verbs / nouns which are seen more often than a specific threshold in the
* patterns.
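*
* The simpler threshold-based alternative mentioned above could look something like the sketch
* below. It is not part of Baleen and is not the [UBRME] algorithm: the class name
* FrequentWordsSketch and the frequentWords method are hypothetical, and for brevity it counts
* every word rather than restricting itself to verbs and nouns via POS tags.
*
```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: keep words seen in more than `threshold` saved patterns.
class FrequentWordsSketch {

  static Set<String> frequentWords(List<String> patterns, int threshold) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String pattern : patterns) {
      // Count each word once per pattern so a repeated word cannot dominate.
      Set<String> words = new TreeSet<>(Arrays.asList(pattern.trim().toLowerCase().split("\\s+")));
      for (String word : words) {
        if (!word.isEmpty()) {
          counts.merge(word, 1, Integer::sum);
        }
      }
    }
    Set<String> frequent = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > threshold) {
        frequent.add(entry.getKey());
      }
    }
    return frequent;
  }

  public static void main(String[] args) {
    List<String> patterns = Arrays.asList("lives", "lives near", "visits", "lives");
    System.out.println(frequentWords(patterns, 2));
  }
}
```
*
* A real implementation would read the patterns from the Mongo collection written by the
* extraction pipeline and would still need expert review of the resulting words.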
*
* We do not go into the details of the algorithm here, but it effectively looks for clusters of
* the extracted patterns. It then looks within those clusters for common words, which it saves as
* interaction words.
*
* <pre>
* mongo:
*   db: baleen
*   host: localhost
*
* job:
*   tasks:
*   - class: interactions.IdentifyInteractions
*     filename: output/interactions.csv
*
* </pre>
*
* The job writes interactions.csv. This file should be reviewed by a subject matter expert, who
* should remove rows where relations are invalid and may add rows (for example with different
* type constraints).
*
* <h3>Enhancing interactions</h3>
*
* The interaction words output from the above are those seen in the documents. Optionally, we can
* enhance and generalise these with another job.
*
* The job will read interactions.csv and add alternative words. For example, the verb "report"
* might have the alternatives "communicate" and "broadcast".
*
* <pre>
*
* job:
*   tasks:
*   - class: interactions.EnhanceInteractions
*     input: output/interactions.csv
*     output: output/interactions-enhanced.csv
* </pre>
*
* Again this is an opportunity to review the CSV file in order to ensure that the alternatives are
* correct.
*
* <h3>Upload interactions</h3>
*
* Finally we upload the interactions (manually created, basic, or enhanced) to Mongo. This job
* reads the CSV file and saves the data into two collections: the first, interactions, is in the
* Baleen Mongo gazetteer format, and the second, relationTypes, holds information about the
* interaction word types which are used by relation constraints.
*
*
* <pre>
* mongo:
*   db: baleen
*   host: localhost
*
* job:
*   tasks:
*   - class: interactions.UploadInteractionsToMongo
*     input: output/interactions-enhanced.csv
*
* </pre>
*/
//Dstl (c) Crown Copyright 2017
package uk.gov.dstl.baleen.annotators.patterns;