/**
 * <p>
 * This package and its subpackages implement various approaches to
 * information extraction. Some examples of use appear later in this
 * description. At the moment, five types of information extraction are
 * supported (where some of these have internal variants):
 * </p>
 * <ol>
 * <li>Regular expression based matching: These extractors are hand-written
 * and extract exactly what the regular expression matches.</li>
 * <li>Conditional Random Fields classifier: A sequence tagger based on a
 * CRF model that can be used for NER tagging and other sequence labeling
 * tasks.</li>
 * <li>Conditional Markov Model classifier: A classifier based on a
 * CMM model that can be used for NER tagging and other labeling tasks.</li>
 * <li>Hidden Markov Model based extractors: These can be either single
 * field extractors or two-level HMMs, where the individual
 * component models and how they are glued together are trained
 * separately. These models are trained automatically, but require tagged
 * training data.</li>
 * <li>Description extractor: This does higher-level NLP analysis of
 * sentences (using a POS tagger and chunker) to find sentences
 * that describe an object. This might be a biography of a person,
 * or a description of an animal. This module is fixed: there is
 * nothing to write or train (unless one wants to change
 * its internal behavior).</li>
 * </ol>
 * <p>
 * There are several demonstrations here which you can run (and several
 * other classes have <code>main()</code> methods which exhibit their
 * functionality):
 * </p>
 * <ol>
 * <li><code>NERGUI</code> is a simple GUI front-end to the NER tagging
 * components.</li>
 * <li><code>crf/NERGUI</code> is a simple GUI front-end to the CRF-based
 * NER tagging components. This version supports only the CRF-based NER
 * tagger.</li>
 * <li><code>demo/NERDemo</code> is a simple class exemplifying programmatic
 * use of the CRF-based NER tagger; a sketch of such use appears after this
 * list.</li>
 * </ol>
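 * <p>
 * The following is a minimal sketch of such programmatic use, modeled on
 * <code>demo/NERDemo</code>. The model path is a placeholder, and the exact
 * method names may differ between releases of the classifier API:
 * </p>
 * <pre>
 * import edu.stanford.nlp.ie.crf.CRFClassifier;
 *
 * public class NERSketch {
 *   public static void main(String[] args) throws Exception {
 *     // Load a serialized CRF model (the path here is a placeholder)
 *     CRFClassifier classifier =
 *         CRFClassifier.getClassifier("ner-model.ser.gz");
 *     // Tag a sentence; recognized entities are marked inline in the output
 *     String tagged = classifier.classifyToString(
 *         "Jim bought 300 shares of Acme Corp. in 2006.");
 *     System.out.println(tagged);
 *   }
 * }
 * </pre>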
 * <h3>Usage examples</h3>
 * <p>
 * 0. <i>Setup:</i> For all of these examples except 3, you need to be
 * connected to the Internet, so that the application's web search module
 * is able to connect to search engines. The web search
 * functionality is provided by the supplied <code>edu.stanford.nlp.web</code>
 * package. How web search works is controlled
 * by a <code>websearch.init</code> file in your current directory (if
 * none is present, you will get search results from AltaVista). If
 * you are registered to use the GoogleAPI, you should probably edit
 * this file so web queries can be sent to Google using their SOAP
 * interface. Even if not, you can specify additional or different
 * search engines to access in <code>websearch.init</code>.
 * A copy of this file is supplied in the distribution. The
 * <code>DescExtractor</code> in 4 also requires another init file so that
 * it can use the included part-of-speech tagger.
 * </p>
 * <p>
 * 1. Corporate Contact Information. This illustrates simple information
 * extraction from a web page. Using the included
 * <code>ExtractDemo.bat</code>, or by hand, run:
 * <code>java edu.stanford.nlp.ie.ExtractDemo</code>
 * </p>
 * <ul>
 * <li>Select as Extractor Directory the folder
 * <code>serialized-extractors/companycontact</code>.</li>
 * <li>Select as an Ontology the one in
 * <code>serialized-extractors/companycontact/Corporation-Information.kaon</code>.
 * </li>
 * <li>Enter <code>Corporation</code> as the Concept to extract.</li>
 * <li>You can then do various searches:
 * <ul>
 * <li>You can enter a URL, click <code>Extract</code>, and look at the results:
 * <ul>
 * <li><code>http://www.ziatech.com/</code></li>
 * <li><code>http://www.cs.stanford.edu/</code></li>
 * <li><code>http://www.ananova.com/business/story/sm_635565.html</code></li>
 * </ul>
 * The components work reasonably well on clean-ish text pages like
 * these. They work even better on text such as newswire or press
 * releases, as one can demonstrate either over the web or using the
 * command-line extractor.</li>
 * <li>You can do a search for a term and get extraction from the top
 * search hits, by entering a term in the "Search for words" box and
 * pressing "Extract":
 * <ul>
 * <li><code>Audiovox Corporation</code></li>
 * </ul>
 * Extraction is done over a number of pages from a search engine, and the
 * results from each are shown. Typically some of these pages
 * will have suitable content to extract, and some just won't.</li>
 * </ul>
 * </li>
 * </ul>
 * <p>2. Corporate Contact Information merged. This illustrates the addition
 * of information merging across web pages. Using the included
 * <code>MergeExtractDemo.bat</code>, or similarly do:</p>
 * <center><code>java edu.stanford.nlp.ie.ExtractDemo -m</code></center>
 * <p>
 * The <code>ExtractDemo</code> screen is similar, but adds a button to
 * Select a Merger.
 * </p>
 * <ul>
 * <li>Select an Extractor Directory and Ontology as above.</li>
 * <li>Click on "Select Merger", then navigate to
 * <code>serialized-extractors/mergers</code> and select the file
 * <code>unscoredmerger.obj</code>.</li>
 * <li>Enter the concept <code>Corporation</code> as before.</li>
 * <li>One can now search as above, by URL or by words, but merging is only
 * appropriate for a word search with multiple results. Try Search
 * for words:
 * <ul>
 * <li><code>Audiovox Corporation</code></li>
 * </ul>
 * and press "Extract". Results gradually appear. After all results have
 * been processed (this may take a few seconds), a merged best
 * extracted-information result will be produced and displayed as
 * the first of the results. "Merged Instance" will appear on the
 * bottom line corresponding to it, rather than a URL.</li>
 * </ul>
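 * <p>
 * As an aside, the serialized extractor and merger <code>.obj</code> files
 * used by <code>ExtractDemo</code> are stored Java objects. The following
 * is a minimal sketch of loading one by hand, assuming they were written
 * with standard Java object serialization (the class and variable names
 * here are illustrative only):
 * </p>
 * <pre>
 * import java.io.FileInputStream;
 * import java.io.ObjectInputStream;
 *
 * public class LoadMergerSketch {
 *   public static void main(String[] args) throws Exception {
 *     // Path from the walkthrough above; other .obj files load the same way
 *     ObjectInputStream in = new ObjectInputStream(new FileInputStream(
 *         "serialized-extractors/mergers/unscoredmerger.obj"));
 *     // The concrete type depends on how the object was written
 *     Object merger = in.readObject();
 *     in.close();
 *     System.out.println("Loaded: " + merger.getClass().getName());
 *   }
 * }
 * </pre>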
 * <p>3. Company names via direct use of an HMM information extractor.
 * One can also train, load, and use HMM information extractors directly,
 * without using any of the RDF-based KAON framework
 * (<code>http://kaon.semanticweb.org/</code>) used by
 * <code>ExtractDemo</code>.
 * </p>
 * <ul>
 * <li>The class <code>edu.stanford.nlp.ie.hmm.Tester</code> illustrates the
 * use of a pretrained HMM on data via the command-line interface:
 * <ul>
 * <li><code>cd serialized-extractors/companycontact/</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester cisco.txt company
 * company-name.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester EarningsReports.txt
 * company company-name.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester companytest.txt
 * company company-name.hmm</code></li>
 * </ul>
 * <p>
 * The first shows the HMM running on an unmarked-up file containing a single
 * document. The second shows a <code>Corpus</code> of several
 * documents, separated by lines of ENDOFDOC, which is used as a document
 * delimiter inside a <code>Corpus</code>. This second use of
 * <code>Tester</code> normally expects an annotated corpus on which it can
 * score its answers. Here, the corpus is unannotated, and so some of the
 * output is inappropriate, but it shows what is selected as the company name
 * for each document (it's <i>mostly</i> correct...).
 * The final example shows it running on a corpus that does have answers
 * marked in it. It does the testing with the XML elements stripped, but
 * then uses them to evaluate correctness.
 * </p>
 * </li>
 * <li>To train one's own HMM, one needs data in which one or
 * more fields are annotated in the style of an XML
 * element, with all the documents in one file, separated by
 * lines with <code>ENDOFDOC</code> on them (a small sample of this
 * format appears after this list). Then one can
 * train (and then test) as follows. Training an HMM
 * (optimizing all its probabilities) takes a <i>long</i> time
 * (it depends on the speed of the computer, but 10 minutes or
 * so to adjust probabilities for a fixed structure, and often
 * hours if one additionally attempts structure learning).
 * <ol>
 * <li><code>cd edu/stanford/nlp/ie/training/</code></li>
 * <li><code>java -server edu.stanford.nlp.ie.hmm.Trainer companydata.txt
 * company mycompany.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.HMMSingleFieldExtractor Company
 * mycompany.hmm mycompany.obj</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester testdoc.txt company
 * mycompany.hmm</code></li>
 * </ol>
 * The third step converts a serialized HMM into the serialized objects used
 * in <code>ExtractDemo</code>. Note that <code>company</code>
 * in the second command must match the element name in the
 * marked-up data that you will train on, while
 * <code>Company</code> in the third command must match the
 * relation name in the ontology over which you will extract with
 * <code>mycompany.obj</code>. These two names need not be the
 * same. The last step then runs the trained HMM on a file.
 * </li>
 * </ul>
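 * <p>
 * For concreteness, here is a tiny invented sample of the training-data
 * format just described. Fields are marked with XML-style elements whose
 * name (<code>company</code> here) must match the one passed to
 * <code>Trainer</code>, and documents are separated by lines containing
 * <code>ENDOFDOC</code>:
 * </p>
 * <pre>
 * Shares of &lt;company&gt;Acme Widgets Inc.&lt;/company&gt; rose sharply
 * after the announcement.
 * ENDOFDOC
 * &lt;company&gt;Globex Corporation&lt;/company&gt; said on Tuesday that
 * it would expand its overseas operations.
 * ENDOFDOC
 * </pre>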
 * <p>4. Extraction of descriptions (such as biographical information about
 * a person or a description of an animal) from a web page.
 * This component uses a POS tagger, and looks for
 * a path to it in the file
 * <code>descextractor.init</code> in the current directory. So,
 * you should be in the root directory of the current archive,
 * which has such a file. Double-click on the included
 * <code>MergeExtractDemo.bat</code> in that directory, or by hand
 * one can equivalently do:
 * <code>java edu.stanford.nlp.ie.ExtractDemo -m</code>
 * </p>
 * <ul>
 * <li>Select as Extractor Directory the folder
 * <code>serialized-extractors/description</code>.</li>
 * <li>Select as an Ontology the one in
 * <code>serialized-extractors/description/Entity-NameDescription.kaon</code>.
 * </li>
 * <li>Click on "Select Merger", then navigate to
 * <code>serialized-extractors/mergers</code> and select the file
 * <code>unscoredmerger.obj</code>.</li>
 * <li>Enter <code>Entity</code> as the Concept to extract.</li>
 * <li>You can then do various searches for people or animals by entering
 * words in the "Search for words" box and pressing Extract:
 * <ul>
 * <li><code>Gareth Evans</code></li>
 * <li><code>Tawny Frogmouth</code></li>
 * <li><code>Christopher Manning</code></li>
 * <li><code>Joshua Nkomo</code></li>
 * </ul>
 * The first search will be slower than subsequent searches, as it takes a
 * while to load the part-of-speech tagger.
 * </li>
 * </ul>
 */
package edu.stanford.nlp.ie;