/**
 * <p>
 * This package and its subpackages implement various approaches to
 * information extraction. Some examples of use appear later in this
 * description. At the moment, five types of information extraction are
 * supported (where some of these have internal variants):
 * </p>
 * <ol>
 * <li>Regular expression based matching: These extractors are hand-written
 * and extract exactly what the regular expression matches.</li>
 * <li>Conditional Random Fields classifier: A sequence tagger based on a
 * CRF model that can be used for NER tagging and other sequence labeling
 * tasks.</li>
 * <li>Conditional Markov Model classifier: A classifier based on a
 * CMM model that can be used for NER tagging and other labeling tasks.</li>
 * <li>Hidden Markov Model based extractors: These can be either single
 * field extractors or two-level HMMs, where the individual
 * component models and how they are glued together are trained
 * separately. These models are trained automatically, but require tagged
 * training data.</li>
 * <li>Description extractor: This does higher-level NLP analysis of
 * sentences (using a POS tagger and chunker) to find sentences
 * that describe an object. This might be a biography of a person,
 * or a description of an animal. This module is fixed: there is
 * nothing to write or train (unless one wants to change
 * its internal behavior).</li>
 * </ol>
 * <p>
 * There are several demonstrations here which you can run (and several
 * other classes have <code>main()</code> methods which exhibit their
 * functionality):
 * </p>
 * <ol>
 * <li><code>NERGUI</code> is a simple GUI front-end to the NER tagging
 * components.</li>
 * <li><code>crf/NERGUI</code> is a simple GUI front-end to the CRF-based
 * NER tagging components. This version supports only the CRF-based NER
 * tagger.</li>
 * <li><code>demo/NERDemo</code> is a simple class exemplifying programmatic
 * use of the CRF-based NER tagger; a sketch of such use appears after this
 * list.</li>
 * </ol>
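 * <p>
 * The following is a minimal sketch of such programmatic use, modeled on
 * <code>demo/NERDemo</code>. The model path is a placeholder, and the exact
 * method names may differ between releases of the classifier API:
 * </p>
 * <pre>
 * import edu.stanford.nlp.ie.crf.CRFClassifier;
 *
 * public class NERSketch {
 *   public static void main(String[] args) throws Exception {
 *     // Load a serialized CRF model (the path here is a placeholder)
 *     CRFClassifier classifier =
 *         CRFClassifier.getClassifier("ner-model.ser.gz");
 *     // Tag a sentence; recognized entities are marked inline in the output
 *     String tagged = classifier.classifyToString(
 *         "Jim bought 300 shares of Acme Corp. in 2006.");
 *     System.out.println(tagged);
 *   }
 * }
 * </pre>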
 * <h3>Usage examples</h3>
 * <p>
 * 0. <i>Setup:</i> For all of these examples except 3, you need to be
 * connected to the Internet, so that the application's web search module
 * is able to connect to search engines. The web search
 * functionality is provided by the supplied <code>edu.stanford.nlp.web</code>
 * package. How web search works is controlled
 * by a <code>websearch.init</code> file in your current directory (if
 * none is present, you will get search results from AltaVista). If
 * you are registered to use the GoogleAPI, you should probably edit
 * this file so web queries can be sent to Google using their SOAP
 * interface. Even if not, you can specify additional or different
 * search engines to access in <code>websearch.init</code>.
 * A copy of this file is supplied in the distribution. The
 * <code>DescExtractor</code> in 4 also requires another init file so that
 * it can use the included part-of-speech tagger.
 * </p>
 * <p>
 * 1. Corporate Contact Information. This illustrates simple information
 * extraction from a web page. Using the included
 * <code>ExtractDemo.bat</code>, or by hand, run:
 * <code>java edu.stanford.nlp.ie.ExtractDemo</code>
 * </p>
 * <ul>
 * <li>Select as Extractor Directory the folder
 * <code>serialized-extractors/companycontact</code>.</li>
 * <li>Select as an Ontology the one in
 * <code>serialized-extractors/companycontact/Corporation-Information.kaon</code>.
 * </li>
 * <li>Enter <code>Corporation</code> as the Concept to extract.</li>
 * <li>You can then do various searches:
 * <ul>
 * <li>You can enter a URL, click <code>Extract</code>, and look at the results:
 * <ul>
 * <li><code>http://www.ziatech.com/</code></li>
 * <li><code>http://www.cs.stanford.edu/</code></li>
 * <li><code>http://www.ananova.com/business/story/sm_635565.html</code></li>
 * </ul>
 * The components work reasonably well on clean-ish text pages like
 * these. They work even better on text such as newswire or press
 * releases, as one can demonstrate either over the web or using the
 * command-line extractor.</li>
 * <li>You can do a search for a term and get extraction from the top
 * search hits, by entering a term in the "Search for words" box and
 * pressing "Extract":
 * <ul>
 * <li><code>Audiovox Corporation</code></li>
 * </ul>
 * Extraction is done over a number of pages from a search engine, and the
 * results from each are shown. Typically some of these pages
 * will have suitable content to extract, and some just won't.</li>
 * </ul>
 * </li>
 * </ul>
 * <p>2. Corporate Contact Information merged. This illustrates the addition
 * of information merging across web pages. Using the included
 * <code>MergeExtractDemo.bat</code>, or similarly do:</p>
 * <center><code>java edu.stanford.nlp.ie.ExtractDemo -m</code></center>
 * <p>
 * The <code>ExtractDemo</code> screen is similar, but adds a button to
 * Select a Merger.
 * </p>
 * <ul>
 * <li>Select an Extractor Directory and Ontology as above.</li>
 * <li>Click on "Select Merger", then navigate to
 * <code>serialized-extractors/mergers</code> and select the file
 * <code>unscoredmerger.obj</code>.</li>
 * <li>Enter the concept <code>Corporation</code> as before.</li>
 * <li>One can now search as above, by URL or by words, but merging is only
 * appropriate for a word search with multiple results. Try Search
 * for words:
 * <ul>
 * <li><code>Audiovox Corporation</code></li>
 * </ul>
 * and press "Extract". Results gradually appear. After all results have
 * been processed (this may take a few seconds), a merged best
 * extracted-information result will be produced and displayed as
 * the first of the results. "Merged Instance" will appear on the
 * bottom line corresponding to it, rather than a URL.</li>
 * </ul>
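 * <p>
 * As an aside, the serialized extractor and merger <code>.obj</code> files
 * used by <code>ExtractDemo</code> are stored Java objects. The following
 * is a minimal sketch of loading one by hand, assuming they were written
 * with standard Java object serialization (the class and variable names
 * here are illustrative only):
 * </p>
 * <pre>
 * import java.io.FileInputStream;
 * import java.io.ObjectInputStream;
 *
 * public class LoadMergerSketch {
 *   public static void main(String[] args) throws Exception {
 *     // Path from the walkthrough above; other .obj files load the same way
 *     ObjectInputStream in = new ObjectInputStream(new FileInputStream(
 *         "serialized-extractors/mergers/unscoredmerger.obj"));
 *     // The concrete type depends on how the object was written
 *     Object merger = in.readObject();
 *     in.close();
 *     System.out.println("Loaded: " + merger.getClass().getName());
 *   }
 * }
 * </pre>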
 * <p>3. Company names via direct use of an HMM information extractor.
 * One can also train, load, and use HMM information extractors directly,
 * without using any of the RDF-based KAON framework
 * (<code>http://kaon.semanticweb.org/</code>) used by
 * <code>ExtractDemo</code>.
 * </p>
 * <ul>
 * <li>The class <code>edu.stanford.nlp.ie.hmm.Tester</code> illustrates the
 * use of a pretrained HMM on data via the command-line interface:
 * <ul>
 * <li><code>cd serialized-extractors/companycontact/</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester cisco.txt company
 * company-name.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester EarningsReports.txt
 * company company-name.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester companytest.txt
 * company company-name.hmm</code></li>
 * </ul>
 * <p>
 * The first shows the HMM running on an unmarked-up file containing a single
 * document. The second shows a <code>Corpus</code> of several
 * documents, separated by lines of ENDOFDOC, which is used as a document
 * delimiter inside a <code>Corpus</code>. This second use of
 * <code>Tester</code> normally expects an annotated corpus on which it can
 * score its answers. Here, the corpus is unannotated, and so some of the
 * output is inappropriate, but it shows what is selected as the company name
 * for each document (it's <i>mostly</i> correct...).
 * The final example shows it running on a corpus that does have answers
 * marked in it. It does the testing with the XML elements stripped, but
 * then uses them to evaluate correctness.
 * </p>
 * </li>
 * <li>To train one's own HMM, one needs data in which one or
 * more fields are annotated in the style of an XML
 * element, with all the documents in one file, separated by
 * lines with <code>ENDOFDOC</code> on them (a small sample of this
 * format appears after this list). Then one can
 * train (and then test) as follows. Training an HMM
 * (optimizing all its probabilities) takes a <i>long</i> time
 * (it depends on the speed of the computer, but 10 minutes or
 * so to adjust probabilities for a fixed structure, and often
 * hours if one additionally attempts structure learning).
 * <ol>
 * <li><code>cd edu/stanford/nlp/ie/training/</code></li>
 * <li><code>java -server edu.stanford.nlp.ie.hmm.Trainer companydata.txt
 * company mycompany.hmm</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.HMMSingleFieldExtractor Company
 * mycompany.hmm mycompany.obj</code></li>
 * <li><code>java edu.stanford.nlp.ie.hmm.Tester testdoc.txt company
 * mycompany.hmm</code></li>
 * </ol>
 * The third step converts a serialized HMM into the serialized objects used
 * in <code>ExtractDemo</code>. Note that <code>company</code>
 * in the second command must match the element name in the
 * marked-up data that you will train on, while
 * <code>Company</code> in the third command must match the
 * relation name in the ontology over which you will extract with
 * <code>mycompany.obj</code>. These two names need not be the
 * same. The last step then runs the trained HMM on a file.
 * </li>
 * </ul>
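 * <p>
 * For concreteness, here is a tiny invented sample of the training-data
 * format just described. Fields are marked with XML-style elements whose
 * name (<code>company</code> here) must match the one passed to
 * <code>Trainer</code>, and documents are separated by lines containing
 * <code>ENDOFDOC</code>:
 * </p>
 * <pre>
 * Shares of &lt;company&gt;Acme Widgets Inc.&lt;/company&gt; rose sharply
 * after the announcement.
 * ENDOFDOC
 * &lt;company&gt;Globex Corporation&lt;/company&gt; said on Tuesday that
 * it would expand its overseas operations.
 * ENDOFDOC
 * </pre>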
 * <p>4. Extraction of descriptions (such as biographical information about
 * a person or a description of an animal) from a web page.
 * This component uses a POS tagger, and looks for
 * a path to it in the file
 * <code>descextractor.init</code> in the current directory. So,
 * you should be in the root directory of the current archive,
 * which has such a file. Double-click on the included
 * <code>MergeExtractDemo.bat</code> in that directory, or by hand
 * one can equivalently do:
 * <code>java edu.stanford.nlp.ie.ExtractDemo -m</code>
 * </p>
 * <ul>
 * <li>Select as Extractor Directory the folder
 * <code>serialized-extractors/description</code>.</li>
 * <li>Select as an Ontology the one in
 * <code>serialized-extractors/description/Entity-NameDescription.kaon</code>.
 * </li>
 * <li>Click on "Select Merger", then navigate to
 * <code>serialized-extractors/mergers</code> and select the file
 * <code>unscoredmerger.obj</code>.</li>
 * <li>Enter <code>Entity</code> as the Concept to extract.</li>
 * <li>You can then do various searches for people or animals by entering
 * words in the "Search for words" box and pressing Extract:
 * <ul>
 * <li><code>Gareth Evans</code></li>
 * <li><code>Tawny Frogmouth</code></li>
 * <li><code>Christopher Manning</code></li>
 * <li><code>Joshua Nkomo</code></li>
 * </ul>
 * The first search will be slower than subsequent searches, as it takes a
 * while to load the part-of-speech tagger.
 * </li>
 * </ul>
 */
package edu.stanford.nlp.ie;