/**
 * <h1>Multi-pass Sieve Coreference Resolution System</h1>
 * <a href="#authors">[authors]</a>
 * <a href="#current">[current results]</a>
 * <a href="#changes">[changes]</a>
 * <a href="#usage">[usage]</a>
 * <p>
 * This package implements the multi-pass sieve coreference resolution system of Raghunathan et al. (EMNLP 2010).
 * <p>
 * Note that the code in this package does not implement mention detection. All results reported here use gold mentions (just as in the paper).
 * However, the DeterministicCorefAnnotator in StanfordCoreNLP implements a simple mention detection component, so this code can be used to perform coreference resolution on raw text.
 * <p>
 * Note that this code already differs from the system reported in the paper.
 * After the EMNLP paper, two additional sieves were included. The current code gives slightly better scores than those in the paper.
 * <h2><a name="authors">Authors</a></h2>
 * <ul>
 * <li>Karthik Raghunathan
 * <li>Heeyoung Lee
 * <li>Sudarshan Rangarajan
 * <li>Jenny Finkel
 * <li>Nathanael Chambers
 * <li>Mihai Surdeanu
 * <li>Dan Jurafsky
 * <li>Christopher Manning
 * </ul>
 * <h2><a name="current">Current Results</a></h2>
 * <pre>
 * ----------------------------------------------------------------------------
 *                       MUC              B cubed            Pairwise
 *                  P     R     F1     P     R     F1     P     R     F1
 * ----------------------------------------------------------------------------
 * ACE2004 dev   | 84.5  75.7  79.8 | 88.0  75.8  81.4 | 78.6  53.8  63.9
 * ACE2004 test  | 80.4  72.9  76.4 | 85.1  76.4  80.5 | 68.7  48.9  57.1
 * ACE2004 nwire | 83.8  74.3  78.8 | 86.9  73.7  79.7 | 78.1  51.7  62.2
 * MUC6 test     | 90.5  69.0  78.3 | 90.5  62.5  73.9 | 89.3  56.1  68.9
 * ----------------------------------------------------------------------------
 * </pre>
 * <h2><a name="changes">Changes</a></h2>
 * <h3>August 26, 2010</h3>
 * <p>
 * This release is generally similar to the code used for EMNLP 2010,
 * with one additional sieve: relaxed exact string match.<br>
 * The scores may also differ because of changes in the parser or
 * NER.
 * <p>
 * Results:
 * <pre>
 * ----------------------------------------------------------------------------
 *                       MUC              B cubed            Pairwise
 *                  P     R     F1     P     R     F1     P     R     F1
 * ----------------------------------------------------------------------------
 * ACE2004 dev   | 84.1  73.9  78.7 | 88.3  74.2  80.7 | 80.0  51.0  62.3
 * ACE2004 test  | 80.5  72.3  76.2 | 85.4  75.9  80.4 | 68.7  47.8  56.4
 * ACE2004 nwire | 83.8  72.8  77.9 | 87.5  72.1  79.0 | 79.3  47.6  59.5
 * MUC6 test     | 90.3  68.9  78.2 | 90.5  62.3  73.8 | 89.4  55.5  68.5
 * ----------------------------------------------------------------------------
 * </pre>
 * <h2><a name="usage">Usage</a></h2>
 * <p>
 * <h3> Running coreference resolution on raw text </h3>
 * This software is now fully incorporated in StanfordCoreNLP, so all you have to do is add the dcoref annotator to the "annotators" property in StanfordCoreNLP.
 * For example:
 * <pre>
 * annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
 * </pre>
 * The properties for dcoref are the following (all are required except sievePasses):
 * <pre>
 * dcoref.demonym
 * dcoref.animate
 * dcoref.inanimate
 * dcoref.male
 * dcoref.neutral
 * dcoref.female
 * dcoref.plural
 * dcoref.singular
 * sievePasses      // If omitted, the default value is used.
 * </pre>
 * <p>
 * See StanfordCoreNLP for more details.
 * </p>
 * <p>
 * <h3> How to replicate the results in our EMNLP2010 paper</h3>
 * To replicate the results in the paper, run:
 * <pre>
 * java -Xmx8g edu.stanford.nlp.dcoref.SieveCoreferenceSystem -props &lt;properties file&gt;
 * </pre>
 * A sample properties file (coref.properties) is included in the dcoref package.
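 * <p>
 * Before launching a run, it can help to confirm that the required dcoref properties listed above are set.
 * The following is only a minimal sketch under stated assumptions: it uses nothing but java.util.Properties,
 * and the class DcorefPropsCheck and its method missingRequired are hypothetical names that are not part of this package.
 * The path in the example is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.dcoref.
public class DcorefPropsCheck {

  // Property names copied from the list of required dcoref properties.
  // sievePasses is left out because a default is used when it is omitted.
  static final List<String> REQUIRED = Arrays.asList(
      "dcoref.demonym", "dcoref.animate", "dcoref.inanimate",
      "dcoref.male", "dcoref.neutral", "dcoref.female",
      "dcoref.plural", "dcoref.singular");

  // Returns the required keys that are absent or blank, in the order above.
  static List<String> missingRequired(Properties props) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED) {
      String value = props.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    // Placeholder path, for illustration only.
    props.setProperty("dcoref.demonym", "/path/to/demonyms.txt");
    System.out.println("Missing dcoref properties: " + missingRequired(props));
  }
}
```

 * The check only looks at whether each key has a non-blank value; it does not verify that the files exist or are well formed.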
 * The properties file includes the following:
 * <pre>
 * annotators = pos, lemma, ner  // annotators needed for coreference resolution
 * pos.model                     // For POS model
 * ner.model.3class
 * ner.model.7class              // For NER
 * ner.model.MISCclass
 * parser.model                  // For parser
 * parser.maxlen = 100
 * dcoref.demonym                // The path for a file that includes a list of demonyms
 * dcoref.animate                // The list of animate/inanimate mentions (Ji and Lin, 2009)
 * dcoref.inanimate
 * dcoref.male                   // The list of male/neutral/female mentions (Bergsma and Lin, 2006)
 * dcoref.neutral                // Neutral means a mention that is usually referred to by 'it'
 * dcoref.female
 * dcoref.plural                 // The list of plural/singular mentions (Bergsma and Lin, 2006)
 * dcoref.singular
 * sievePasses                   // Sieve passes - each class is defined in dcoref/sievepasses/
 * logFile                       // Path for log file for coref system evaluation
 * ace2004 or mucfile            // Use either ace2004 or mucfile (not both)
 *                               // ace2004: path for the directory containing ACE2004 files
 *                               // mucfile: path for the MUC file
 * </pre>
 * This system can process both ACE2004 and MUC6 corpora in their original formats.
 * Examples of each corpus format are given below.
 * MUC6:
 * <pre>
 * ...
 * &lt;s&gt; By/IN proposing/VBG &lt;COREF ID="13" TYPE="IDENT" REF="6" MIN="date"&gt; a/DT meeting/NN date/NN&lt;/COREF&gt; ,/, &lt;COREF ID="14" TYPE="IDENT" REF="0"&gt;
 * &lt;ORGANIZATION&gt; Eastern/NNP&lt;/ORGANIZATION&gt;&lt;/COREF&gt; moved/VBD one/CD step/NN closer/JJR toward/IN reopening/VBG current/JJ high-cost/JJ contract/NN agreements/NNS with/IN &lt;COREF ID="15" TYPE="IDENT" REF="8" MIN="unions"&gt;&lt;COREF ID="16" TYPE="IDENT" REF="14"&gt; its/PRP$&lt;/COREF&gt; unions/NNS&lt;/COREF&gt; ./. &lt;/s&gt;
 * ...
 * </pre>
 * ACE2004:
 * <pre>
 * ...
 * &lt;document DOCID="20001115_AFP_ARB.0212.eng"&gt;
 * &lt;entity ID="20001115_AFP_ARB.0212.eng-E1" TYPE="ORG" SUBTYPE="Educational" CLASS="SPC"&gt;
 *   &lt;entity_mention ID="1-47" TYPE="NAM" LDCTYPE="NAM"&gt;
 *     &lt;extent&gt;
 *       &lt;charseq START="475" END="506"&gt;the Globalization Studies Center&lt;/charseq&gt;
 *     &lt;/extent&gt;
 *     &lt;head&gt;
 *       &lt;charseq START="479" END="506"&gt;Globalization Studies Center&lt;/charseq&gt;
 *     &lt;/head&gt;
 *   &lt;/entity_mention&gt;
 * ...
 * </pre>
 */
package edu.stanford.nlp.dcoref;