package net.sf.jlinkgrammar;

/**
 * Notes about AND
 * <p>
 * A large fraction of the code of this parser deals with handling
 * conjunctions. This comment (combined with reading the paper) should
 * give an idea of how it works.
 * <p>
 * First of all, we need a more detailed discussion of strings, what they
 * match, etc. (This entire discussion ignores the labels, which are
 * semantically the same as the leading upper case letters of the
 * connector.)
 * <p>
 * We'll deal with infinite strings from an alphabet of three types of
 * characters: "*", "^", and ordinary characters (denoted "a" and "b").
 * (The end of a string should be thought of as an infinite sequence of
 * "*"s.)
 * <p>
 * Let match(s) be the set of strings that will match the string s. This
 * is defined as follows. A string t is in match(s) if (1) its leading
 * upper case letters exactly match those of s, and (2) traversing through
 * both strings, from left to right in lockstep, no mismatch is found
 * between corresponding letters. A mismatch is a pair of differing
 * ordinary characters, or a "^" and any ordinary letter, or two "^"s.
 * In other words, a match is exactly a "*" against anything, or two
 * identical ordinary letters.
 * <p>
 * Alternative definition of the set match(s):
 * {t | t is obtained from s by replacing each "^" (and possibly some
 * other characters) by "*"s, and replacing any original "*" in s by any
 * character (including "^")}.
 * <p>
 * Theorem: if t is in match(s) then s is in match(t).
 * <p>
 * It is also a theorem that given any two strings s and t, there exists a
 * unique string u with the property that
 * <p>
 * match(u) = match(s) intersect match(t).
 * <p>
 * This string is called the GCD of s and t. Here are some examples:
 * <ul>
 * <li> GCD(N*a, Nb) = Nba
 * <li> GCD(Na, Nb) = N^
 * <li> GCD(Nab, Nb) = N^b
 * <li> GCD(N^, N*a) = N^a
 * <li> GCD(N^, N) = N^
 * <li> GCD(N^^, N^) = N^^
 * </ul>
 * <p>
 * We need an algorithm for computing the GCD of two strings. Here is
 * one.
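Before the GCD procedure itself, the matching rule just defined can be written down directly. The sketch below is self-contained; the class and method names are hypothetical, not the parser's own:

```java
public class SubscriptMatch {
    // Connector strings are conceptually followed by infinitely many "*"s.
    static char at(String s, int i) { return i < s.length() ? s.charAt(i) : '*'; }

    // True when t is in match(s). By the theorem above the relation is
    // symmetric, so the argument order does not matter.
    static boolean match(String s, String t) {
        // The leading upper case letters (the connector type) must agree exactly.
        int i = 0, j = 0;
        while (i < s.length() && Character.isUpperCase(s.charAt(i))) i++;
        while (j < t.length() && Character.isUpperCase(t.charAt(j))) j++;
        if (!s.substring(0, i).equals(t.substring(0, j))) return false;
        // Walk the subscripts in lockstep: "*" matches anything, ordinary
        // letters match only themselves, and "^" matches only "*".
        int n = Math.max(s.length() - i, t.length() - j);
        for (int k = 0; k < n; k++) {
            char a = at(s, i + k), b = at(t, j + k);
            if (a == '*' || b == '*') continue;
            if (a != b || a == '^') return false; // differing letters, "^"/letter, or "^"/"^"
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(match("S^", "S*a")); // true
        System.out.println(match("Sa", "Sb"));  // false: differing ordinary letters
        System.out.println(match("S^", "S^"));  // false: two "^"s mismatch
    }
}
```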
 * <p>
 * First consume the upper case letters (which must be equal, otherwise
 * there is no intersection), issuing them. Traverse the rest of the
 * characters of s and t in lockstep until there is nothing left but
 * "*"s. If the two characters are:
 * <p>
 * <ul>
 * <li> "a" and "a", issue "a"
 * <li> "a" and "b", issue "^"
 * <li> "a" and "*", issue "a"
 * <li> "*" and "*", issue "*"
 * <li> "*" and "^", issue "^"
 * <li> "a" and "^", issue "^"
 * <li> "^" and "^", issue "^"
 * </ul>
 * <p>
 * A simple case analysis suffices to show that any string that matches
 * the right side must match both of the left sides, and any string not
 * matching the right side must fail to match at least one of the left sides.
 * <p>
 * It follows that the GCD operator is associative and commutative.
 * (There must be a name for a mathematical structure with these properties.)
 * <p>
 * To elaborate further on this theory, define the notion of two strings
 * matching in the dual sense as follows: s and t dual-match if
 * match(s) is contained in match(t) or vice versa.
 * <p>
 * Full development of this theory could lead to a more efficient
 * algorithm for this problem. I'll defer this until such time as it
 * appears necessary.
 * <p>
 * We need a data structure that stores a set of fat links. Each fat
 * link has a number (called its label). The fat link operates in lieu of
 * a collection of links. The particular stuff it is a substitute for is
 * defined by a disjunct. This disjunct is stored in the data structure.
 * <p>
 * The type of a disjunct is defined by the sequence of connector types
 * (defined by their upper case letters) that comprises it. Each entry
 * of the label_table[] points to a list of disjuncts that have the same
 * type (a hash table is used so that, given a disjunct, we can efficiently
 * compute the element of the label table to which it belongs).
 * <p>
 * We begin by loading up the label table with all of the possible
 * fat links that occur through the words of the sentence.
 * These are
 * obtained by taking every sub-range of the connectors of each disjunct
 * (containing the center). We also compute the closure (under the GCD
 * operator) of these disjuncts and store these in the
 * label_table as well. Each disjunct in this table has a string which represents
 * the subscripts of all of its connectors (and their multi-connector bits).
 * <p>
 * It is possible to generate a fat connector for any one of the
 * disjuncts in the label_table. This connector's label field is given
 * the label from the disjunct from which it arose. Its string field
 * is taken from the string of the disjunct (mentioned above). It will be
 * given a priority with a value of UP_priority or DOWN_priority (depending
 * on how it will be used). A connector of UP_priority can match one of
 * DOWN_priority, but neither of these can match any other priority.
 * (Of course, a fat connector can match only another fat connector with
 * the same label.)
 * <p>
 * The paper describes in some detail how disjuncts are given to words
 * and to "and" and ",", etc. Each word in the sentence gets many more
 * new disjuncts. For each contiguous set of connectors containing (or
 * adjacent to) the center of the disjunct, we generate a fat link, and
 * replace these connectors in the word by a fat link. (Actually we do
 * this twice: once pointing to the right, once to the left.) These fat
 * links have priority UP_priority.
 * <p>
 * What do we generate for ","? For each type of fat link (each label)
 * we make a disjunct that has two down connectors (to the right and left)
 * and one up connector (to the right). There will be a unique way of
 * hooking together a comma-separated and-list.
 * <p>
 * The disjuncts on "and" are more complicated. Here we have to do just what
 * we did for comma (but also include the up link to the left), then
 * we also have to allow the process to terminate. So, there is a disjunct
 * with two down fat links, and between them are the original thin links.
 * These are said to "blossom" out. However, this is not all that is
 * necessary. It's possible for an and-list to be part of another and-list
 * with a differently labeled fat connector. To make this possible, we
 * regroup the just-blossomed disjuncts (in all possible ways about the center)
 * and install them as fat links. If this sounds like a lot of disjuncts --
 * it is! The program is currently fairly slow on long sentences with "and".
 * <p>
 * It is slightly non-obvious that the fat links in a linkage constructed
 * from disjuncts defined in this way form a binary tree. Naturally,
 * connectors with UP_priority point up the tree, and those with DOWN_priority
 * point down the tree.
 * <p>
 * Think of the string x on the connector as representing a set X of strings:
 * X = match(x). So, for example, if x="S^" then match(x) = {"S", "S*a",
 * "S*b", etc.}. The matching rules for UP and DOWN priority connectors
 * are such that as you go up (the tree of ands) the X sets get no larger.
 * So, for example, an "Sb" pointing up can match an "S^" pointing down.
 * (Because more stuff can match "Sb" than can match "S^".)
 * This guarantees that whatever connector ultimately gets used after the
 * fat connector blossoms out (see below), it is a powerful enough connector
 * to be able to match any of the connectors associated with it.
 * <p>
 * One problem with the scheme just described is that it sometimes generates
 * essentially the same linkage several times. This happens if there is
 * a gap in the connective power, and the mismatch can be moved around in
 * different ways. Here is an example of how this happens
 * (left is DOWN, right is UP):
 * <p>
 * <pre>
 *    Sa <--. S^ <--. S      or     Sa <--. Sa <--. S
 *       fat    thin                   fat    thin
 * </pre>
 * Here two of the disjunct types are given by "S^" and "Sa". Notice that
 * the criterion of shrinking the matching set is satisfied by the fat
 * link (traversing from left to right). How do I eliminate one of these?
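The canonization technique described next keeps a linkage only when its fat disjuncts cannot be replaced by ones with smaller match sets, so it needs a way to compare match sets. Since match(GCD(s,t)) = match(s) intersect match(t), the containment match(s) inside match(t) holds exactly when GCD(s,t) = s. Here is a self-contained sketch of the GCD case table together with that containment test (the names are hypothetical, and the real implementation differs):

```java
public class SubscriptGcd {
    // Connector strings are conceptually followed by infinitely many "*"s.
    static char at(String s, int i) { return i < s.length() ? s.charAt(i) : '*'; }

    // GCD of two connector strings, following the case table above:
    // equal characters pass through, "*" yields the other character,
    // and every remaining combination collapses to "^".
    static String gcd(String s, String t) {
        int i = 0, j = 0;
        while (i < s.length() && Character.isUpperCase(s.charAt(i))) i++;
        while (j < t.length() && Character.isUpperCase(t.charAt(j))) j++;
        if (!s.substring(0, i).equals(t.substring(0, j)))
            return null; // different connector types: the intersection is empty
        StringBuilder out = new StringBuilder(s.substring(0, i));
        int n = Math.max(s.length() - i, t.length() - j);
        for (int k = 0; k < n; k++) {
            char a = at(s, i + k), b = at(t, j + k);
            if (a == b) out.append(a);          // "a"/"a", "*"/"*", "^"/"^"
            else if (a == '*') out.append(b);   // "*"/"b" -> "b", "*"/"^" -> "^"
            else if (b == '*') out.append(a);
            else out.append('^');               // "a"/"b" or "a"/"^" -> "^"
        }
        while (out.length() > i && out.charAt(out.length() - 1) == '*')
            out.setLength(out.length() - 1);    // trim the trailing "*" padding
        return out.toString();
    }

    // match(s) is contained in match(t) exactly when GCD(s, t) = s.
    // A strictly smaller match set additionally requires that s and t differ.
    static boolean matchSetContained(String s, String t) {
        return s.equals(gcd(s, t));
    }

    public static void main(String[] args) {
        System.out.println(gcd("N*a", "Nb"));              // Nba
        System.out.println(gcd("Na", "Nb"));               // N^
        System.out.println(matchSetContained("S^", "Sa")); // true
    }
}
```

For example, matchSetContained("S^", "Sa") is true while matchSetContained("Sa", "S^") is false, which is exactly the "Sa" versus "S^" gap in connective power exhibited above.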
 * <p>
 * I use the technique of canonization. I generate all the linkages. There
 * is then a procedure that can check to see if a linkage is canonical.
 * If it is, it's used; otherwise it's ignored. It's claimed that exactly
 * one canonical member of each equivalence class will be generated.
 * We basically insist that the intermediate fat disjuncts (ones that
 * have a fat link pointing down) are all minimal -- that is, that they
 * cannot be replaced by another with a (strictly) smaller match set.
 * If one is not minimal, then the linkage is rejected.
 * <p>
 * Here's a proof that this is correct. Consider the set of equivalent
 * linkages that are generated. Pick a disjunct that is the root of
 * its tree. Consider the set of all disjuncts which occur in that position
 * among the equivalent linkages. The GCD of all of these can fit in that
 * position (it matches down the tree, since its match set has gotten
 * smaller, and it also matches the THIN links). Since the GCD is put
 * on "and", this particular one will be generated. Therefore rejecting
 * a linkage in which a root fat disjunct can be replaced by a smaller one
 * is ok (since the smaller one will be generated separately). What about
 * a fat disjunct that is not the root? We consider the set of linkages in
 * which the root is minimal (the ones for which it's not have already been
 * eliminated). Now, consider one of the children of the root in precisely
 * the way we just considered the root. The same argument holds. The only
 * difference is that the root node gives another constraint on how small
 * you can make the disjunct -- so, within these constraints, if we can go
 * smaller, we reject.
 * <p>
 * The code to do all of this is fairly ugly, but I think it works.
 * <p>
 * Problems with this stuff:
 * <p>
 * 1) There is obviously a combinatorial explosion that takes place.
 * As the number of disjuncts (and the number of their subscripts)
 * increases, the number of disjuncts that get put onto "and" will
 * increase tremendously. When we made the transcript for the tech
 * report (around August 1991) most of the sentences were processed
 * in well under 10 seconds. Now (Jan 1992), some of these sentences
 * take ten times longer. As of this writing I don't really know the
 * reason, other than just the fact that the dictionary entries are
 * more complex than they used to be. The number of linkages has also
 * increased significantly.
 * <p>
 * 2) Each element of an and-list must be attached through only one word.
 * This disallows "there is time enough and space enough for both of us",
 * and many other reasonable-sounding things. The combinatorial
 * explosion that would occur if you allowed two different connection
 * points would be tremendous, and the number of solutions would also
 * probably go up by another order of magnitude. Perhaps if there
 * were strong constraints on the types of connectors for which this
 * was allowed, then this would be a conceivable prospect.
 * <p>
 * 3) A multi-connector must be either all "outside" or all "inside" the and.
 * For example, "the big black dog and cat ran" has only two
 * linkages (instead of three).
 * <p>
 * Possible bug: It seems that the following two linkages should be the
 * same under the canonical linkage test. Could this have to do with the
 * pluralization system?
 * <p>
 * > I am big and the bike and the car were broken
 * Accepted (4 linkages, 4 with no P.P. violations) at stage 1
 * Linkage 1, cost vector = (0, 0, 18)
 * <ul>
 * <li> +------Spx-----+
 * <li> +-----CC-----+------Wd------+-d^^*i^-+ |
 * <li> +-Wd-+Spi+-Pa+ | +--Ds-+d^^*+ +-Ds-+ +--Pv-+
 * <li> | | | | | | | | | | | |
 * <li> // /// I.p am big.a and the bike.n and the car.n were broken
 * <li>
 * <li> ///// RW <---RW---. RW /////
 * <li> ///// Wd <---Wd---. Wd I.p
 * <li> I.p CC <---CC---. CC and
 * <li> I.p Sp*i <---Spii-.
 * Spi am
 * <li> am Pa <---Pa---. Pa big.a
 * <li> and Wd <---Wd---. Wd and
 * <li> bike.n d^s** 6<---d^^*i. d^^*i 6 and
 * <li> the D <---Ds---. Ds bike.n
 * <li> and Sp <---Spx--. Spx were
 * <li> and d^^*i 6<---d^^*i. d^s** 6 car.n
 * <li> the D <---Ds---. Ds car.n
 * <li> were Pv <---Pv---. Pv broken
 * </ul>
 * <p>
 * (press return for another)
 * <p>
 * >
 * <p>
 * Linkage 2, cost vector = (0, 0, 18)
 * <ul>
 * <li> +------Spx-----+
 * <li> +-----CC-----+------Wd------+-d^s**^-+ |
 * <li> +-Wd-+Spi+-Pa+ | +--Ds-+d^s*+ +-Ds-+ +--Pv-+
 * <li> | | | | | | | | | | | |
 * <li> // /// I.p am big.a and the bike.n and the car.n were broken
 * <li>
 * <li> ///// RW <---RW---. RW /////
 * <li> ///// Wd <---Wd---. Wd I.p
 * <li> I.p CC <---CC---. CC and
 * <li> I.p Sp*i <---Spii-. Spi am
 * <li> am Pa <---Pa---. Pa big.a
 * <li> and Wd <---Wd---. Wd and
 * <li> bike.n d^s** 6<---d^s**. d^s** 6 and
 * <li> the D <---Ds---. Ds bike.n
 * <li> and Sp <---Spx--. Spx were
 * <li> and d^s** 6<---d^s**. d^s** 6 car.n
 * <li> the D <---Ds---. Ds car.n
 * <li> were Pv <---Pv---. Pv broken
 * </ul>
 */
public class AndData {
    int LT_bound;            // first unused label (index into label_table[])
    int LT_size;             // allocated length of label_table[]
    Disjunct label_table[];  // one entry per label: the disjuncts of that type
    LabelNode hash_table[] = new LabelNode[GlobalBean.HT_SIZE]; // locates a disjunct's label_table entry by type
}
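The label_table/hash_table pairing described in the comment (same-type disjuncts share a label_table entry, found via a hash on the disjunct's type) can be illustrated with a rough sketch. All names and the table size below are assumptions for illustration, not the parser's real implementation:

```java
import java.util.Arrays;
import java.util.List;

public class LabelTableSketch {
    static final int HT_SIZE = 512; // assumed here; the real size is GlobalBean.HT_SIZE

    // The "type" of a disjunct: the sequence of upper case connector
    // prefixes, which is what groups disjuncts into label_table entries.
    static String typeOf(List<String> connectors) {
        StringBuilder sb = new StringBuilder();
        for (String c : connectors) {
            int i = 0;
            while (i < c.length() && Character.isUpperCase(c.charAt(i))) i++;
            sb.append(c, 0, i).append(' ');
        }
        return sb.toString().trim();
    }

    // Hash bucket for a disjunct, so that same-type disjuncts land in
    // the same chain of the hash table.
    static int bucket(List<String> connectors) {
        return Math.floorMod(typeOf(connectors).hashCode(), HT_SIZE);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("Sa", "O*b");
        List<String> d2 = Arrays.asList("Sb", "Oc");
        System.out.println(typeOf(d1));               // S O
        System.out.println(bucket(d1) == bucket(d2)); // true: same type, same bucket
    }
}
```

Two disjuncts whose connectors differ only in subscripts have the same type string, so they hash to the same bucket and end up in the same label_table entry.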