package net.sf.jlinkgrammar;

/**
 * Notes about AND
 * <p>
 * A large fraction of the code of this parser deals with handling
 * conjunctions. This comment (combined with reading the paper) should
 * give an idea of how it works.
 * <p>
 * First of all, we need a more detailed discussion of strings, what they
 * match, etc. (This entire discussion ignores the labels, which are
 * semantically the same as the leading upper case letters of the
 * connector.)
 * <p>
 * We'll deal with infinite strings from an alphabet of three types of
 * characters: "*", "^", and ordinary characters (denoted "a" and "b").
 * (The end of a string should be thought of as an infinite sequence of
 * "*"s.)
 * <p>
 * Let match(s) be the set of strings that will match the string s. This
 * is defined as follows. A string t is in match(s) if (1) its leading
 * upper case letters exactly match those of s, and (2) traversing through
 * both strings, from left to right in lockstep, no mismatch is found
 * between corresponding letters. A mismatch is a pair of differing
 * ordinary characters, or a "^" and any ordinary letter, or two "^"s.
 * In other words, a match is exactly a "*" against anything, or two
 * identical ordinary letters.
 * <p>
 * Alternative definition of the set match(s):
 * {t | t is obtained from s by replacing each "^" (and possibly some
 * other characters) by "*"s, and replacing any original "*" in s by any
 * character (including "^")}.
 * <p>
 * Theorem: if t is in match(s) then s is in match(t).
 * <p>
 * It is also a theorem that given any two strings s and t, there exists a
 * unique string u with the property that
 * <p>
 * match(u) = match(s) intersect match(t).
 * <p>
 * This string is called the GCD of s and t. Here are some examples:
 * <ul>
 * <li> GCD(N*a, Nb) = Nba
 * <li> GCD(Na, Nb) = N^
 * <li> GCD(Nab, Nb) = N^b
 * <li> GCD(N^, N*a) = N^a
 * <li> GCD(N^, N) = N^
 * <li> GCD(N^^, N^) = N^^
 * </ul>
 * <p>
 * We need an algorithm for computing the GCD of two strings. Here is
 * one.
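Before the GCD procedure itself, the matching rule just defined can be written down directly. The sketch below is self-contained; the class and method names are hypothetical, not the parser's own:

```java
public class SubscriptMatch {
    // Connector strings are conceptually followed by infinitely many "*"s.
    static char at(String s, int i) { return i < s.length() ? s.charAt(i) : '*'; }

    // True when t is in match(s). By the theorem above the relation is
    // symmetric, so the argument order does not matter.
    static boolean match(String s, String t) {
        // The leading upper case letters (the connector type) must agree exactly.
        int i = 0, j = 0;
        while (i < s.length() && Character.isUpperCase(s.charAt(i))) i++;
        while (j < t.length() && Character.isUpperCase(t.charAt(j))) j++;
        if (!s.substring(0, i).equals(t.substring(0, j))) return false;
        // Walk the subscripts in lockstep: "*" matches anything, ordinary
        // letters match only themselves, and "^" matches only "*".
        int n = Math.max(s.length() - i, t.length() - j);
        for (int k = 0; k < n; k++) {
            char a = at(s, i + k), b = at(t, j + k);
            if (a == '*' || b == '*') continue;
            if (a != b || a == '^') return false; // differing letters, "^"/letter, or "^"/"^"
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(match("S^", "S*a")); // true
        System.out.println(match("Sa", "Sb"));  // false: differing ordinary letters
        System.out.println(match("S^", "S^"));  // false: two "^"s mismatch
    }
}
```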
 * <p>
 * First consume the upper case letters (which must be equal, otherwise
 * there is no intersection), issuing them. Traverse the rest of the
 * characters of s and t in lockstep until there is nothing left but
 * "*"s. If the two characters are:
 * <p>
 * <ul>
 * <li> "a" and "a", issue "a"
 * <li> "a" and "b", issue "^"
 * <li> "a" and "*", issue "a"
 * <li> "*" and "*", issue "*"
 * <li> "*" and "^", issue "^"
 * <li> "a" and "^", issue "^"
 * <li> "^" and "^", issue "^"
 * </ul>
 * <p>
 * A simple case analysis suffices to show that any string that matches
 * the right side must match both of the left sides, and any string not
 * matching the right side must fail to match at least one of the left sides.
 * <p>
 * It follows that the GCD operator is associative and commutative.
 * (There must be a name for a mathematical structure with these properties.)
 * <p>
 * To elaborate further on this theory, define the notion of two strings
 * matching in the dual sense as follows: s and t dual-match if
 * match(s) is contained in match(t) or vice versa.
 * <p>
 * Full development of this theory could lead to a more efficient
 * algorithm for this problem. I'll defer this until such time as it
 * appears necessary.
 * <p>
 * We need a data structure that stores a set of fat links. Each fat
 * link has a number (called its label). The fat link operates in lieu of
 * a collection of links. The particular stuff it is a substitute for is
 * defined by a disjunct. This disjunct is stored in the data structure.
 * <p>
 * The type of a disjunct is defined by the sequence of connector types
 * (defined by their upper case letters) that comprises it. Each entry
 * of the label_table[] points to a list of disjuncts that have the same
 * type (a hash table is used so that, given a disjunct, we can efficiently
 * compute the element of the label table to which it belongs).
 * <p>
 * We begin by loading up the label table with all of the possible
 * fat links that occur through the words of the sentence.
 * These are
 * obtained by taking every sub-range of the connectors of each disjunct
 * (containing the center). We also compute the closure (under the GCD
 * operator) of these disjuncts and store these in the
 * label_table as well. Each disjunct in this table has a string which represents
 * the subscripts of all of its connectors (and their multi-connector bits).
 * <p>
 * It is possible to generate a fat connector for any one of the
 * disjuncts in the label_table. This connector's label field is given
 * the label from the disjunct from which it arose. Its string field
 * is taken from the string of the disjunct (mentioned above). It will be
 * given a priority with a value of UP_priority or DOWN_priority (depending
 * on how it will be used). A connector of UP_priority can match one of
 * DOWN_priority, but neither of these can match any other priority.
 * (Of course, a fat connector can match only another fat connector with
 * the same label.)
 * <p>
 * The paper describes in some detail how disjuncts are given to words
 * and to "and" and ",", etc. Each word in the sentence gets many more
 * new disjuncts. For each contiguous set of connectors containing (or
 * adjacent to) the center of the disjunct, we generate a fat link, and
 * replace these connectors in the word by a fat link. (Actually we do
 * this twice: once pointing to the right, once to the left.) These fat
 * links have priority UP_priority.
 * <p>
 * What do we generate for ","? For each type of fat link (each label)
 * we make a disjunct that has two down connectors (to the right and left)
 * and one up connector (to the right). There will be a unique way of
 * hooking together a comma-separated and-list.
 * <p>
 * The disjuncts on "and" are more complicated. Here we have to do just what
 * we did for comma (but also include the up link to the left), then
 * we also have to allow the process to terminate. So, there is a disjunct
 * with two down fat links, and between them are the original thin links.
 * These are said to "blossom" out. However, this is not all that is
 * necessary. It's possible for an and-list to be part of another and-list
 * with a differently labeled fat connector. To make this possible, we
 * regroup the just-blossomed disjuncts (in all possible ways about the center)
 * and install them as fat links. If this sounds like a lot of disjuncts --
 * it is! The program is currently fairly slow on long sentences with "and".
 * <p>
 * It is slightly non-obvious that the fat links in a linkage constructed
 * from disjuncts defined in this way form a binary tree. Naturally,
 * connectors with UP_priority point up the tree, and those with DOWN_priority
 * point down the tree.
 * <p>
 * Think of the string x on the connector as representing a set X of strings:
 * X = match(x). So, for example, if x="S^" then match(x) = {"S", "S*a",
 * "S*b", etc.}. The matching rules for UP and DOWN priority connectors
 * are such that as you go up (the tree of ands) the X sets get no larger.
 * So, for example, an "Sb" pointing up can match an "S^" pointing down.
 * (Because more stuff can match "Sb" than can match "S^".)
 * This guarantees that whatever connector ultimately gets used after the
 * fat connector blossoms out (see below), it is a powerful enough connector
 * to be able to match any of the connectors associated with it.
 * <p>
 * One problem with the scheme just described is that it sometimes generates
 * essentially the same linkage several times. This happens if there is
 * a gap in the connective power, and the mismatch can be moved around in
 * different ways. Here is an example of how this happens
 * (left is DOWN, right is UP):
 * <p>
 * <pre>
 *    Sa <--. S^ <--. S      or     Sa <--. Sa <--. S
 *       fat    thin                   fat    thin
 * </pre>
 * Here two of the disjunct types are given by "S^" and "Sa". Notice that
 * the criterion of shrinking the matching set is satisfied by the fat
 * link (traversing from left to right). How do I eliminate one of these?
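The canonization technique described next keeps a linkage only when its fat disjuncts cannot be replaced by ones with smaller match sets, so it needs a way to compare match sets. Since match(GCD(s,t)) = match(s) intersect match(t), the containment match(s) inside match(t) holds exactly when GCD(s,t) = s. Here is a self-contained sketch of the GCD case table together with that containment test (the names are hypothetical, and the real implementation differs):

```java
public class SubscriptGcd {
    // Connector strings are conceptually followed by infinitely many "*"s.
    static char at(String s, int i) { return i < s.length() ? s.charAt(i) : '*'; }

    // GCD of two connector strings, following the case table above:
    // equal characters pass through, "*" yields the other character,
    // and every remaining combination collapses to "^".
    static String gcd(String s, String t) {
        int i = 0, j = 0;
        while (i < s.length() && Character.isUpperCase(s.charAt(i))) i++;
        while (j < t.length() && Character.isUpperCase(t.charAt(j))) j++;
        if (!s.substring(0, i).equals(t.substring(0, j)))
            return null; // different connector types: the intersection is empty
        StringBuilder out = new StringBuilder(s.substring(0, i));
        int n = Math.max(s.length() - i, t.length() - j);
        for (int k = 0; k < n; k++) {
            char a = at(s, i + k), b = at(t, j + k);
            if (a == b) out.append(a);          // "a"/"a", "*"/"*", "^"/"^"
            else if (a == '*') out.append(b);   // "*"/"b" -> "b", "*"/"^" -> "^"
            else if (b == '*') out.append(a);
            else out.append('^');               // "a"/"b" or "a"/"^" -> "^"
        }
        while (out.length() > i && out.charAt(out.length() - 1) == '*')
            out.setLength(out.length() - 1);    // trim the trailing "*" padding
        return out.toString();
    }

    // match(s) is contained in match(t) exactly when GCD(s, t) = s.
    // A strictly smaller match set additionally requires that s and t differ.
    static boolean matchSetContained(String s, String t) {
        return s.equals(gcd(s, t));
    }

    public static void main(String[] args) {
        System.out.println(gcd("N*a", "Nb"));              // Nba
        System.out.println(gcd("Na", "Nb"));               // N^
        System.out.println(matchSetContained("S^", "Sa")); // true
    }
}
```

For example, matchSetContained("S^", "Sa") is true while matchSetContained("Sa", "S^") is false, which is exactly the "Sa" versus "S^" gap in connective power exhibited above.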
 * <p>
 * I use the technique of canonization. I generate all the linkages. There
 * is then a procedure that can check to see if a linkage is canonical.
 * If it is, it's used; otherwise it's ignored. It's claimed that exactly
 * one canonical member of each equivalence class will be generated.
 * We basically insist that the intermediate fat disjuncts (ones that
 * have a fat link pointing down) are all minimal -- that is, that they
 * cannot be replaced by another with a (strictly) smaller match set.
 * If one is not minimal, then the linkage is rejected.
 * <p>
 * Here's a proof that this is correct. Consider the set of equivalent
 * linkages that are generated. Pick a disjunct that is the root of
 * its tree. Consider the set of all disjuncts which occur in that position
 * among the equivalent linkages. The GCD of all of these can fit in that
 * position (it matches down the tree, since its match set has gotten
 * smaller, and it also matches the THIN links). Since the GCD is put
 * on "and", this particular one will be generated. Therefore rejecting
 * a linkage in which a root fat disjunct can be replaced by a smaller one
 * is ok (since the smaller one will be generated separately). What about
 * a fat disjunct that is not the root? We consider the set of linkages in
 * which the root is minimal (the ones for which it's not have already been
 * eliminated). Now, consider one of the children of the root in precisely
 * the way we just considered the root. The same argument holds. The only
 * difference is that the root node gives another constraint on how small
 * you can make the disjunct -- so, within these constraints, if we can go
 * smaller, we reject.
 * <p>
 * The code to do all of this is fairly ugly, but I think it works.
 * <p>
 * Problems with this stuff:
 * <p>
 * 1) There is obviously a combinatorial explosion that takes place.
 * As the number of disjuncts (and the number of their subscripts)
 * increases, the number of disjuncts that get put onto "and" will
 * increase tremendously. When we made the transcript for the tech
 * report (around August 1991) most of the sentences were processed
 * in well under 10 seconds. Now (Jan 1992), some of these sentences
 * take ten times longer. As of this writing I don't really know the
 * reason, other than just the fact that the dictionary entries are
 * more complex than they used to be. The number of linkages has also
 * increased significantly.
 * <p>
 * 2) Each element of an and-list must be attached through only one word.
 * This disallows "there is time enough and space enough for both of us",
 * and many other reasonable-sounding things. The combinatorial
 * explosion that would occur if you allowed two different connection
 * points would be tremendous, and the number of solutions would also
 * probably go up by another order of magnitude. Perhaps if there
 * were strong constraints on the types of connectors for which this
 * was allowed, then this would be a conceivable prospect.
 * <p>
 * 3) A multi-connector must be either all "outside" or all "inside" the and.
 * For example, "the big black dog and cat ran" has only two
 * linkages (instead of three).
 * <p>
 * Possible bug: It seems that the following two linkages should be the
 * same under the canonical linkage test. Could this have to do with the
 * pluralization system?
 * <p>
 * > I am big and the bike and the car were broken
 * Accepted (4 linkages, 4 with no P.P. violations) at stage 1
 * Linkage 1, cost vector = (0, 0, 18)
 * <ul>
 * <li> +------Spx-----+
 * <li> +-----CC-----+------Wd------+-d^^*i^-+ |
 * <li> +-Wd-+Spi+-Pa+ | +--Ds-+d^^*+ +-Ds-+ +--Pv-+
 * <li> | | | | | | | | | | | |
 * <li> // /// I.p am big.a and the bike.n and the car.n were broken
 * <li>
 * <li> ///// RW <---RW---. RW /////
 * <li> ///// Wd <---Wd---. Wd I.p
 * <li> I.p CC <---CC---. CC and
 * <li> I.p Sp*i <---Spii-.
 * Spi am
 * <li> am Pa <---Pa---. Pa big.a
 * <li> and Wd <---Wd---. Wd and
 * <li> bike.n d^s** 6<---d^^*i. d^^*i 6 and
 * <li> the D <---Ds---. Ds bike.n
 * <li> and Sp <---Spx--. Spx were
 * <li> and d^^*i 6<---d^^*i. d^s** 6 car.n
 * <li> the D <---Ds---. Ds car.n
 * <li> were Pv <---Pv---. Pv broken
 * </ul>
 * <p>
 * (press return for another)
 * <p>
 * >
 * <p>
 * Linkage 2, cost vector = (0, 0, 18)
 * <ul>
 * <li> +------Spx-----+
 * <li> +-----CC-----+------Wd------+-d^s**^-+ |
 * <li> +-Wd-+Spi+-Pa+ | +--Ds-+d^s*+ +-Ds-+ +--Pv-+
 * <li> | | | | | | | | | | | |
 * <li> // /// I.p am big.a and the bike.n and the car.n were broken
 * <li>
 * <li> ///// RW <---RW---. RW /////
 * <li> ///// Wd <---Wd---. Wd I.p
 * <li> I.p CC <---CC---. CC and
 * <li> I.p Sp*i <---Spii-. Spi am
 * <li> am Pa <---Pa---. Pa big.a
 * <li> and Wd <---Wd---. Wd and
 * <li> bike.n d^s** 6<---d^s**. d^s** 6 and
 * <li> the D <---Ds---. Ds bike.n
 * <li> and Sp <---Spx--. Spx were
 * <li> and d^s** 6<---d^s**. d^s** 6 car.n
 * <li> the D <---Ds---. Ds car.n
 * <li> were Pv <---Pv---. Pv broken
 * </ul>
 */
public class AndData {
    int LT_bound;            // first unused label (index into label_table[])
    int LT_size;             // allocated length of label_table[]
    Disjunct label_table[];  // one entry per label: the disjuncts of that type
    LabelNode hash_table[] = new LabelNode[GlobalBean.HT_SIZE]; // locates a disjunct's label_table entry by type
}
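The label_table/hash_table pairing described in the comment (same-type disjuncts share a label_table entry, found via a hash on the disjunct's type) can be illustrated with a rough sketch. All names and the table size below are assumptions for illustration, not the parser's real implementation:

```java
import java.util.Arrays;
import java.util.List;

public class LabelTableSketch {
    static final int HT_SIZE = 512; // assumed here; the real size is GlobalBean.HT_SIZE

    // The "type" of a disjunct: the sequence of upper case connector
    // prefixes, which is what groups disjuncts into label_table entries.
    static String typeOf(List<String> connectors) {
        StringBuilder sb = new StringBuilder();
        for (String c : connectors) {
            int i = 0;
            while (i < c.length() && Character.isUpperCase(c.charAt(i))) i++;
            sb.append(c, 0, i).append(' ');
        }
        return sb.toString().trim();
    }

    // Hash bucket for a disjunct, so that same-type disjuncts land in
    // the same chain of the hash table.
    static int bucket(List<String> connectors) {
        return Math.floorMod(typeOf(connectors).hashCode(), HT_SIZE);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("Sa", "O*b");
        List<String> d2 = Arrays.asList("Sb", "Oc");
        System.out.println(typeOf(d1));               // S O
        System.out.println(bucket(d1) == bucket(d2)); // true: same type, same bucket
    }
}
```

Two disjuncts whose connectors differ only in subscripts have the same type string, so they hash to the same bucket and end up in the same label_table entry.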