package it.unimi.dsi.parser;

/*		 
 * DSI utilities
 *
 * Copyright (C) 2005-2009 Sebastiano Vigna 
 *
 *  This library is free software; you can redistribute it and/or modify it
 *  under the terms of the GNU Lesser General Public License as published by the Free
 *  Software Foundation; either version 2.1 of the License, or (at your option)
 *  any later version.
 *
 *  This library is distributed in the hope that it will be useful, but
 *  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 *  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 *  for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */


import it.unimi.dsi.fastutil.objects.Reference2ObjectArrayMap;
import it.unimi.dsi.fastutil.objects.Reference2ObjectMap;
import it.unimi.dsi.fastutil.objects.ReferenceArraySet;
import it.unimi.dsi.fastutil.objects.ReferenceSet;
import it.unimi.dsi.fastutil.objects.ReferenceSets;
import it.unimi.dsi.lang.MutableString;
import it.unimi.dsi.util.TextPattern;
import it.unimi.dsi.parser.callback.Callback;


/** A fast, lightweight, on-demand (X)HTML parser.
 * 
 * <p>The bullet parser has been written with two specific goals in mind:
 * web crawling and targeted data extraction from massive web data sets. 
 * To be usable in such environments, a parser must obey a number of 
 * restrictions:
 * <ul>
 * <li>it should avoid excessive object creation (which, for instance,
 * forbids a significant usage of Java strings);
 * <li>it should tolerate invalid syntax and recover reasonably; in fact,
 * it should never throw exceptions;
 * <li>it should perform actual parsing only on a settable feature subset:
 * there is no reason to parse the attributes of a <samp>P</samp>
 * element while searching for links;
 * <li>it should parse HTML as a <em>regular language</em>, and leave context-free
 * properties (e.g., stack maintenance and repair) to suitably designed callbacks.
 * </ul>
 * 
 * <p>Thus, the bullet parser is in fact not a parser. It is a bunch of
 * spaghetti code that analyses a stream of characters pretending that
 * it is an (X)HTML document. It has a very defensive attitude towards
 * the character stream it is parsing, but at the same time it is
 * forgiving of all typical (X)HTML mistakes.
 * 
 * <p>The bullet parser is officially StringFree™. 
 * <a href="http://dsiutils.dsi.unimi.it/docs/it/unimi/dsi/lang/MutableString.html"><code>MutableString</code>s</a>
 * are used for internal processing, and Java strings are used only to return attribute
 * values. All internal maps are {@linkplain it.unimi.dsi.fastutil.objects.Reference2ObjectMap reference-based maps}
 * from <a href="http://fastutil.dsi.unimi.it/"><samp>fastutil</samp></a>, which
 * helps to further accelerate the parsing process.
 * 
 * <h2>HTML data</h2>
 * 
 * <p>The bullet parser uses attributes and methods of {@link it.unimi.dsi.parser.HTMLFactory},
 * {@link it.unimi.dsi.parser.Element}, {@link it.unimi.dsi.parser.Attribute}
 * and {@link it.unimi.dsi.parser.Entity}.
 * Thus, for instance, whenever an element is to be passed around it is one
 * of the shared objects contained in {@link it.unimi.dsi.parser.Element}
 * (e.g., {@link it.unimi.dsi.parser.Element#BODY}).
 * 
 * <h2>Callbacks</h2>
 * 
 * <p>The result of the parsing process is the invocation of a callback.
 * The {@linkplain it.unimi.dsi.parser.callback.Callback callback interface}
 * of the bullet parser closely resembles SAX2, but it has some additional
 * methods targeted at (X)HTML, such as {@link it.unimi.dsi.parser.callback.Callback#cdata(it.unimi.dsi.parser.Element,char[],int,int)},
 * which returns characters found in a CDATA section (e.g., a stylesheet).
 * 
 * <p>Each callback must configure the parser by requesting the analyses
 * and callback invocations it requires. A callback that wants to
 * extract and tokenise text, for instance, will certainly require
 * {@link #parseText(boolean) parseText(true)}, but not {@link #parseTags(boolean) parseTags(true)}.
 * On the other hand, a callback wishing to extract links will need
 * to {@linkplain #parseAttribute(Attribute) parse selectively} certain attribute types.
 * 
 * <p>A more precise description follows.
 * 
 * <h2>Writing callbacks</h2>
 * 
 * <p>The first important issue is what must be requested from the parser. A newly
 * created parser does not invoke any callback. It is up to every callback
 * to add features so that it can do its job. Remember that since many
 * callbacks can be {@linkplain it.unimi.dsi.parser.callback.ComposedCallbackBuilder composed},
 * you must always <em>add</em> features, never <em>remove</em> them, and moreover
 * your callbacks must be ready to be invoked with features they did not
 * request (e.g., attribute types added by another callback).
 * 
 * <p>The following parse features
 * may be configured; most of them are just boolean features, a.k.a. flags.
 * Unless otherwise specified, all flags are set to false by default (e.g., by
 * default the parser will <em>not</em> parse tags):
 * <ul>
 * <li><em>tags</em> ({@link #parseTags(boolean)} method): whether tags
 * should be parsed;
 * <li><em>attributes</em> ({@link #parseAttributes(boolean)} and
 * {@link #parseAttribute(Attribute)} methods):
 * whether attributes should be parsed (of course, setting this flag is useless
 * if you are not parsing tags); note that setting this flag will just
 * activate the attribute parsing feature, but you must also
 * {@linkplain #parseAttribute(Attribute) register} every attribute 
 * whose value you want to obtain.
 * <li><em>text</em> ({@link #parseText(boolean)} method): whether text
 * should be parsed; if this flag is set, the parser will call the
 * {@link it.unimi.dsi.parser.callback.Callback#characters(char[], int, int, boolean)}
 * method for every text chunk found.
 * <li><em>CDATA sections</em> ({@link #parseCDATA(boolean)} method): whether CDATA
 * sections (stylesheets &amp; scripts)
 * should be parsed; if this flag is set, the parser will call the
 * {@link it.unimi.dsi.parser.callback.Callback#cdata(Element,char[],int,int)}
 * method for every CDATA section found.
 * </ul>
 * 
 * <h2>Invoking the parser</h2>
 * 
 * <p>After {@linkplain #setCallback(Callback) setting the parser callback}, 
 * you just call {@link #parse(char[], int, int)}.
 */

public class BulletParser {

	private static final boolean DEBUG = false;

	/** Scanning text. */
	protected static final int STATE_TEXT = 0;
	/** Scanning the name of an opening tag. */
	protected static final int STATE_BEFORE_START_TAG_NAME = 1;
	/** Scanning the name of a closing tag. */
	protected static final int STATE_BEFORE_END_TAG_NAME = 2;
	/** Scanning attribute name/value pairs. */
	protected static final int STATE_IN_START_TAG = 3;
	/** Scanning a closing tag. */
	protected static final int STATE_IN_END_TAG = 4;

	/** The maximum Unicode value accepted for a numeric entity. */
	protected static final int MAX_ENTITY_VALUE = 65535;
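	/** The base for hexadecimal numeric entities. */
	protected static final int HEXADECIMAL = 16;
	/** The maximum number of characters scanned in a hexadecimal numeric entity, counted from the ampersand (so at most five digits follow <samp>&amp;#x</samp>). */
	protected static final int MAX_HEX_ENTITY_LENGTH = 8;
	/** The maximum number of characters scanned in a decimal numeric entity, counted from the ampersand (so at most seven digits follow <samp>&amp;#</samp>). */
	protected static final int MAX_DEC_ENTITY_LENGTH = 9;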

	/** Closing tag for a script element. */
	protected static final TextPattern SCRIPT_CLOSE_TAG_PATTERN = new TextPattern( "</script>", TextPattern.CASE_INSENSITIVE );
	/** Closing tag for a style element. */
	protected static final TextPattern STYLE_CLOSE_TAG_PATTERN = new TextPattern( "</style>", TextPattern.CASE_INSENSITIVE );

	/** An array containing the non-space whitespace. */
	protected static final char[] NONSPACE_WHITESPACE = { '\n', '\r', '\t' };
	/** An array, parallel to {@link #NONSPACE_WHITESPACE}, containing spaces. */
	protected static final char[] SPACE = { ' ', ' ', ' ' };
	
	/** Closed comment. It should be "-->", but mistakes are common. */
	protected static final TextPattern CLOSED_COMMENT = new TextPattern( "->" );
	/** Closed ASP or similar tag. */
	protected static final TextPattern CLOSED_PERCENT = new TextPattern( "%>" );
	/** Closed processing instruction. */
	protected static final TextPattern CLOSED_PIC = new TextPattern( "?>" );
	/** Closed section (conditional, etc.). */
	protected static final TextPattern CLOSED_SECTION = new TextPattern( "]>" );
	/** Closed section (conditional, CDATA, etc.). */
	protected static final TextPattern CLOSED_CDATA = new TextPattern( "]]>" );
	// TODO: what is this?
	//protected static final TextPattern CLOSED_BOH = new TextPattern( "!>" );

	/** The parsing factory used by this parser. */
	public final ParsingFactory factory;
	
	/** The callback of this parser. */
	protected Callback callback;
	/** A map from attributes to attribute values. */
	protected Reference2ObjectMap<Attribute,MutableString> attrMap;
	/** Whether we should invoke the text handler. */
	protected boolean parseText;
	/** Whether we should invoke the CDATA section handler. */
	protected boolean parseCDATA;
	/** Whether we should parse tags. */
	protected boolean parseTags;
	/** Whether we should parse attributes. */
	protected boolean parseAttributes;
	/**
	 * The subset of attributes whose values will be actually parsed (if, of
	 * course, {@link #parseAttributes} is true).
	 */
	protected ReferenceArraySet<Attribute> parsedAttrs = new ReferenceArraySet<Attribute>();
	/**
	 * An externally visible, immutable subset of attributes whose values will
	 * be actually parsed.
	 */
	public ReferenceSet<Attribute> parsedAttributes = ReferenceSets.unmodifiable( parsedAttrs );
	/** The character represented by the last scanned entity. */
	protected char lastEntity;
	
	/** Creates a new bullet parser. */
	public BulletParser( final ParsingFactory factory ) {
		this.factory = factory;
	}

	/** Creates a new bullet parser using the default factory {@link HTMLFactory#INSTANCE}. */
	public BulletParser() {
		this( HTMLFactory.INSTANCE );
	}

	/**
	 * Returns whether this parser will invoke the text handler.
	 * 
	 * @return whether this parser will invoke the text handler.
	 * @see #parseText(boolean)
	 */
	public boolean parseText() {
		return parseText;
	}

	/**
	 * Sets the text handler flag.
	 * 
	 * @param parseText
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseText( final boolean parseText ) {
		this.parseText = parseText;
		return this;
	}

	/**
	 * Returns whether this parser will invoke the CDATA-section handler.
	 * 
	 * @return whether this parser will invoke the CDATA-section handler.
	 * @see #parseCDATA(boolean)
	 */
	public boolean parseCDATA() {
		return parseCDATA;
	}

	/**
	 * Sets the CDATA-section handler flag.
	 * 
	 * @param parseCDATA
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseCDATA( final boolean parseCDATA ) {
		this.parseCDATA = parseCDATA;
		return this;
	}

	/**
	 * Returns whether this parser will parse tags and invoke element handlers.
	 * 
	 * @return whether this parser will parse tags and invoke element handlers.
	 * @see #parseTags(boolean)
	 */
	public boolean parseTags() {
		return parseTags;
	}

	/**
	 * Sets whether this parser will parse tags and invoke element handlers.
	 * 
	 * @param parseTags
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseTags( final boolean parseTags ) {
		this.parseTags = parseTags;
		return this;
	}

	/**
	 * Returns whether this parser will parse attributes.
	 * 
	 * @return whether this parser will parse attributes.
	 * @see #parseAttributes(boolean)
	 */
	public boolean parseAttributes() {
		return parseAttributes;
	}

	/**
	 * Sets the attribute parsing flag.
	 * 
	 * @param parseAttributes
	 *            the new value for the flag.
	 * @return this parser.
	 */
	public BulletParser parseAttributes( final boolean parseAttributes ) {
		this.parseAttributes = parseAttributes;
		return this;
	}

	/**
	 * Adds the given attribute to the set of attributes to be parsed.
	 * 
	 * @param attribute
	 *            an attribute that should be parsed.
	 * @return this parser.
	 */
	public BulletParser parseAttribute( final Attribute attribute ) {
		parsedAttrs.add( attribute );
		return this;
	}

	/** Sets the callback for this parser, resetting at the same time all parsing flags.
	 * 
	 * @param callback the new callback.
	 * @return this parser.
	 */
	public BulletParser setCallback( final Callback callback ) {
        this.callback = callback;
        parseCDATA = parseText = parseAttributes = parseTags = false;
        parsedAttrs.clear();
        callback.configure( this );
        return this;
	}

	/** Returns the character corresponding to a given entity name.
	 *
	 * @param name the name of an entity.
	 * @return the character corresponding to the entity, or an ASCII NUL if no entity with that name was found.
	 */
	protected char entity2Char( final MutableString name ) {
		final Entity e = factory.getEntity( name );
		return e == null ? (char)0 : e.character;
	}

	/** Searches for the end of an entity.
	 * 
	 * <P>This method will search for the end of an entity starting at the given offset (the offset
	 * must correspond to the ampersand).
	 * 
	 * <P>Real-world HTML pages often contain hundreds of misplaced ampersands, due to the
	 * unfortunate idea of using the ampersand as query separator (<em>please</em> use the comma
	 * in new code!). All such ampersands should be specified as <samp>&amp;amp;</samp>. 
	 * If named entities are delimited using a transition
	 * from alphanumeric to non-alphanumeric characters, we can easily get false positives. If the parameter
	 * <code>loose</code> is false, named entities can be delimited only by whitespace or by a semicolon.
	 * 
	 * @param a a character array containing the entity.
	 * @param offset the offset at which the entity starts (the offset must point at the ampersand).
	 * @param length an upper bound to the maximum returned position.
	 * @param loose if true, named entities can be terminated by any non-alphanumeric character 
	 * (instead of whitespace or semicolon).
	 * @param entity a support mutable string used to query {@link ParsingFactory#getEntity(MutableString)}.
	 * @return the position of the first character after the entity, or -1 if no entity was found.
	 */
	protected int scanEntity( final char[] a, final int offset, final int length, final boolean loose, final MutableString entity ) {

		int i, c = 0;
		String tmpEntity;

		if ( length < 2 ) return -1;
		
		if ( a[ offset + 1 ] == '#' ) {
			if ( length > 2 && a[ offset + 2 ] == 'x' ) {
				for( i = 3; i < length && i < MAX_HEX_ENTITY_LENGTH && Character.digit( a[ i + offset ], HEXADECIMAL ) != -1; i++ );
				tmpEntity =  new String( a, offset + 3, i - 3 );
				if ( i != 3 ) c = Integer.parseInt( tmpEntity, HEXADECIMAL );
			}
			else {
				for( i = 2; i < length && i < MAX_DEC_ENTITY_LENGTH && Character.isDigit( a[ i + offset ] ); i++ );
				tmpEntity = new String( a, offset + 2, i - 2 );
				if ( i != 2 ) c = Integer.parseInt( tmpEntity );
			}
			
			if ( c > 0 && c < MAX_ENTITY_VALUE ) {
				lastEntity = (char)c;
				if ( i < length && a[ i + offset ] == ';' ) i++;
				return i + offset;
			}
		}
		else {
			if ( Character.isLetter( a[ offset + 1 ] ) ) {
				for( i = 2; i < length && Character.isLetterOrDigit( a[ offset + i ] ); i++ );
				if ( i != 1 && ( loose || ( i < length && ( Character.isWhitespace( a[ offset + i ] ) || a[ offset + i ] == ';' ) ) ) && ( lastEntity = entity2Char( entity.length( 0 ).append( a, offset + 1, i - 1 ) ) ) != 0 ) {
					if ( i < length && a[ i + offset ] == ';' ) i++;
					return i + offset;
				}
			}
		}

		return -1;
	}
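
	/*
	 * Illustrative sketch, not part of this class: the numeric branch of
	 * scanEntity() reduced to a standalone helper, assuming the same upper
	 * bound on values as MAX_ENTITY_VALUE. The names NumericEntitySketch,
	 * decode and MAX_VALUE are hypothetical. Unlike the method above, the
	 * sketch also accepts an uppercase X, reads at most six digits, and
	 * accumulates the value digit by digit instead of building a String.

```java
// Hypothetical standalone helper; it does not depend on dsiutils.
final class NumericEntitySketch {
    // Values must be strictly between 0 and this bound, as in scanEntity().
    static final int MAX_VALUE = 65535;

    // Decodes a numeric entity such as "&#65;" or "&#x41;" whose ampersand
    // is at a[offset]; length is the number of characters available from
    // offset. Returns 0 if no valid numeric entity starts there.
    static char decode(final char[] a, final int offset, final int length) {
        if (length < 3 || a[offset] != '&' || a[offset + 1] != '#') return 0;
        final boolean hex = a[offset + 2] == 'x' || a[offset + 2] == 'X';
        final int radix = hex ? 16 : 10;
        int i = hex ? 3 : 2;
        int value = 0;
        int digits = 0;
        // At most six digits are read, so the accumulator cannot overflow.
        while (i < length && digits < 6) {
            final int d = Character.digit(a[offset + i], radix);
            if (d == -1) break;
            value = value * radix + d;
            i++;
            digits++;
        }
        if (digits == 0 || value <= 0 || value >= MAX_VALUE) return 0;
        return (char) value;
    }

    public static void main(final String[] args) {
        final char[] dec = "&#65;".toCharArray();
        final char[] hex = "&#x41;".toCharArray();
        System.out.println(decode(dec, 0, dec.length)); // prints "A"
        System.out.println(decode(hex, 0, hex.length)); // prints "A"
    }
}
```

	 * Accumulating digits directly avoids the temporary String that the
	 * method above creates for Integer.parseInt(), at the cost of an
	 * explicit digit limit to rule out overflow.
	 */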

	/**
	 * Replaces entities with the corresponding characters.
	 * 
	 * <P>This method will modify the mutable string <code>s</code> so that all legal occurrences
	 * of entities are replaced by the corresponding character.
	 * 
	 * @param s a mutable string whose entities will be replaced by the corresponding characters.
	 * @param entity a support mutable string used by {@link #scanEntity(char[], int, int, boolean, MutableString)}.
	 * @param loose a parameter that will be passed to {@link #scanEntity(char[], int, int, boolean, MutableString)}.
	 */
	protected void replaceEntities( final MutableString s, final MutableString entity, final boolean loose ) {

		final char[] a = s.array();
		int length = s.length();

		/* We examine the string *backwards*, so that i is always a valid index. */

		int i = length, j;
		while( i-- > 0 )
			if ( a[ i ] == '&' && ( j = scanEntity( a, i, length - i, loose, entity ) ) != -1 ) 
				length = s.replace( i, j, lastEntity ).length();
	}
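
	/*
	 * Illustrative sketch, not part of this class: why replaceEntities()
	 * walks the string backwards. Each replacement shortens the string only
	 * at or after the current index, so the indices still to be visited
	 * stay valid, and a freshly produced '&' is never rescanned. The sketch
	 * uses a plain StringBuilder and handles a single entity; the names
	 * BackwardReplaceSketch, matchesAt and unescapeAmp are hypothetical.

```java
// Hypothetical standalone helper; it does not depend on dsiutils.
final class BackwardReplaceSketch {
    // Returns true if pattern occurs in cs starting exactly at index from.
    static boolean matchesAt(final CharSequence cs, final int from, final String pattern) {
        if (cs.length() - from < pattern.length()) return false;
        for (int k = 0; k < pattern.length(); k++)
            if (cs.charAt(from + k) != pattern.charAt(k)) return false;
        return true;
    }

    // Replaces every "&amp;" with "&", scanning from the end to the start.
    static String unescapeAmp(final String s) {
        final String entity = "&amp;";
        final StringBuilder sb = new StringBuilder(s);
        int i = sb.length();
        // i only decreases, so it is always a valid index even though the
        // builder shrinks whenever an entity is replaced.
        while (i-- > 0)
            if (sb.charAt(i) == '&' && matchesAt(sb, i, entity))
                sb.replace(i, i + entity.length(), "&");
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(unescapeAmp("a&amp;b&amp;c")); // prints "a&b&c"
        System.out.println(unescapeAmp("&amp;amp;"));     // prints "&amp;"
    }
}
```

	 * A forward scan would need to adjust its index and length after every
	 * replacement and to skip the character it just wrote; the backward
	 * scan gets both properties for free.
	 */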

	/** Handles markup.
	 * 
	 * @param text the text.
	 * @param pos the position of the <samp>!</samp> in the opening <samp>&lt;!</samp>.
	 * @param end the end of <code>text</code>.
	 * @return the position of the first character after the markup.
	 */
	
	protected int handleMarkup( final char[] text, int pos, final int end ) {
		// A markup instruction (doctype, comment, etc.).
		switch( text[ ++pos ] ) {
		case 'D':
		case 'd':
			// DOCTYPE
			while(  pos < end && text[ pos++ ] != '>' );
			break;

		case '-':
			// comment
			if ( ( pos = CLOSED_COMMENT.search( text, pos, end ) ) == -1 ) pos = end;
			else pos += CLOSED_COMMENT.length();
			break;
		
		default:
			if ( pos < end - 6 && 
					text[ pos ] == '[' && text[ pos + 1 ] == 'C' && text[ pos + 2 ] == 'D' && text[ pos + 3 ] == 'A' && text[ pos + 4 ] == 'T' && text[ pos + 5 ] == 'A' && text[ pos + 6 ] == '[' ) {
				// CDATA section
				final int last = CLOSED_CDATA.search( text, pos, end );
				if ( parseCDATA ) callback.cdata( null, text, pos + 7, ( last == -1 ? end : last ) - pos - 7 );
				pos = last == -1 ? end : last + CLOSED_CDATA.length();
			}
			//  Generic markup
			else while( pos < end && text[ pos++ ] != '>' );
			break;
		}

		return pos;
	}
	
	/** Handles processing instructions, ASP tags, etc.
	 * 
	 * @param text the text.
	 * @param pos the position of the <samp>%</samp> or <samp>?</samp> in the opening <samp>&lt;%</samp> or <samp>&lt;?</samp>.
	 * @param end the end of <code>text</code>.
	 * @return the position of the first character after the processing instruction.
	 */
	
	protected int handleProcessingInstruction( final char[] text, int pos, final int end ) {

		switch( text[ ++pos  ] ) {
		case '%':
			if ( ( pos = CLOSED_PERCENT.search( text, pos, end ) ) == -1 ) pos = end;
			else pos += CLOSED_PERCENT.length();
			break;
			
		case '?':
			if ( ( pos = CLOSED_PIC.search( text, pos, end ) ) == -1 ) pos = end;
			else pos += CLOSED_PIC.length();
			break;
		case '[':
			if ( ( pos = CLOSED_SECTION.search( text, pos, end ) ) == -1 ) pos = end;
			else pos += CLOSED_SECTION.length();
			break;
		default:
			//  Generic markup
			while( pos < end && text[ pos++ ] != '>' );
			break;
		}
		return pos;
	}

	
	/**
	 * Analyze the text document to extract information.
	 * 
	 * @param text a <code>char</code> array of text to be parsed.
	 */
	public void parse( final char[] text ) {
		parse( text, 0, text.length );
	}
		
	/**
	 * Analyze the text document to extract information.
	 * 
	 * @param text a <code>char</code> array of text to be parsed.
	 * @param offset the offset in the array from which the parsing will begin.
	 * @param length the number of characters to be parsed.
	 */
	public void parse( final char[] text, final int offset, final int length ) {
		MutableString tagElemTypeName = new MutableString(); 
		MutableString attrName = new MutableString(); 
		MutableString attrValue = new MutableString(); 
		MutableString entity = new MutableString();
		MutableString characters = new MutableString();

		/* During the analysis of attributes we need a delimiter for values */
		char delim;
		/* The current character */
		char currChar;
		/* The state of the switch */
		int state;
		/* Other integer values used in the parsing process */
		int start, k;
		/* Whether a flow-breaking element was met since the last text chunk, and whether the current attribute must be parsed */
		boolean flowBroken = false, parseCurrAttr;
		
		/* The current element. */
		Element currentElement;
		/* The current attribute object */
		Attribute currAttr = null; 
		attrMap = new Reference2ObjectArrayMap<Attribute,MutableString>( 16 );
	
		callback.startDocument();

		tagElemTypeName.length( 0 ); 
		attrName.length( 0 ); 
		attrValue.length( 0 ); 
		entity.length( 0 ); 

		state = STATE_TEXT;
		currentElement = null;
		final int end = offset + length;
		int pos = offset;
		
		/* This is the main loop. */
		while ( pos < end ) {
			
			switch( state ) {
			case STATE_TEXT:
				currChar = text[ pos ];
				if ( currChar == '&' ) {
					
					// We handle both the case of an entity, and that of a stray '&'.
					if ( ( k = scanEntity( text, pos, end - pos, true, entity ) ) == -1 ) {
						currChar = '&';
						pos++;
					}
					else {
						currChar = lastEntity;
						pos = k;
						if ( DEBUG ) System.err.println( "Entity at: " + pos + " end of entity: " + k + " entity: " + entity + " char: " + currChar );
					}
					if ( parseText ) characters.append( currChar );
					continue;
				}
				
				// No tags can happen later than end - 2.
				if ( currChar != '<' || pos >= end - 2 ) {
					if ( parseText ) characters.append( currChar );
					pos++;
					continue;
				}
				
				switch( text[ ++pos ] ) {
				case '!':
					pos = handleMarkup( text, pos, end );
					break;

				case '%':
				case '?':
					pos = handleProcessingInstruction( text, pos, end );
					break;

				default:
					// Actually a tag. Note that we allow for </> and that we skip false positives
					// due to sloppy HTML writing (e.g., "<-- hello! -->" ).
					if ( Character.isLetter( text[ pos ] ) ) state = STATE_BEFORE_START_TAG_NAME;
					else if ( text[ pos ] == '/' && ( Character.isLetter( text[ pos + 1 ] ) || text[ pos + 1 ] == '>' ) ) {
						state = STATE_BEFORE_END_TAG_NAME;
						pos++;
					}
					else {
						// Not really a tag.
						if ( parseText ) characters.append( '<' );
						continue;
					}
					break;
				}
				if ( parseText && characters.length() != 0 ) {
					callback.characters( characters.array(), 0, characters.length(), flowBroken );
					characters.length( 0 );
				}

				flowBroken = false;				
				break;

			case STATE_BEFORE_START_TAG_NAME:
			case STATE_BEFORE_END_TAG_NAME:
				// Let's get the name.
				tagElemTypeName.length( 0 );
				for( start = pos; pos < end && ( Character.isLetterOrDigit( text[ pos ] ) || text[ pos ] == ':' || text[ pos ] == '_' ||text[ pos ] == '-' || text[ pos ] == '.' ); pos++ );
				
				tagElemTypeName.append( text, start, pos - start );
				tagElemTypeName.toLowerCase();
				
				currentElement = factory.getElement( tagElemTypeName );
				if ( DEBUG ) System.err.println( ( state == STATE_BEFORE_START_TAG_NAME ? "Opening" : "Closing" ) + " tag for " + tagElemTypeName + " (element: " + currentElement+ ")" );
				
				if ( currentElement != null && currentElement.breaksFlow ) flowBroken = true;
				while( pos < end && Character.isWhitespace( text[ pos ] ) ) pos++;
				state = state == STATE_BEFORE_START_TAG_NAME ? STATE_IN_START_TAG : STATE_IN_END_TAG;
				break;
				
			case STATE_IN_START_TAG:
				currChar = text[ pos ];
				if ( currChar != '>' && ( currChar != '/' || pos == end - 1 || text[ pos + 1 ] != '>' ) ) {
					// We got attributes.
					if ( Character.isLetter( currChar ) ) {
						parseCurrAttr = false;
						attrName.length( 0 );
						for( start = pos; pos < end && ( Character.isLetter( text[ pos ] ) || text[ pos ] == '-' ); pos++ );
						if ( currentElement != null && parseAttributes ) { 
							attrName.append( text, start, pos - start );
							attrName.toLowerCase();
							if ( DEBUG ) System.err.println( "Got attribute named \"" + attrName + "\"" );
							currAttr = factory.getAttribute( attrName );
							parseCurrAttr = parsedAttrs.contains( currAttr );
						}
						// Skip whitespace
						while ( pos < end && Character.isWhitespace( text[ pos ] ) ) pos++;
						if ( pos == end ) break;
						if ( text[ pos ] != '=' ) {
							// We found an attribute without explicit value.
							// TODO: can we avoid another string?
							if ( parseCurrAttr ) attrMap.put( currAttr, new MutableString( currAttr.name ) );
							break;
						}
						
						pos++;
						while ( pos < end && Character.isWhitespace( text[ pos ] ) ) pos++;
						if ( pos == end ) break;
						
						attrValue.length( 0 );
						if ( pos < end && ( ( delim = text[ pos ] ) == '"' || ( delim = text[ pos ] ) == '\'' ) ) {
							// An attribute value with delimiters.
							for( start = ++pos; pos < end && text[ pos ] != delim; pos++ );
							if ( parseCurrAttr ) attrValue.append( text, start, pos - start ).replace( NONSPACE_WHITESPACE, SPACE );
							if ( pos < end ) pos++;
						}
						else {
							// An attribute value without delimiters. Due to very common errors, we 
							// gather characters up to the first occurrence of whitespace or '>'.
							for( start = pos; pos < end && !Character.isWhitespace( text[ pos ] ) && text[ pos ] != '>'; pos++ ); 
							if ( parseCurrAttr ) attrValue.append( text, start, pos - start );
						}

						if ( parseCurrAttr ) {
							replaceEntities( attrValue, entity, false );
							attrMap.put( currAttr, attrValue.copy() );
							if ( DEBUG ) System.err.println( "Attribute value: \"" + attrValue + "\"" );
						}
						// Skip whitespace
						while ( pos < end && Character.isWhitespace( text[ pos ] ) ) pos++;
					}
					else {
						// It's a mess. Our only reasonable chance is to try to resync on the first
						// whitespace, or alternatively to get to the end of the tag.
						do pos++; while ( pos < end && text[ pos ] != '>' && ! Character.isWhitespace( text[ pos ] ) );
						// Skip whitespace
						while ( pos < end && Character.isWhitespace( text[ pos ] ) ) pos++;
						continue;
					}
				}
				else {
					if ( parseTags && ! callback.startElement( currentElement, attrMap ) ) break;
					if ( attrMap != null ) attrMap.clear();
					
					if ( currentElement == Element.SCRIPT || currentElement == Element.STYLE ) {
						final TextPattern pattern = currentElement == Element.SCRIPT ? SCRIPT_CLOSE_TAG_PATTERN : STYLE_CLOSE_TAG_PATTERN; 
						start = pos + 1;
						pos = pattern.search( text, start, end );
						if ( pos == -1 ) pos = end;
						// The class comment documents the CDATA flag as governing this callback.
						if ( parseCDATA ) callback.cdata( currentElement, text, start, pos - start );
						if ( pos < end ) {
							if ( parseTags ) callback.endElement( currentElement );
							pos += pattern.length();
						}
					}
					else pos += currChar == '/' ? 2 : 1;
					state = STATE_TEXT;
				}
				break;
				
			case STATE_IN_END_TAG:
				while ( pos < end && text[ pos ] != '>' ) pos++;
				if ( parseTags && currentElement != null && ! callback.endElement( currentElement ) ) break;
				state = STATE_TEXT;
				pos++;
				break;
				
			default:
			}
			
		}

		// We do what we can to invoke tag handlers in case of a truncated text.
		if ( state == STATE_IN_START_TAG && parseTags && currentElement != null ) callback.startElement( currentElement, attrMap );
		if ( state == STATE_IN_END_TAG && parseTags && currentElement != null ) callback.endElement( currentElement );
		
		if ( state == STATE_TEXT && parseText && characters.length() > 0 ) 
			callback.characters( characters.array(), 0, characters.length(), flowBroken );
		
		callback.endDocument();
	}
}
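
/*
 * Illustrative sketch, not part of the library: the configuration protocol
 * described in the class comment (setCallback() resets all flags, then the
 * callback requests what it needs) in miniature. MiniParser, MiniCallback
 * and compose are hypothetical stand-ins and say nothing about how
 * ComposedCallbackBuilder is actually implemented; the sketch only shows
 * why composed callbacks must add features and never remove them.

```java
// Hypothetical miniature of the protocol; none of these types exist in dsiutils.
final class FeatureConfigSketch {
    interface MiniCallback {
        void configure(MiniParser parser);
    }

    static final class MiniParser {
        boolean parseText, parseTags;

        MiniParser parseText(final boolean b) { parseText = b; return this; }
        MiniParser parseTags(final boolean b) { parseTags = b; return this; }

        // Resets every flag, then lets the callback request what it needs.
        MiniParser setCallback(final MiniCallback callback) {
            parseText = parseTags = false;
            callback.configure(this);
            return this;
        }
    }

    // Runs each configure() in turn. The result is the union of all
    // requests only if no callback ever switches a feature off.
    static MiniCallback compose(final MiniCallback... callbacks) {
        return parser -> {
            for (final MiniCallback c : callbacks) c.configure(parser);
        };
    }

    public static void main(final String[] args) {
        final MiniCallback textExtractor = p -> p.parseText(true);
        final MiniCallback tagCounter = p -> p.parseTags(true);
        final MiniParser parser = new MiniParser().setCallback(compose(textExtractor, tagCounter));
        System.out.println(parser.parseText + " " + parser.parseTags); // prints "true true"
    }
}
```

 * If tagCounter also called parseText(false), the outcome would depend on
 * the order of composition and textExtractor could silently stop
 * receiving text, which is the situation the add-only rule prevents.
 */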