// Copyright 2006 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package com.google.enterprise.connector.spi;

/**
 * Interface for implementing query-based traversal.
 * <p>
 * Query-based traversal is a scheme whereby a repository is traversed according
 * to a query that visits each document in a natural order that is efficiently
 * supported by the underlying repository and can be easily checkpointed and
 * restarted.
 * <p>
 * A good use case is a repository that supports access to documents in
 * last-modified-date order. In particular, suppose a repository supports a
 * query analogous to the following SQL query (the repository need not support
 * SQL; SQL is used here only as an example):
 * <pre>
 * select documentid, lastmodifydate from documents
 *   where lastmodifydate &gt; <b><i>date-constant</i></b>
 *   order by lastmodifydate
 * </pre>
 *
 * <p>
 * Such a repository can easily be traversed by lastmodifydate, and the state of
 * the traversal is easily encapsulated in a single, small data item: the date
 * of the last document processed. Increasing last-modified-date order is
 * convenient because if a document is processed during traversal but later
 * modified, it will be picked up again later in the traversal process. Thus,
 * this traversal is appropriate both for initial load and for incremental
 * update.
 * <p>
 * For such a repository, the implementor is urged to let the Connector Manager
 * (the caller) maintain the traversal state.
 * This is achieved by implementing the interface methods as follows:
 * <ul>
 * <li>{@code startTraversal()} Run a query that starts from the
 * beginning, such as
 * <pre>
 * select documentid, lastmodifydate from documents order by lastmodifydate
 * </pre></li>
 * <li>{@code resumeTraversal(String checkpoint)} Run a query that
 * resumes traversal from the supplied checkpoint</li>
 * </ul>
 * Checkpoints are supplied by the
 * {@link DocumentList#checkpoint()} method.
 * <p>
 * Please observe that the Connector Manager (the caller) makes no guarantee
 * to consume the entire {@code DocumentList} returned by either the
 * {@code startTraversal} or {@code resumeTraversal} calls.
 * The Connector Manager will consume as many as it chooses, depending on load,
 * schedule and other factors. The Connector Manager guarantees to call
 * {@code checkpoint} after handling the last document it has
 * successfully processed from the {@code DocumentList} it was using.
 * Thus, the implementor is free to use a query that only returns a small
 * number of results, if that gets better performance.
 * <p>
 * For example, to continue the SQL analogy, a query like this could be used:
 * <pre>
 * select TOP 10 documentid, lastmodifydate from documents ...
 * </pre>
 *
 * <p>
 * The {@code setBatchHint} method is provided so that the Connector
 * Manager can tell the implementation that it only wants that many results per
 * call. This is a hint - the implementation need not observe it. The
 * implementation is free to return a DocumentList with fewer or more
 * results. For example, the traversal may be completely up to date, so perhaps
 * there are no results to return. Or, for internal reasons, the implementation
 * may not want to return the full batchHint number of results. When returning
 * more results than the hint, some or all of the extra documents may be
 * ignored.
 * <p>
 * The Connector Manager makes a distinction between the return of a
 * {@code null} DocumentList and an empty DocumentList (a DocumentList with
 * zero entries). Returning a {@code null} DocumentList will have an impact on
 * scheduling - the Connector Manager may choose to wait longer after receiving
 * a {@code null} result before it calls again. Also, if a {@code null} result
 * is returned, the Connector Manager will not (indeed, cannot) call
 * {@code checkpoint} before calling start or resume traversal again. Returning
 * a {@code null} DocumentList is suitable when a traversal is completely up to
 * date, with no new documents available and no new checkpoint state.
 * <p>
 * Returning an empty DocumentList will probably not have an impact on
 * scheduling. The Connector Manager will call {@code checkpoint},
 * and will likely call {@code resumeTraversal} again immediately.
 * Returning an empty DocumentList is not appropriate if a traversal is
 * completely up to date, as it would effectively induce a spin, constantly
 * calling {@code resumeTraversal} when it has no work to do.
 * Returning an empty DocumentList is a convenient way to indicate to the
 * Connector Manager that, although no documents were provided in this
 * batch, the Connector wishes to continue searching the repository for
 * suitable content. The call to {@code checkpoint} allows the
 * Connector to record its progress through the repository. This mechanism
 * is suitable for cases when the search for suitable content may exceed
 * the Connector Manager's timeout.
 * <p>
 * If the Connector returns a non-{@code null} {@code DocumentList}, even
 * one with zero entries, the Connector Manager will nearly always call
 * {@code checkpoint} when it has finished processing the DocumentList.
 * <p>
 * An implementation need not let the Connector Manager store the traversal
 * state; it may choose to store the state itself.
 * Implementors are discouraged
 * from using this technique unless necessary, because it makes transactionality
 * more difficult and it introduces resource dependencies of which the Connector
 * Manager is unaware. However, there may be repositories which have a natural
 * traversal order, but the state of this traversal is not easily expressed in
 * a small data item. For example, a repository may consist of a large number of
 * named sub-repositories, each of which can be traversed in modify date order,
 * but for which there is no convenient way of traversing them all in one query.
 * In this case, the implementation may choose to maintain state itself, as a
 * table of pairs: (sub-repository-name, per-repository-date-stamp). In such a
 * case, the implementor may implement the interface methods as follows:
 * <ul>
 * <li>{@code startTraversal()} Clear the internal state. Return the
 * first few documents</li>
 * <li>{@code resumeTraversal(String checkpoint)} Resume traversal
 * according to the internal state of the implementation. The Connector Manager
 * will pass in whatever checkpoint String was returned by the last call to
 * {@link DocumentList#checkpoint()} but the implementation is free to ignore
 * this and use its internal state. However, even in this case,
 * {@code checkpoint} must not return a {@code null} String.</li>
 * </ul>
 * The implementation must be careful about when and how it commits its internal
 * state to external storage. Remember again that the Connector Manager makes no
 * guarantee to consume the entire result set returned by a traversal call. If
 * the Connector Manager does not call checkpoint, the implementation should not
 * assume that the documents returned by {@link DocumentList#nextDocument} have
 * been processed. The implementation should wait until the checkpoint call, and
 * only commit the state up to the last document returned.
 * <p>
 * <strong>Note on "Metadata and URL" feeds vs.
 * Content feeds:</strong>
 * <p>
 * Some repositories are fully web-enabled but are difficult or impossible for
 * the Search Appliance to crawl, because they make heavy use of ASP or JSP, or
 * they have a metadata model that is not conveniently accessible with the
 * content in a single page. Such repositories are good candidates for
 * connectors. However, a developer may choose not to implement authentication
 * and authorization through a connector. It may be sufficient to use standard
 * web mechanisms for these tasks.
 * <p>
 * The developer can achieve this by following these steps. In the document list
 * returned by the traversal methods, specify the
 * {@link SpiConstants#PROPNAME_SEARCHURL}
 * property. The value should be a URL. If this property is specified, the
 * Connector Manager will use a "URL Feed" rather than a "Content Feed" for
 * that document. In this case, the implementor should <strong>not</strong>
 * supply the content of the document. The Search Appliance will fetch the
 * content from the specified URL. Also, this URL will be used to trigger
 * normal authentication and authorization for that document. For more details,
 * see the documentation on Metadata and URL Feeds.
 * <p>
 * <strong>Note on Documents returned by traversal calls:</strong>
 * <p>
 * The {@code Document} objects returned by the queries defined here
 * must contain special properties according to the following rules:
 * <ul>
 * <li> {@link SpiConstants#PROPNAME_DOCID} This property must be present.</li>
 * <li> {@link SpiConstants#PROPNAME_SEARCHURL} If present, this means that the
 * Connector Manager will generate a Metadata and URL feed, with the specified
 * URL. If this is present, then the {@link SpiConstants#PROPNAME_CONTENT}
 * property should <strong>not</strong> be.</li>
 * <li> {@link SpiConstants#PROPNAME_CONTENT} This property should hold the
 * content of the document.
 * If present, the connector framework will base-64
 * encode the value and present it to the Search Appliance as the primary
 * content to be indexed. If this is present, then the
 * {@link SpiConstants#PROPNAME_SEARCHURL} property should <strong>not</strong>
 * be.</li>
 * <li> {@link SpiConstants#PROPNAME_DISPLAYURL} If present, this will be used
 * as the primary link on a results page. This should <strong>not</strong>
 * be used with {@link SpiConstants#PROPNAME_SEARCHURL}.</li>
 * </ul>
 *
 * @since 1.0
 */
public interface TraversalManager {

  /**
   * Starts (or restarts) traversal from the beginning. This action will return
   * objects starting from the very oldest, or with the smallest IDs, or
   * whatever natural order the implementation prefers. The caller may consume
   * as many or as few of the results as it wants, but it guarantees to call
   * {@link DocumentList#checkpoint()} after handling the last object
   * it has successfully processed.
   *
   * @return A DocumentList of documents from the repository in natural order,
   *         or {@code null} if there are no documents.
   * @throws RepositoryException if the repository is unreachable or a similar
   *         exceptional condition occurs.
   */
  public DocumentList startTraversal() throws RepositoryException;

  /**
   * Continues traversal from a supplied checkpoint. The checkPoint parameter
   * will have been created by a call to the
   * {@link DocumentList#checkpoint()} method. The
   * DocumentList object returns objects from the repository in natural order
   * starting just after the document that was used to create the checkpoint
   * string.
   *
   * @param checkPoint String that indicates from where to resume traversal.
   * @return DocumentList object that returns documents starting just after the
   *         checkpoint, or {@code null} if there are no documents.
   * @throws RepositoryException if the repository is unreachable or a similar
   *         exceptional condition occurs.
   */
  public DocumentList resumeTraversal(String checkPoint)
      throws RepositoryException;

  /**
   * Sets the preferred batch size.
   * The caller advises the implementation that
   * the result sets returned by startTraversal or resumeTraversal should be
   * as close to this number as is reasonable. The implementation may ignore
   * this call or do its best to return approximately this number.
   *
   * @param batchHint the preferred number of documents to return per batch.
   * @throws RepositoryException if the repository is unreachable or a similar
   *         exceptional condition occurs.
   */
  public void setBatchHint(int batchHint) throws RepositoryException;
}