/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * <h1>Convert XML Documents to Avro, and Back, through XML Schema</h1>
 *
 * <p>
 * {@link org.apache.avro.xml.XmlDatumWriter} generates an Avro
 * {@link org.apache.avro.Schema} from one or more XML Schemas, and
 * will write XML Documents into Avro format using that Avro schema.
 * </p>
 *
 * <p>
 * {@link org.apache.avro.xml.XmlDatumReader} will read Avro data using an Avro
 * schema generated by <code>XmlDatumWriter</code>, and use it to reconstruct
 * the original XML document. Conversion from XML to Avro is lossy (more
 * details below), and the Avro schema generated by <code>XmlDatumWriter</code>
 * contains the locations of the XML Schemas used to generate it.
 * </p>
 *
 * <p>
 * {@link org.apache.avro.xml.XmlDatumConfig} is used to configure
 * <code>XmlDatumWriter</code>. The {@link java.net.URL}s and
 * {@link java.io.File}s containing XML Schemas are defined there,
 * as well as the root node in the XML Schema to use to generate
 * the corresponding Avro <code>Schema</code>.
 * </p>
 *
 * <h2>Avro Schema Generation</h2>
 *
 * <p>
 * The following describes how an Avro Schema will be generated from an XML
 * Schema.
 * </p>
 *
 * <h3>XML Elements Map to Avro Records</h3>
 *
 * <p>
 * XML elements are represented as Avro records. Each of the element's
 * attributes is stored as a field in the record. The element's content is
 * stored as a field named after the element. If the element has simple
 * content, that content will be stored directly. If the element has child
 * elements, they are stored as an array whose items are a union of the
 * child element types.
 * </p>
 *
 * <p>
 * The content of empty mixed elements will be stored as a string, while the
 * content of non-empty mixed elements will be an array of union of the child
 * element types, along with string.
 * </p>
 *
 * <p>
 * <b>Note:</b> Unlike XML attributes, Avro fields do not have their own
 * namespace. This means that two attributes with the same name but different
 * namespaces cannot co-exist in the same Avro record, and an error will be
 * thrown when the element's record is generated.
 * </p>
 * <p>
 * In addition, because the children of the element are stored in a field
 * named after the element, no attribute of the element can have the same
 * name as the element itself.
 * </p>
 *
 * <h3>XML Simple Type Mapping to Avro Types</h3>
 *
 * <p>
 * The following table maps XML Schema simple types to their Avro
 * counterparts. Any type derived from one of these XML Schema simple types
 * will be represented using the same Avro type.
 * </p>
 *
 * <table border="1">
 * <thead>
 * <tr>
 * <th>XML Schema Type</th>
 * <th>Avro Schema Type</th>
 * <th>Logical Type / Record Structure</th>
 * </tr>
 * </thead>
 * <tbody>
 * <tr>
 * <td><code>boolean</code></td>
 * <td>{@link org.apache.avro.Schema.Type#BOOLEAN}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>decimal</code></td>
 * <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
 * <td>Logical Type <code>decimal</code></td>
 * </tr>
 * <tr>
 * <td><code>double</code></td>
 * <td>{@link org.apache.avro.Schema.Type#DOUBLE}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>float</code></td>
 * <td>{@link org.apache.avro.Schema.Type#FLOAT}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>base64Binary</code></td>
 * <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>hexBinary</code></td>
 * <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>long</code></td>
 * <td>{@link org.apache.avro.Schema.Type#LONG}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>unsignedInt</code></td>
 * <td>{@link org.apache.avro.Schema.Type#LONG}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>int</code></td>
 * <td>{@link org.apache.avro.Schema.Type#INT}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>unsignedShort</code></td>
 * <td>{@link org.apache.avro.Schema.Type#INT}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>QName</code></td>
 * <td>{@link org.apache.avro.Schema.Type#RECORD}</td>
 * <td>
 * <table border="1">
 * <thead>
 * <tr>
 * <th>Field</th>
 * <th>Type</th>
 * <th>Value</th>
 * </tr>
 * </thead>
 * <tbody>
 * <tr>
 * <td>namespace</td>
 * <td><code>string</code></td>
 * <td>The <code>QName</code>'s namespace.</td>
 * </tr>
 * <tr>
 * <td>localPart</td>
 * <td><code>string</code></td>
 * <td>The <code>QName</code>'s local name.</td>
 * </tr>
 * </tbody>
 * </table>
 * </td>
 * </tr>
 * <tr>
 * <td><code>list</code></td>
 * <td>{@link org.apache.avro.Schema.Type#ARRAY}</td>
 * <td />
 * </tr>
 * <tr>
 * <td><code>union</code></td>
 * <td>{@link org.apache.avro.Schema.Type#UNION}</td>
 * <td />
 * </tr>
 * </tbody>
 * </table>
 *
 * <h4><code>decimal</code></h4>
 *
 * <p>
 * The <code>totalDigits</code> and <code>fractionDigits</code> facets will be
 * used to define the <code>decimal</code>'s precision and scale, respectively.
 * If not defined, the default precision is 34 (following the IEEE 754R
 * Decimal128 format), and the default scale is 8.
 * </p>
 *
 * <h4><code>Enums</code></h4>
 *
 * <p>
 * If all of the <code>enumeration</code> facet values can be represented as
 * symbols of an Avro {@link org.apache.avro.Schema.Type#ENUM}, an Avro enum
 * will be used. Otherwise, the original type will be used instead.
 * </p>
 *
 * <h3>Avro Map Generation</h3>
 *
 * <p>
 * If an element has exactly one non-optional attribute of type
 * <code>ID</code>, an Avro {@link org.apache.avro.Schema.Type#MAP} will be
 * generated for that element and its direct siblings.
 * </p>
 *
 * <p>
 * If multiple differently-named children of the same element can be
 * represented as maps, an Avro map of union of those elements will be
 * generated instead. However, only elements of the same name and type
 * will exist in the same map instance.
 * </p>
 *
 * <p>
 * XML Elements will not be re-ordered in the Avro document, so if elements of
 * the same name and type are not direct siblings, they will not co-exist in
 * the same map. Separate maps will be generated instead.
 * Consider the following:
 * </p>
 *
 * <pre>
 * &lt;!-- In XML Schema --&gt;
 * &lt;element name="map"&gt;
 *   &lt;complexType&gt;
 *     &lt;simpleContent&gt;
 *       &lt;extension base="string"&gt;
 *         &lt;attribute name="id" type="ID" use="required" /&gt;
 *       &lt;/extension&gt;
 *     &lt;/simpleContent&gt;
 *   &lt;/complexType&gt;
 * &lt;/element&gt;
 * &lt;element name="record" type="string" /&gt;
 *
 * &lt;!-- In XML Document --&gt;
 * &lt;map id="id1"&gt;This is the first record in a map.&lt;/map&gt;
 * &lt;map id="id2"&gt;This is the second record in the same map.&lt;/map&gt;
 * &lt;record&gt;This ends the previous map.&lt;/record&gt;
 * &lt;map id="id3"&gt;This is the start of a new map.&lt;/map&gt;
 * </pre>
 *
 * <h3>Wildcard Elements and Attributes</h3>
 *
 * <p>
 * Wildcard elements (<code>&lt;any&gt;</code>) and attributes
 * (<code>&lt;anyAttribute&gt;</code>) have no equivalent concept in
 * Avro, and are therefore skipped. Elements and attributes matched by
 * wildcards in the XML document will not appear in the Avro document.
 * </p>
 *
 * <h3>Optional Attributes and Nillable Elements</h3>
 *
 * <p>
 * Optional attributes and nillable elements will be represented as a
 * union of null and the simple type, as per Avro's handling of optional
 * values. If the element or attribute was already a union, the null
 * type will be added to that union.
 * </p>
 *
 * <h2>Generating an Avro Document From XML</h2>
 *
 * <p>
 * {@link org.apache.avro.xml.XmlDatumWriter} will generate an Avro schema
 * from one or more XML Schemas using the above specification, and write
 * an XML {@link org.w3c.dom.Document} to an Avro
 * {@link org.apache.avro.io.Encoder} accordingly. The generated Avro
 * <code>Schema</code> can be retrieved from
 * {@link org.apache.avro.xml.XmlDatumWriter#getSchema()} before encoding the
 * first XML <code>Document</code>.
 * </p>
 *
 * <p>
 * A {@link org.apache.avro.xml.XmlDatumConfig} is required to set up the
 * <code>XmlDatumWriter</code>. This is used to indicate where to read the
 * XML Schemas from, and also to define the root element in the corresponding
 * XML Documents. (XML Schemas do not have a way to indicate what their root
 * element is.)
 * </p>
 *
 * <p>
 * <code>XmlDatumWriter</code> will encode the
 * {@link org.apache.avro.xml.XmlDatumConfig} in the resulting Avro
 * <code>Schema</code>, allowing <code>XmlDatumReader</code> to reconstruct
 * the XML <code>Document</code> as best it can. (Wildcard elements and
 * attributes are lost, and will not reappear in regenerated XML Documents.)
 * </p>
 *
 * <h2>Generating an XML Document From Avro</h2>
 *
 * <p>
 * {@link org.apache.avro.xml.XmlDatumReader} will construct an XML
 * {@link org.w3c.dom.Document} from an Avro schema generated by
 * <code>XmlDatumWriter</code> and a {@link org.apache.avro.io.Decoder}.
 * The <code>XmlDatumWriter</code>'s generated {@link org.apache.avro.Schema}
 * is required as it contains information on how to retrieve the corresponding
 * XML Schemas.
 * </p>
 *
 * <p>
 * However, the resulting document will not be precisely reconstructed.
 * Wildcard elements and attributes are not encoded in Avro, and therefore
 * cannot be reconstructed. In addition, namespace prefixes will not match
 * the originals, as they are also not encoded in Avro. The new prefixes
 * will still map namespaces and scopes correctly.
 * </p>
 */
package org.apache.avro.xml;