/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* <h1>Convert XML Documents to Avro, and Back, through XML Schema</h1>
*
* <p>
* {@link org.apache.avro.xml.XmlDatumWriter} generates an Avro
* {@link org.apache.avro.Schema} from one or more XML Schemas, and
* will write XML Documents into Avro format using that Avro schema.
* </p>
*
* <p>
* {@link org.apache.avro.xml.XmlDatumReader} will read Avro data using an Avro
* schema generated by <code>XmlDatumWriter</code>, and use it to reconstruct
* the original XML document. The Avro schema generated by
* <code>XmlDatumWriter</code> contains the locations of the XML Schemas used
* to generate it. Conversion from XML to Avro is lossy (more details below),
* so the reconstructed document may differ from the original.
* </p>
*
* <p>
* {@link org.apache.avro.xml.XmlDatumConfig} is used to configure
* <code>XmlDatumWriter</code>. The {@link java.net.URL}s and
* {@link java.io.File}s containing XML Schemas are defined there,
* as well as the root node in the XML Schema to use to generate
* the corresponding Avro <code>Schema</code>.
* </p>
*
* <h2>Avro Schema Generation</h2>
*
* <p>
* The following describes how an Avro Schema will be generated from an XML
* Schema.
* </p>
*
* <h3>XML Elements Map to Avro Records</h3>
*
* <p>
* XML elements are represented as Avro records. Each of the element's
* attributes is stored as a field in the record. The element's content is
* stored as a field named after the element. If the element has simple
* content, that content is stored directly. If the element has child
* elements, they are stored as an array whose items are a union of the
* child element types.
* </p>
*
* <p>
* The content of empty mixed elements will be stored as a string, while the
* content of non-empty mixed elements will be an array whose items are a
* union of the child element types and string.
* </p>
*
* <p>
* <b>Note:</b> Unlike XML attributes, Avro fields do not have their own
* namespace. This means that two attributes with the same name but different
* namespaces cannot co-exist in the same Avro record, and an error will be
* thrown when the element's record is generated.
* </p>
* <p>
* In addition, because the children of the element are stored in a field under
* the element's name, no attribute in the element can have the same name as
* the element itself.
* </p>
*
* <h3>XML Simple Type Mapping to Avro Types</h3>
*
* <p>
* The following table maps XML Schema simple types to their Avro
* counterparts. Types derived from these XML Schema simple types are
* represented using the same Avro type as their base type.
* </p>
*
* <table border="1">
* <thead>
* <tr>
* <th>XML Schema Type</th>
* <th>Avro Schema Type</th>
* <th>Logical Type / Record Structure</th>
* </tr>
* </thead>
* <tbody>
* <tr>
* <td><code>boolean</code></td>
* <td>{@link org.apache.avro.Schema.Type#BOOLEAN}</td>
* <td />
* </tr>
* <tr>
* <td><code>decimal</code></td>
* <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
* <td>Logical Type <code>decimal</code></td>
* </tr>
* <tr>
* <td><code>double</code></td>
* <td>{@link org.apache.avro.Schema.Type#DOUBLE}</td>
* <td />
* </tr>
* <tr>
* <td><code>float</code></td>
* <td>{@link org.apache.avro.Schema.Type#FLOAT}</td>
* <td />
* </tr>
* <tr>
* <td><code>base64Binary</code></td>
* <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
* <td />
* </tr>
* <tr>
* <td><code>hexBinary</code></td>
* <td>{@link org.apache.avro.Schema.Type#BYTES}</td>
* <td />
* </tr>
* <tr>
* <td><code>long</code></td>
* <td>{@link org.apache.avro.Schema.Type#LONG}</td>
* <td />
* </tr>
* <tr>
* <td><code>unsignedInt</code></td>
* <td>{@link org.apache.avro.Schema.Type#LONG}</td>
* <td />
* </tr>
* <tr>
* <td><code>int</code></td>
* <td>{@link org.apache.avro.Schema.Type#INT}</td>
* <td />
* </tr>
* <tr>
* <td><code>unsignedShort</code></td>
* <td>{@link org.apache.avro.Schema.Type#INT}</td>
* <td />
* </tr>
* <tr>
* <td><code>QName</code></td>
* <td>{@link org.apache.avro.Schema.Type#RECORD}</td>
* <td>
* <table border="1">
* <thead>
* <tr>
* <th>Field</th>
* <th>Type</th>
* <th>Value</th>
* </tr>
* </thead>
* <tbody>
* <tr>
* <td>namespace</td>
* <td><code>string</code></td>
* <td>The <code>QName</code>'s namespace URI.</td>
* </tr>
* <tr>
* <td>localPart</td>
* <td><code>string</code></td>
* <td>The <code>QName</code>'s local name.</td>
* </tr>
* </tbody>
* </table>
* </td>
* </tr>
* <tr>
* <td><code>list</code></td>
* <td>{@link org.apache.avro.Schema.Type#ARRAY}</td>
* <td />
* </tr>
* <tr>
* <td><code>union</code></td>
* <td>{@link org.apache.avro.Schema.Type#UNION}</td>
* <td />
* </tr>
* </tbody>
* </table>
*
* <h4><code>decimal</code></h4>
*
* The <code>totalDigits</code> and <code>fractionDigits</code> facets will be
* used to define the <code>decimal</code>'s precision and scale, respectively.
* If not defined, the default precision is 34 (following the IEEE 754R
* Decimal128 format), and the default scale is 8.
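*
* <p>
* As a rough illustration of what precision and scale mean for the encoded
* value, the sketch below converts a decimal's lexical form to bytes. It
* relies on the fact that Avro's <code>decimal</code> logical type stores the
* unscaled value as big-endian two's-complement bytes, with the scale kept in
* the schema. The class and method names are hypothetical and are not part of
* this package; how <code>XmlDatumWriter</code> itself handles rounding is
* not covered here.
* </p>

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.math.RoundingMode;

// Hypothetical helper for illustration only; not part of org.apache.avro.xml.
final class DecimalDefaultsSketch {
  // Defaults quoted in the text above: Decimal128 precision (34) and scale 8.
  static final int DEFAULT_PRECISION = MathContext.DECIMAL128.getPrecision();
  static final int DEFAULT_SCALE = 8;

  // Pads the value to the given scale and returns its unscaled value as
  // big-endian two's-complement bytes. Assumption: values with more fraction
  // digits than the scale are rejected (ArithmeticException), not rounded.
  static byte[] toAvroDecimalBytes(String lexical, int scale) {
    BigDecimal value = new BigDecimal(lexical).setScale(scale, RoundingMode.UNNECESSARY);
    return value.unscaledValue().toByteArray();
  }

  // Reverses the conversion using the scale recorded in the schema.
  static BigDecimal fromAvroDecimalBytes(byte[] bytes, int scale) {
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  public static void main(String[] args) {
    byte[] encoded = toAvroDecimalBytes("123.45", DEFAULT_SCALE);
    // Prints 123.45000000
    System.out.println(fromAvroDecimalBytes(encoded, DEFAULT_SCALE));
  }
}
```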
*
* <h4>Enums</h4>
*
* If all of the <code>enumeration</code> facet values can be represented as
* symbols of an Avro {@link org.apache.avro.Schema.Type#ENUM}, an Avro enum
* will be used. Otherwise, the Avro type of the restricted base type will be
* used instead.
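*
* <p>
* One way to picture this check: the Avro specification requires enum symbols
* to be unique and to match the name pattern
* <code>[A-Za-z_][A-Za-z0-9_]*</code>. The sketch below tests a set of facet
* values against that rule. The class and method names are hypothetical, and
* it is an assumption that <code>XmlDatumWriter</code> applies exactly this
* test.
* </p>

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of org.apache.avro.xml.
final class EnumSymbolCheckSketch {
  // Name rule for enum symbols, as given in the Avro specification.
  private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  // Returns true if every facet value is a valid Avro enum symbol and no
  // symbol is repeated.
  static boolean canBeAvroEnum(String... facetValues) {
    for (String value : facetValues) {
      if (!AVRO_NAME.matcher(value).matches()) {
        return false;
      }
    }
    return Arrays.stream(facetValues).distinct().count() == facetValues.length;
  }

  public static void main(String[] args) {
    System.out.println(canBeAvroEnum("RED", "GREEN", "BLUE")); // prints true
    System.out.println(canBeAvroEnum("1st", "2nd"));           // prints false
  }
}
```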
*
* <h3>Avro Map Generation</h3>
*
* <p>
* If an element has exactly one non-optional attribute of type
* <code>ID</code>, an Avro {@link org.apache.avro.Schema.Type#MAP} will be
* generated for that element and its direct siblings of the same name and
* type.
* </p>
*
* <p>
* If multiple differently-named children of the same element can be
* represented as maps, an Avro map of union of those elements will be
* generated instead. However, only elements of the same name and type
* will exist in the same map instance.
* </p>
*
* <p>
* XML Elements will not be re-ordered in the Avro document, so if elements of
* the same name and type are not direct siblings, they will not co-exist in
* the same map. Separate maps will be generated instead. Consider the
* following:
* </p>
*
* <pre>
* &lt;!-- In XML Schema --&gt;
* &lt;element name="map"&gt;
*   &lt;complexType&gt;
*     &lt;simpleContent&gt;
*       &lt;extension base="string"&gt;
*         &lt;attribute name="id" type="ID" use="required" /&gt;
*       &lt;/extension&gt;
*     &lt;/simpleContent&gt;
*   &lt;/complexType&gt;
* &lt;/element&gt;
* &lt;element name="record" type="string" /&gt;
*
* &lt;!-- In XML Document --&gt;
* &lt;map id="id1"&gt;This is the first record in a map.&lt;/map&gt;
* &lt;map id="id2"&gt;This is the second record in the same map.&lt;/map&gt;
* &lt;record&gt;This ends the previous map.&lt;/record&gt;
* &lt;map id="id3"&gt;This is the start of a new map.&lt;/map&gt;
* </pre>
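*
* <p>
* The grouping rule can be sketched with the JDK's DOM API alone. The code
* below walks the children of a parent element in document order and opens a
* new map whenever a run of map-eligible siblings is interrupted. It is a
* simplified illustration, not the implementation used by
* <code>XmlDatumWriter</code>: the class and method names are hypothetical,
* and map eligibility is reduced to matching a single element name.
* </p>

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration only; not part of org.apache.avro.xml.
final class SiblingMapGroupingSketch {
  // Summarizes the children of parent in document order. Consecutive
  // elements named mapElementName share one "map[...]" entry keyed by their
  // id attribute; any other element closes the open map.
  static String describe(Element parent, String mapElementName) {
    StringBuilder out = new StringBuilder();
    boolean inMap = false;
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n.getNodeType() != Node.ELEMENT_NODE) {
        continue; // skip text and comments
      }
      Element child = (Element) n;
      boolean isMapEntry = child.getTagName().equals(mapElementName);
      if (isMapEntry) {
        // Extend the open map, or start a new one.
        out.append(inMap ? "," : "map[").append(child.getAttribute("id"));
      } else {
        if (inMap) {
          out.append("] ");
        }
        out.append(child.getTagName()).append(' ');
      }
      inMap = isMapEntry;
    }
    if (inMap) {
      out.append(']');
    }
    return out.toString().trim();
  }

  // Builds the same sibling sequence as the XML document shown above.
  static Element sampleParent() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element root = doc.createElement("root");
      doc.appendChild(root);
      String[][] children = {{"map", "id1"}, {"map", "id2"}, {"record", null}, {"map", "id3"}};
      for (String[] c : children) {
        Element e = doc.createElement(c[0]);
        if (c[1] != null) {
          e.setAttribute("id", c[1]);
        }
        root.appendChild(e);
      }
      return root;
    } catch (ParserConfigurationException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    // Prints: map[id1,id2] record map[id3]
    System.out.println(describe(sampleParent(), "map"));
  }
}
```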
*
* <h3>Wildcard Elements and Attributes</h3>
*
* Wildcard elements (<code>&lt;any&gt;</code>) and attributes
* (<code>&lt;anyAttribute&gt;</code>) do not have an equivalent concept in
* Avro, and are therefore skipped. Any elements and attributes matched by
* wildcards in the XML document will not appear in the Avro document.
*
* <h3>Optional Attributes and Nillable Elements</h3>
*
* Optional attributes and nillable elements will be represented as a
* union of null and the simple type, as per Avro's handling of optional
* values. If the element or attribute was already a union, the null
* type will be added to that union.
*
* <h2>Generating an Avro Document From XML</h2>
*
* <p>
* {@link org.apache.avro.xml.XmlDatumWriter} will generate an Avro schema
* from one or more XML Schemas using the above specification, and write
* an XML {@link org.w3c.dom.Document} to an Avro
* {@link org.apache.avro.io.Encoder} accordingly. The generated Avro
* <code>Schema</code> can be retrieved from
* {@link org.apache.avro.xml.XmlDatumWriter#getSchema()} before encoding the
* first XML <code>Document</code>.
* </p>
*
* <p>
* A {@link org.apache.avro.xml.XmlDatumConfig} is required to set up the
* <code>XmlDatumWriter</code>. This is used to indicate where to read the
* XML Schemas from, and also to define the root element in the corresponding
* XML Documents. (XML Schemas do not have a way to indicate what their root
* element is.)
* </p>
*
* <p>
* <code>XmlDatumWriter</code> will encode the
* {@link org.apache.avro.xml.XmlDatumConfig} in the resulting Avro
* <code>Schema</code>, allowing <code>XmlDatumReader</code> to reconstruct
* the XML <code>Document</code> as closely as it can. (Wildcard elements and
* attributes are lost, and will not reappear in regenerated XML Documents.)
* </p>
*
* <h2>Generating an XML Document From Avro</h2>
*
* <p>
* {@link org.apache.avro.xml.XmlDatumReader} will construct an XML
* {@link org.w3c.dom.Document} from an Avro schema generated by
* <code>XmlDatumWriter</code> and a {@link org.apache.avro.io.Decoder}.
* The <code>XmlDatumWriter</code>'s generated {@link org.apache.avro.Schema}
* is required as it contains information on how to retrieve the corresponding
* XML Schemas.
* </p>
*
* <p>
* However, the resulting document will not be an exact copy of the original.
* Wildcard elements and attributes are not encoded in Avro, and therefore
* cannot be reconstructed. In addition, namespace prefixes will not match
* the originals, as they are also not encoded in Avro. The newly generated
* prefixes will still map to the correct namespaces in the correct scopes.
* </p>
*/
package org.apache.avro.xml;