/*******************************************************************************
* Australian National University Data Commons
* Copyright (C) 2013 The Australian National University
*
* This file is part of Australian National University Data Commons.
*
* Australian National University Data Commons is free software: you
* can redistribute it and/or modify it under the terms of the GNU
* General Public License as published by the Free Software Foundation,
* either version 3 of the License, or (at your option) any later
* version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
******************************************************************************/
/**
* This package contains classes that provide data storage functionality in ANU Data Commons.
*
* <p>This functionality is provided primarily by the class {@link au.edu.anu.datacommons.storage.DcStorage}, which
* offers methods to:
*
* <ul>
* <li>Add a file to a collection record
* <li>Delete a file from a collection record
* <li>Create a directory to which files can be added
* <li>Add and delete external references
* <li>Verify the tag files of a record, which contain supplementary information about the files in that record
* <li>Request a {@link au.edu.anu.datacommons.storage.info.FileInfo} object containing all info about a single file
* in a record
* <li>Request a {@link au.edu.anu.datacommons.storage.info.RecordDataSummary} object containing a collection of FileInfo
* objects each containing all info about a single file in the record
* <li>Request the contents of a file in a collection as an InputStream
* <li>Request the contents of multiple files in a collection as a ZipStream
* </ul>
*
* <p>The files are stored on disk in accordance with the
* <a href="http://tools.ietf.org/pdf/draft-kunze-bagit-06.pdf">BagIt specification</a>.
* DcStorage adds, updates and deletes files in the payload directory or its subdirectories. Once a file is in the
* payload directory, other classes complete the tag files. Completing the tag files is essential for compliance
* with the BagIt specification. The files on disk are stored in the following directory structure:
*
* <pre>
* [BAGS_ROOT]/[PID]/data
* </pre>
*
* <ul>
* <li>BAGS_ROOT: The directory containing a subdirectory for each collection record.
* <li>PID: The subdirectory containing the files of a specific collection record. The name is generated by replacing
* all disk-unsafe characters in the record's identifier with '_' (underscore). E.g. anudc:1234 will have its files
* stored in anudc_1234. This directory contains the BagIt-specific files, i.e. the tag files defined by the BagIt
* specification. Users do not access these tag files directly, but can view the information stored in them.
* <li>data: The directory containing the files of the collection associated with a record. The BagIt specification
* refers to this directory as the payload directory. It contains the files that users upload, download and modify.
* </ul>
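*
* <p>The mapping from a record identifier to its payload directory can be sketched as follows. This is a minimal
* illustration only: the class and method names are hypothetical and not part of this package, and the set of
* characters treated as disk-unsafe is an assumption, since it is not listed here.
*
* <pre>
```java
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the ANU Data Commons code base.
final class BagPathSketch {

    // Assumption: every character other than an ASCII letter, digit, '.', '-' or '_'
    // is treated as disk-unsafe and replaced with an underscore.
    static String toDirName(String pid) {
        return pid.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    // Resolves [BAGS_ROOT]/[PID]/data, the payload directory of a record.
    static Path payloadDir(Path bagsRoot, String pid) {
        return bagsRoot.resolve(toDirName(pid)).resolve("data");
    }
}
```
* </pre>
*
* <p>Under these assumptions, <code>toDirName("anudc:1234")</code> yields <code>anudc_1234</code>, matching the
* example above.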
*
* <p>Before a file is added, updated or removed, the following actions are performed:
*
* <ul>
* <li>Verifies that the filepath where the file will be saved is valid and doesn't contain parts whose name starts
* with a '.'. Files and folders with names that start with '.' are reserved for internal files such as preservation
* format files associated with original files that are stored in the hidden directory '.preserve' in the payload
* directory.
* <li>Verifies that the source file already exists in a staging area from where it will be moved to a record's payload
* directory.
* <li>If a file is being updated or deleted, the old file is archived if an archive directory is specified. If none
* is specified, the old file is deleted.
* </ul>
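*
* <p>The first check in the list above might look roughly like the sketch below. The class and method names are
* hypothetical, and treating only non-empty relative paths as "valid" is an assumption made for the example; the
* actual validity rules are not spelled out here.
*
* <pre>
```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the ANU Data Commons code base.
final class PayloadPathCheckSketch {

    // Returns true if the target path is relative, non-empty and has no part
    // whose name starts with '.', since such names are reserved for internal files.
    static boolean isAllowed(String relativePath) {
        if (relativePath == null || relativePath.isEmpty()) {
            return false;
        }
        Path path = Paths.get(relativePath);
        if (path.isAbsolute()) {
            return false;
        }
        // Iterating a Path visits each name element in turn.
        for (Path part : path) {
            if (part.toString().startsWith(".")) {
                return false;
            }
        }
        return true;
    }
}
```
* </pre>
*
* <p>Because "." and ".." also start with a dot, this check rejects them as well, which keeps a target path from
* escaping the payload directory.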
*
* <p>After a file is added, updated or removed, the following actions are performed:
*
* <ul>
* <li>A preservation format file is created, if possible. For example, a BMP file is preserved as a PNG file, an MP3
* file as FLAC, and office documents as ODx files. Video files are not preserved due to high processing and
* (potentially) high storage requirements.
* <li>The file's MD5 checksum is calculated if not already provided in the
* {@link au.edu.anu.datacommons.storage.temp.UploadedFileInfo} object. The MD5 is then stored in the payload manifest
* as described in the BagIt specification.
* <li>Any metadata contained in the file is extracted using <a href="http://tika.apache.org">Apache Tika</a>.
* The format of the file is identified using <a href="https://github.com/openplanets/fido">FIDO</a>. The file
* format's unique <a href="http://www.nationalarchives.gov.uk/PRONOM">PRONOM identifier</a> is stored in a tag file
* along with the textual string describing the file format.
* <li>The file's timestamp is stored in a tag file. This acts as a secondary form of data integrity check if the MD5
* wasn't stored correctly or at all.
* <li>The file is indexed in Apache Solr if the collection is published and the files-public flag is set.
* <li>The file is scanned using ClamAV and the scan results are stored in a tag file.
* <li>Once all the above tasks have been performed and the tag files have been updated, the bag completion process
* runs. It calculates the MD5 of each tag file updated in the preceding steps and updates the tag manifest.
* <li>The bag directory is 'touched' so that the time of its last update can be identified easily.
* </ul>
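*
* <p>The MD5 step above can be sketched with the standard <code>java.security.MessageDigest</code> API. The class
* and method names below are hypothetical and do not reflect how this package implements the step. The manifest line
* follows the "checksum, whitespace, path relative to the bag directory" layout of BagIt payload manifests.
*
* <pre>
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the ANU Data Commons code base.
final class Md5Sketch {

    // Streams the input through an MD5 digest and returns the lowercase hex string.
    static String md5Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds one payload manifest entry for a file stored under the data directory.
    static String manifestLine(String md5, String payloadRelativePath) {
        return md5 + "  data/" + payloadRelativePath;
    }
}
```
* </pre>
*
* <p>Reading the file in fixed-size chunks keeps memory use constant regardless of file size, which matters for
* large payload files.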
*/
package au.edu.anu.datacommons.storage;