/*******************************************************************************
 * Australian National University Data Commons
 * Copyright (C) 2013 The Australian National University
 *
 * This file is part of Australian National University Data Commons.
 *
 * Australian National University Data Commons is free software: you
 * can redistribute it and/or modify it under the terms of the GNU
 * General Public License as published by the Free Software Foundation,
 * either version 3 of the License, or (at your option) any later
 * version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 ******************************************************************************/

/**
 * This package contains classes that provide data storage functionality in ANU Data Commons.
 *
 * <p>This functionality is primarily provided by the class {@link au.edu.anu.datacommons.storage.DcStorage}, which
 * offers methods to:
 *
 * <ul>
 * <li>Add a file to a collection record
 * <li>Delete a file from a collection record
 * <li>Create a directory to which files can be added
 * <li>Add and delete external references
 * <li>Verify the tag files of a record that contain supplementary information about the files in that record
 * <li>Request a {@link au.edu.anu.datacommons.storage.info.FileInfo} object containing all info about a single file
 * in a record
 * <li>Request a {@link au.edu.anu.datacommons.storage.info.RecordDataSummary} object containing a collection of
 * FileInfo objects, each containing all info about a single file in the record
 * <li>Request the contents of a file in a collection as an InputStream
 * <li>Request the contents of multiple files in a collection as a ZipStream
 * </ul>
 *
 * <p>The files are stored on disk in accordance with the
 * <a href="http://tools.ietf.org/pdf/draft-kunze-bagit-06.pdf">BagIt specification</a>.
 * DcStorage adds, updates and deletes files in the payload directory or its subdirectories. Other classes complete the
 * tag files once a file is in the payload directory. Completing the tag files is essential for compliance with the
 * BagIt specification. The files on disk are stored in the following directory structure:
 *
 * <pre>
 * [BAGS_ROOT]/[PID]/data
 * </pre>
 *
 * <ul>
 * <li>BAGS_ROOT: The directory containing a subdirectory for each collection record.
 * <li>PID: The subdirectory containing the files of a specific collection record. The name is generated by replacing
 * all disk-unsafe characters in the record's identifier with '_' (underscore). E.g. anudc:1234 will have its files
 * stored in anudc_1234. This directory contains the BagIt-specific files, i.e. the tag files defined by the BagIt
 * specification. Users do not access files in this directory directly but view information stored in these files.
 * <li>data: Directory containing the files of the collection. The collection of files associated with a record is
 * stored in this directory. The BagIt specification refers to this directory as the payload directory. This directory
 * contains the files that users upload, download and modify.
 * </ul>
 *
 * <p>Before a file is added, updated or removed, the following actions are performed (pre-event):
 *
 * <ul>
 * <li>The filepath where the file will be saved is verified to be valid and to contain no parts whose name starts
 * with a '.'. Files and folders with names that start with '.' are reserved for internal files, such as preservation
 * format files associated with original files, which are stored in the hidden directory '.preserve' in the payload
 * directory.
 * <li>The source file is verified to already exist in a staging area, from where it will be moved to a record's
 * payload directory.
 * <li>If a file is being updated or deleted, the old file is archived if an archive directory is specified. If none
 * is specified, the old file is deleted.
 * </ul>
 *
 * <p>After a file is added, updated or removed, the following actions are performed (post-event):
 *
 * <ul>
 * <li>A preservation format file is created, if possible. For example, a BMP file is preserved as a PNG file, MP3 as
 * FLAC, office documents as ODx files etc. Video files are not preserved due to high processing and (potentially)
 * high storage requirements.
 * <li>Its MD5 checksum is calculated if not already provided in the
 * {@link au.edu.anu.datacommons.storage.temp.UploadedFileInfo} object. The MD5 is then stored in the payload manifest
 * as described in the BagIt specification.
 * <li>Any metadata contained in the file is extracted using <a href="http://tika.apache.org">Apache Tika</a>.
 * The format of the file is identified using <a href="https://github.com/openplanets/fido">FIDO</a>.
 * The file format's unique <a href="http://www.nationalarchives.gov.uk/PRONOM">Pronom Identifier</a> is stored in a
 * tag file along with the textual string describing the file format.
 * <li>The file's timestamp is stored in a tag file. This acts as a secondary data integrity check in case the MD5
 * wasn't stored correctly or at all.
 * <li>The file is indexed in Apache Solr if the collection is published and the files-public flag is set.
 * <li>The file is scanned using ClamAV and the scan results are stored in a tag file.
 * <li>Once all the above tasks have been performed and the tag files updated, the bag completion process runs. It
 * calculates the MD5 of every tag file updated in the preceding steps and updates the tag manifest.
 * <li>The bag directory is 'touched' so that the time it was last updated is easy to identify.
 * </ul>
 */
package au.edu.anu.datacommons.storage;