Hierarchical Data Format
From Wikipedia, the free encyclopedia
Filename extension | hdf, h4, hdf4, h5, hdf5, he4, he5
Type of format | scientific data format
Website | http://www.hdfgroup.org
Hierarchical Data Format, commonly abbreviated HDF, HDF4, or HDF5, is the name of a set of file formats and libraries designed to store and organize large amounts of numerical data. Originally developed at the NCSA, it is currently supported by the non-profit HDF Group, whose mission is to ensure the continued development of HDF5 technologies and the continued accessibility of data currently stored in HDF.
In keeping with this goal, the HDF format, libraries, and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB, IDL, and Python. The freely available HDF distribution consists of the library, command-line utilities, test suite source, a Java interface, and the Java-based HDF Viewer (HDFView).[1]
There currently exist two major versions of HDF, HDF4 and HDF5, which differ significantly in design and API.
HDF4
HDF4 is the older version of the format, although it is still actively supported by the HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each model defines a specific aggregate data type and provides an API for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.
HDF is self-describing, allowing an application to interpret the structure and contents of a file without any outside information. One HDF file can hold a mixture of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."
The HDF4 format has many limitations.[2][3] It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types support only predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.
HDF5
The HDF5 format is designed to address some of the limitations of the HDF4 library and to meet current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 award.[4][5]
HDF5 simplifies the file structure to include only two major types of object:
- Datasets, which are multidimensional arrays of a homogeneous type
- Groups, which are container structures that can hold datasets and other groups
This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are even accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
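The group/dataset/attribute model described above can be sketched with h5py, one of the Python interfaces listed later in this article. The file name, group names, and attribute values here are illustrative, not part of any standard.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # Groups act like directories; datasets act like files.
    grp = f.create_group("experiment/run1")           # intermediate groups created automatically
    dset = grp.create_dataset("temperature",
                              data=np.arange(10.0))   # a homogeneous multidimensional array
    # Metadata is attached as named attributes on groups and datasets.
    dset.attrs["units"] = "kelvin"
    f["experiment"].attrs["operator"] = "example"

with h5py.File("example.h5", "r") as f:
    # Resources are addressed with POSIX-like paths.
    data = f["/experiment/run1/temperature"][:]
    units = f["/experiment/run1/temperature"].attrs["units"]
```

Note how the path syntax `/experiment/run1/temperature` mirrors a filesystem, which is exactly the "truly hierarchical" structure the format is named for.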
In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
The next version of NetCDF, version 4, is based on HDF5.
Because it uses B-trees to index table objects, HDF5 works well for time-series data such as stock market ticks or network monitoring data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, while B-tree access remains available for non-array data. Applications that would otherwise have to design a star schema to fit their data into SQL may find HDF5 a simpler, faster alternative storage mechanism.
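A minimal sketch of such table-style storage, using h5py with a NumPy compound type for tick-like records; the field names, file name, and values are made up for illustration. An appendable (chunked, resizable) dataset models a growing time series.

```python
import h5py
import numpy as np

# One record per tick: a compound (table-row) type.
tick_dtype = np.dtype([("timestamp", "f8"), ("price", "f8"), ("volume", "i4")])
ticks = np.array([(1.0, 100.5, 10), (2.0, 100.7, 5)], dtype=tick_dtype)

with h5py.File("ticks.h5", "w") as f:
    # Chunked + unlimited first dimension, so new ticks can be appended later.
    dset = f.create_dataset("ticks", data=ticks, maxshape=(None,), chunks=True)
    dset.resize((3,))                                         # grow by one row
    dset[2] = np.array((3.0, 100.6, 8), dtype=tick_dtype)     # append a record

with h5py.File("ticks.h5", "r") as f:
    # Read a single field across all rows, column-store style.
    prices = f["ticks"]["price"][:]
```

Reading one field of every row, as in the last line, is the kind of columnar access that tends to be much faster than row-by-row retrieval from an SQL table.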
Interfaces
Low-level APIs
- C
- C++
- Fortran
- F90
- Java
- Perl
- MATLAB – uses HDF5 as primary storage format in recent releases
- IDL & GDL
- PyTables – an interface for Python
- h5py – another Python interface
High-level APIs
- HDF5 Lite (H5LT) – a light-weight interface for C
- HDF5 Image (H5IM) – a C interface for images or rasters
- HDF5 Table (H5TB) – a C interface for tables
- HDF5 Packet Table (H5PT) – interfaces for C and C++ to handle "packet" data, accessed at high speeds
- HDF5 Dimension Scale (H5DS) – allows dimension scales to be added to HDF5; to be introduced in the HDF5-1.8 release
- JHDF5 – an HDF5 library for Java 5 and later (includes HDF5 1.8 libraries)
- Mathematica[6] – immediate analysis of HDF and HDF5 data
See also
- Common Data Format (CDF)
- NetCDF
- FITS, a data format used in astronomy
- GRIB (GRIdded Binary), a data format used in meteorology
- Q5cost – a Fortran API for using HDF5 in quantum chemistry
References
External links
- The HDF Group home page
- What is HDF5?
- "An Introduction to Distributed Visualization"; section 4.2 contains a comparison of CDF, HDF, and netCDF.
- A presentation on how to handle large datasets in Quantum Chemistry using hdf5
- Tools
- HDFView – a browser and editor for HDF files
This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.