In response to this use of several computers, NCSA developed the Hierarchical Data Format (HDF) in 1988. NCSA HDF is a portable, self-describing data format for moving and sharing scientific data in networked, heterogeneous computing environments. HDF can store several different kinds of data objects: multidimensional arrays, raster images, color palettes, and tables. It allows individual scientists to mix and group different kinds of data in one file, according to their needs.
NCSA provides a library of application programming interfaces (APIs) for reading and writing HDF as well as workstation tools for visualizing data stored in HDF files. With the library and tools providing easy access to HDF, an enthusiastic user community emerged almost immediately. It began with NCSA's own scientific community and soon extended to other organizations and institutions, including several international users.

Figure 1. Multidimensional array of elements.
Early extensions to HDF: a second-generation HDF
HDF's designers understood that over time new, unforeseen requirements would
emerge that HDF would not be able to handle and that it needed to be
extendable. They designed HDF in such a way that new types of data could be
added when original structures were insufficient.
Indeed, soon after scientists began using HDF, they asked for enhancements: a table structure, a way of grouping objects within HDF files, support for data compression, and more data types. They also asked for changes in the HDF library, including more powerful APIs, user-defined attributes for HDF objects, and support within the library for another popular scientific data file format called netCDF.
From these needs a "second generation" HDF was created. The second generation of HDF is compatible with the first generation of HDF, but includes these new features:

Figure 2. Derivation of existing HDF objects from new object type.
New Requirements: EOSDIS and Grand Challenges
Recently two new classes of HDF users have pushed the limits of the current
implementation of HDF -- the Grand Challenge and global change research
communities.
Grand Challenge projects address problems in science and engineering whose solutions can be advanced by applying HPCC technologies. These problems involve very large datasets and typically run on fast, multiprocessing machines that require very fast I/O.
Global change research collects, organizes, and processes large amounts of data in order to understand how the Earth works. A fundamental component of global change research is the Earth Observing System (EOS), a space-based observing system with instruments that will ultimately collect terabytes of data daily. The EOS Data and Information System (EOSDIS) will use HDF as a standard format for storing much of the EOS data.
Some of these applications call for data files with thousands of data structures. Others store very large images, arrays, or tables. Some will have complex collections of interrelated data and metadata. These applications also frequently use computing technologies, such as object-oriented approaches, to manage and manipulate data.
EOSDIS and Grand Challenge projects pose a whole new set of needs for HDF, needs that HDF 4.0 does not satisfy, including:

Figure 3. Hierarchical file structure.
BigHDF: Accommodating variety
Current plans call for three fundamental changes in HDF: a unified data model,
a simpler file structure, and a new I/O library.
New data model
HDF now supports several different objects, but the proposed data model will
support only one: a multidimensional array of elements (see Figure 1).
The new object will have two required attributes: dimensionality (the number and sizes of dimensions) and a data type (a definition of the type of the array elements). More data types will be supported, including complex numbers, date and time, pointers, and record structures.
Objects will include optional user-defined attributes of the form "parameter = value." Users will specify optional physical storage schemes for the data. By default, objects will be stored contiguously in a file, but alternative physical formats will be available, giving data producers some control over the physical organization of their data. Alternative physical formats will include external, linked-block, chunked, and compressed storage, as well as an indexed structure.
For backward compatibility, the new HDF object is designed so that all current objects can be defined as subtypes of this basic object type (see Figure 2).
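The proposed data model can be pictured with a small Python sketch. Everything named here (ArrayObject, Storage, RasterImage, the type strings) is hypothetical and chosen for illustration; the proposal does not define an API. The sketch only shows the shape of the idea: one array object with two required attributes, optional attributes and storage scheme, and an existing object kind expressed as a subtype.

```python
from dataclasses import dataclass, field
from enum import Enum
from math import prod


class Storage(Enum):
    """Physical storage schemes named in the text; contiguous is the default."""
    CONTIGUOUS = "contiguous"
    EXTERNAL = "external"
    LINKED_BLOCK = "linked-block"
    CHUNKED = "chunked"
    COMPRESSED = "compressed"
    INDEXED = "indexed"


@dataclass
class ArrayObject:
    """Hypothetical stand-in for the single proposed object type."""
    dims: tuple                                      # required: size of each dimension
    dtype: str                                       # required: element type (names are illustrative)
    attributes: dict = field(default_factory=dict)   # optional "parameter = value" pairs
    storage: Storage = Storage.CONTIGUOUS            # optional physical layout

    @property
    def rank(self):
        """Number of dimensions."""
        return len(self.dims)

    @property
    def n_elements(self):
        """Total number of array elements."""
        return prod(self.dims)


@dataclass
class RasterImage(ArrayObject):
    """An existing object kind as a subtype: a two-dimensional array of pixels."""

    def __post_init__(self):
        if self.rank != 2:
            raise ValueError("a raster image must be two-dimensional")


temperature = ArrayObject(dims=(180, 360, 12), dtype="float32",
                          attributes={"units": "kelvin"},
                          storage=Storage.CHUNKED)
image = RasterImage(dims=(480, 640), dtype="uint8")
```

The subtype adds only a constraint, not new fields, which is one plausible reading of how current objects could be derived from the basic type.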
New file structure
The new file structure will support files and objects of any size and any
number of objects. The internal structure for describing objects is simpler
than the current structure and should provide faster, easier access to objects.
The three object parts are an object ID (OID), a header record with information
required to describe the object, and the object itself.
The OID will be a large number, making it possible to represent a correspondingly large number of objects in an HDF file. The header record will contain or point to information about the object, including user-defined attributes and a description of the physical storage scheme used to store the object.
The entire file structure will consist of a boot block, followed by one or more objects. The boot block will begin with two text blocks that provide the user and the HDF library space for storing messages. It also will include the sizes of numbers that represent the offsets and lengths of objects, making it possible to support objects of virtually any size in the file. The last item in the boot block points to the first object in the file. If more than one object is in the file, the first object will be an array of OIDs. This kind of grouping can be repeated to create a hierarchical organization of objects within the file (see Figure 3).
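To make the layout concrete, here is a toy sketch in Python. The proposal does not give field widths, byte order, or encodings, so the 64-byte text blocks, one-byte size fields, and little-endian integers below are assumptions, as are all function and variable names. The second half models the grouping idea in memory: a group is an object whose payload is an array of OIDs.

```python
import struct

TEXT_BLOCK_SIZE = 64  # assumed; the proposal does not state a size
FIXED_FORMAT = f"<{TEXT_BLOCK_SIZE}s{TEXT_BLOCK_SIZE}sBB"


def pack_boot_block(user_text, library_text, offset_size, length_size, first_object):
    """Serialize a toy boot block: two text blocks, two size fields, one pointer."""
    head = struct.pack(FIXED_FORMAT,
                       user_text.encode("ascii"),
                       library_text.encode("ascii"),
                       offset_size, length_size)
    # The pointer is as wide as the boot block itself declares offsets to be.
    return head + first_object.to_bytes(offset_size, "little")


def unpack_boot_block(raw):
    """Parse the toy boot block back into a dictionary."""
    fixed = struct.calcsize(FIXED_FORMAT)
    user, lib, offset_size, length_size = struct.unpack_from(FIXED_FORMAT, raw)
    first = int.from_bytes(raw[fixed:fixed + offset_size], "little")
    return {
        "user_text": user.rstrip(b"\0").decode("ascii"),
        "library_text": lib.rstrip(b"\0").decode("ascii"),
        "offset_size": offset_size,
        "length_size": length_size,
        "first_object": first,
    }


# Toy object table: OID -> (header record, payload).
# A group's payload is a list of OIDs; an array's payload is raw bytes.
objects = {
    1: ({"kind": "group"}, [2, 3]),
    2: ({"kind": "array", "storage": "contiguous"}, b"\x00\x01"),
    3: ({"kind": "group"}, [4]),
    4: ({"kind": "array", "storage": "chunked"}, b"\x02\x03"),
}


def walk(oid, depth=0):
    """Yield (depth, OID) pairs for the hierarchy rooted at the given OID."""
    header, payload = objects[oid]
    yield depth, oid
    if header["kind"] == "group":
        for child in payload:
            yield from walk(child, depth + 1)


raw = pack_boot_block("user note", "library note", 8, 8, 138)
boot = unpack_boot_block(raw)
tree = list(walk(1))
```

Because the offset width is read from the boot block rather than fixed in the code, the same reader could handle files whose offsets are 4, 8, or more bytes wide, which is the point of recording those sizes up front.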
Focus on interoperability
In planning the next generation HDF library, NCSA developers hope to exploit
similarities between HDF and other popular scientific data formats by building
a system that understands a variety of different data models and formats (see
Figure 4). APIs at the top level allow programs to view data according to
a variety of different data models. These APIs communicate with the middle
layer through "object brokers" that rewrite their requests in terms of a common
model. The middle layer also determines which service needs to be invoked to
read or write the data and then invokes the necessary service.
The service layer consists of different file format drivers, each of which reads from or writes to one file format. Each driver has a well-documented interface for transferring objects and lists of objects to the middle (arbitration) layer above it. Possible drivers in the first implementation include HDF, BigHDF, netCDF, and FITS.
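The three layers can be sketched in Python as follows. This is an illustration of the layering only, not the planned library: the class names, the method signatures, the use of file extensions to pick a driver, and the dict used as a "common model" are all assumptions. The drivers are backed by in-memory dictionaries so the example runs without real files.

```python
from abc import ABC, abstractmethod


class FormatDriver(ABC):
    """Service layer: one driver per file format (interface is hypothetical)."""
    extensions = ()

    @abstractmethod
    def read(self, path, name):
        """Return the named object in the common model (here, a plain dict)."""


class InMemoryDriver(FormatDriver):
    """Stand-in driver backed by a dict instead of a real file format."""

    def __init__(self, extensions, store):
        self.extensions = extensions   # tuple of file suffixes this driver handles
        self._store = store            # maps (path, name) to a common-model object

    def read(self, path, name):
        return self._store[(path, name)]


class Arbiter:
    """Middle layer: decides which driver serves a request, then invokes it."""

    def __init__(self, drivers):
        self._drivers = drivers

    def read(self, path, name):
        for driver in self._drivers:
            if path.endswith(driver.extensions):
                return driver.read(path, name)
        raise LookupError(f"no driver registered for {path!r}")


class VariableBroker:
    """Object broker: rewrites a 'variable' style request for the common model."""

    def __init__(self, arbiter):
        self._arbiter = arbiter

    def get_variable(self, path, variable):
        obj = self._arbiter.read(path, variable)
        return obj["dims"], obj["data"]


hdf = InMemoryDriver((".hdf",),
                     {("run1.hdf", "temp"): {"dims": (2, 2), "data": [1, 2, 3, 4]}})
nc = InMemoryDriver((".nc",),
                    {("obs.nc", "sst"): {"dims": (3,), "data": [0.1, 0.2, 0.3]}})
broker = VariableBroker(Arbiter([hdf, nc]))
dims, data = broker.get_variable("run1.hdf", "temp")
```

The calling program sees the same get_variable call whether the data lives in one format or another; supporting a new format means adding a driver, with no change to the top-level API.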

Figure 4. Proposed architecture for next generation HDF library.
Prototyping planned for '96
NCSA's development team is eager for feedback on the BigHDF proposal. Turning the proposal's key features into a prototype is the goal for 1996.