Metadata Sharing Among Distributed Heterogeneous Databases Using Java Technology: A Turn-key Solution

D. Kendig, C. Gokey, R.T. Northcutt, L. Olsen (Goddard Space Flight Center, Greenbelt MD, USA) O. Bukhres, S. Sikkuppparbathyam (Purdue University)

A challenging area in building a directory of Earth science metadata is the exchange of metadata content among partner organizations. Heterogeneous metadata schemas, database schemas, database implementations and platforms are difficult to overcome when building a turn-key solution. At the GCMD, we are building a distributed system that is not limited to a particular database architecture. This design will allow automated exchange of metadata content among Earth science collaborators. The system will include a local database agent for each partner that captures database updates and broadcasts them to other cooperating nodes. While the exchange of metadata and the validation of the content are distributed over the network of partners, the controlled vocabulary will be managed by one administrative node. JDBC will be used as an abstract interface to the RDBMS to build a more generic and portable solution. RMI will be used as a lightweight object request broker to exchange metadata content within the network.



 Introduction

The number of Earth science data sets available each year to the public is increasing exponentially. The descriptions of these data sets are distributed over the internet at a variety of sites with a variety of search capabilities. This distribution makes it challenging to search for metadata entries efficiently. Partner directories are also interested in sharing content in order to improve the visibility of metadata records within the scientific communities they serve. The distributed, heterogeneous nature of the internet makes it difficult to exchange and share content easily. Varying metadata formats compound the difficulty of locating data sets and exchanging metadata records. Although progress is being made toward the ISO 19115 metadata standard, a large number of legacy metadata repositories will remain. Repositories may also differ widely in how the metadata are physically stored on their local sites. In addition to differing metadata schemas, the database implementation, database schema and platform often vary greatly among metadata archives. These heterogeneous aspects have made it difficult to do distributed searches and to exchange metadata among collaborating organizations.
 

Search Architectures

There are two standard architectures for searching distributed databases. A distributed search places a query to each of the sites that are to be searched. Either all nodes in the system must be searched or the user must choose which nodes are to be queried. As the number of nodes becomes large, it may not be obvious to the user which nodes to query, and searching all nodes is not practical. This architecture also tends to be slow due to network latency between the user's site and each of the sites that are queried. In addition, the search results and their timeliness will only be as good as the search engines that are available on each of the sites. If even one node being searched is slow or has limited search capabilities, the distributed query is affected. There is the additional issue of duplicate listings: a metadata record that is registered with more than one archive may appear multiple times within a search result.

The other type of searching is a centralized search where a query is only placed to one central node. In this case, data is passed from subordinate nodes to the central node and consistency is encouraged. While the centralized approach allows for faster queries, it limits the autonomy of the subordinate nodes and creates a single point of failure. The exchange of metadata to the central node is an additional overhead in this architecture.

The IDN Approach to Metadata Sharing

The International Directory Network (IDN) chose to use a distributed architecture for input but a centralized architecture for searching. Distributed input allows metadata to be registered with the local node and then mirrored to a centralized node for search. This avoids the distributed search issues such as network latency and redundant search results. The IDN partners will be able to share metadata content among themselves and control which content they choose to host at their local site. This allows an individual node to grow in content, yet retain its topical theme by filtering which records are accepted at its site. The GCMD will continue its role of assuring the quality and consistency of metadata content. The GCMD administers a set of valid search terms and reviews metadata content for consistency. Having a well-defined set of search terms that is universally applied to the metadata results in a more consistent, directed search with a narrower and more focused selection of metadata records.

Scientific agencies and researchers are encouraged to share metadata with their partners to populate the network of directories. Additional advantages of sharing metadata content are improved quality of the metadata, increased cooperation within agencies and among researchers, and improved search results. The current exchange of metadata among IDN partners tends to be manual and labor intensive. No automated mechanism is used to monitor IDN databases for content updates and notify other sites that may have an interest in these updates. Partners rarely have the time to audit every database modification and propagate it to others by hand.

To integrate metadata directories and to allow metadata to flow between directories and clusters of directories, an automated solution needed to be found. The solution that is currently being implemented will federate metadata directories by capturing metadata updates, additions and deletions and automatically propagating the modifications throughout the network of directories. The database administrator is freed from the mundane duties of tracking and broadcasting content updates, and the errors of manual exchange are reduced through automation. The translation of metadata records between schemas is also needed. A translation component will perform on-the-fly conversions when content is exchanged between members using differing metadata schemas. The level of automation is high and the ownership of the metadata content is preserved within the system. Careful planning was given to the issue of metadata ownership. A site is prevented from modifying a record that is owned by another node. The local node also has the autonomy to decide which IDN content it chooses to include at its site. If a local node does not want to host a metadata record, it may choose not to allow the record's entry into its system, but it will not be able to remove the record from the entire system. This prevents the catastrophic case of one node accidentally deleting another node's content from the entire network. When a successful solution is built, it will result in better populated directories that are faster, require less complex searches, and have the most up-to-date content.

The MD8 Architecture

A three-tier distributed database system is being built for the IDN that will allow data sharing among partners within a network of nodes. The first tier is the client, which consists of a lightweight user interface. The second tier is an object-oriented logic component built around Java servlets, and the third tier is an information store such as an RDBMS. A distributed architecture was chosen to avoid the single point of failure of a centralized architecture. This architecture is also more compatible with the design requirement that a node retain ownership and autonomy of its content.

Within the network of nodes it is essential that each metadata entry have its own unique identifier. Each local node must already retain a unique entry ID for each metadata record within its local autonomous database. While this guarantees a unique identifier within a specific site, each node is also assigned a unique three-digit code to extend the uniqueness to the network of nodes. This code is used as a prefix to the local unique identifier. The compound ID is called the Global Entry ID (GE_ID) and is unique over the network of all nodes. The intent is to make this global identifier synonymous with the data set so that researchers can use it as a citation within journals and publications.
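The composition of a Global Entry ID can be illustrated with a minimal Java sketch. The class name GlobalEntryId, the hyphen separator and the sample identifiers are assumptions made for illustration; the text above specifies only that a three-digit node code is prefixed to the local entry ID.

```java
import java.util.Objects;

/** Sketch of a Global Entry ID: a three-digit node code prefixed to a node-local entry ID. */
final class GlobalEntryId {
    // Assumption: the separator is illustrative; the source does not specify one.
    private static final char SEPARATOR = '-';

    private final String nodeCode;
    private final String localEntryId;

    GlobalEntryId(String nodeCode, String localEntryId) {
        if (nodeCode == null || !nodeCode.matches("\\d{3}")) {
            throw new IllegalArgumentException("node code must be exactly three digits: " + nodeCode);
        }
        if (localEntryId == null || localEntryId.isEmpty()) {
            throw new IllegalArgumentException("local entry ID must not be empty");
        }
        this.nodeCode = nodeCode;
        this.localEntryId = localEntryId;
    }

    /** Parses "NNN-localId", splitting on the first separator so local IDs may contain one. */
    static GlobalEntryId parse(String text) {
        int split = text.indexOf(SEPARATOR);
        if (split < 0) {
            throw new IllegalArgumentException("missing separator in: " + text);
        }
        return new GlobalEntryId(text.substring(0, split), text.substring(split + 1));
    }

    String nodeCode() { return nodeCode; }

    String localEntryId() { return localEntryId; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof GlobalEntryId)) {
            return false;
        }
        GlobalEntryId that = (GlobalEntryId) other;
        return nodeCode.equals(that.nodeCode) && localEntryId.equals(that.localEntryId);
    }

    @Override
    public int hashCode() { return Objects.hash(nodeCode, localEntryId); }

    @Override
    public String toString() { return nodeCode + SEPARATOR + localEntryId; }
}

final class GlobalEntryIdDemo {
    public static void main(String[] args) {
        // Hypothetical node code and local entry ID.
        System.out.println(new GlobalEntryId("042", "OZONE_TOMS"));
    }
}
```

Because the node code is fixed-width, the local part can keep whatever format the owning node already uses.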

GCMD's Version 8 (MD8) is being designed and implemented in Java. There are a number of reasons that Java was chosen. Paramount among these is its portability. The operating environment of the nodes on the network is heterogeneous, and Java allows for a 'write once, run anywhere' application. Java's object-oriented features are essential to a design that is robust and extensible and that allows a node's local database implementation to be encapsulated. The persistence model and the Local Database Agent (LDA) are both good examples of components that use OO features to abstract away some of the local database complexities, so they are not locked into a particular database implementation. Java DataBase Connectivity (JDBC) strengthens the portability further by providing a generic API to the local database, so the application does not rely on a particular database product. Another feature of Java that is utilized is Remote Method Invocation (RMI), which supplies the backbone of the connectivity between the nodes. Using serialization to pass objects with RMI provides a simple object request broker (ORB) without requiring cooperating partners to purchase a commercial, heavyweight ORB. Additional features of Java such as built-in security, garbage collection, Swing components, stability and wide acceptance all contribute to a quick development schedule with a robust design.

The metadata content that is passed among nodes is in the GCMD Directory Interchange Format (DIF) schema and stored as an XML document. XML support is advantageous for a number of reasons. XML allows for a more open, extensible system, so that others may use the DTD or style sheets to build their own utilities. The DTD will allow for the fundamental validation of the metadata content. It will verify that all the required fields are present and that the order and type of the fields are correct. XML has a number of support utilities already available in Java that the developer is not required to build. A freely available SAX parser implemented in Java will take the XML document, parse it, compare it against the definitions and constraints defined in the DTD and finally build the Document Object Model (DOM) object. This DOM is an application object that is a tree representation of the XML document that was just parsed. The developer may then use a freely available library of methods to access and manipulate the DOM object in order to create a persistent DIF object.
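The parse, validate and build-DOM sequence described above can be sketched with the JAXP classes bundled with the JDK. This is an assumption about tooling, not necessarily the parser used in MD8, and the inline DTD below is a heavily simplified, hypothetical stand-in for the real DIF DTD. The class and method names are also illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DifXmlValidator {
    // Hypothetical, simplified DTD: NOT the actual GCMD DIF DTD.
    // It requires Entry_ID then Entry_Title, with an optional Summary.
    static final String DOCTYPE =
          "<!DOCTYPE DIF [\n"
        + "  <!ELEMENT DIF (Entry_ID, Entry_Title, Summary?)>\n"
        + "  <!ELEMENT Entry_ID (#PCDATA)>\n"
        + "  <!ELEMENT Entry_Title (#PCDATA)>\n"
        + "  <!ELEMENT Summary (#PCDATA)>\n"
        + "]>\n";

    private DifXmlValidator() { }

    /** Parses the XML, validates it against its DTD and returns the DOM tree. */
    static Document parseAndValidate(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validation problems are reported to the handler; rethrow them so
            // that an invalid document is rejected instead of silently accepted.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("DIF document rejected: " + e.getMessage(), e);
        }
    }

    /** Reads one field from the DOM tree. */
    static String entryTitle(Document doc) {
        return doc.getElementsByTagName("Entry_Title").item(0).getTextContent();
    }
}

final class DifXmlValidatorDemo {
    public static void main(String[] args) {
        String xml = DifXmlValidator.DOCTYPE
            + "<DIF><Entry_ID>X1</Entry_ID><Entry_Title>Sample data set</Entry_Title></DIF>";
        System.out.println(DifXmlValidator.entryTitle(DifXmlValidator.parseAndValidate(xml)));
    }
}
```

A document with a missing required field or with fields in the wrong order fails the content model and is rejected before any persistent object is built.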

The LDA Component

One of the key components is the Local Database Agent (LDA). It resides on each node and will encapsulate the local database. The LDA serves to hide some of the implementation complexities by providing a standard API for receiving incoming content from other nodes, monitoring local database content and broadcasting new or modified local content. It is designed to be extensible so that it may also work with text-based storage mechanisms. Java's portability and OO features allow for extensibility and reuse, as some components may need to be slightly customized for a local node. The LDA is intended to be a plug-and-play application that acts as a mediator between the local database implementation and the network of related nodes. It will have a messaging component that listens for incoming content from other nodes and an announcer component that broadcasts local modifications to the network of nodes. Java's RMI is utilized to communicate among nodes and serves as a lightweight object request broker for exchanging serialized DIF objects between LDAs. The RMI server will be the broker for three types of objects: the DIF, a personnel record and a list of controlled vocabulary.
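A minimal sketch of what the RMI contract between LDAs might look like follows. The names DifMessage, MetadataExchange and SerializationDemo are hypothetical and do not come from the MD8 code. Rather than opening network connections, the demo performs a local round trip through Java serialization, the mechanism RMI uses to pass non-remote arguments by value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Hypothetical serializable carrier for one DIF record. */
final class DifMessage implements Serializable {
    private static final long serialVersionUID = 1L;

    final String globalEntryId;
    final String difXml;

    DifMessage(String globalEntryId, String difXml) {
        this.globalEntryId = globalEntryId;
        this.difXml = difXml;
    }
}

/** Hypothetical remote interface that an LDA could export over RMI. */
interface MetadataExchange extends Remote {
    /** Delivers a new or modified DIF; returning true acts as the acknowledgment. */
    boolean receive(DifMessage message) throws RemoteException;
}

final class SerializationDemo {
    private SerializationDemo() { }

    /** Serializes and deserializes a message, as RMI would when marshalling it. */
    static DifMessage roundTrip(DifMessage message) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(message);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DifMessage) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        DifMessage copy = roundTrip(new DifMessage("042-X1", "<DIF/>"));
        System.out.println(copy.globalEntryId + " " + copy.difXml);
    }
}
```

The personnel record and controlled vocabulary list could be carried in the same way, each as its own Serializable type with its own remote method.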

The LDA also has a sub-component that monitors database activity. When the main entry table in the local database is modified, a trigger is activated. Most RDBMSs have this triggering capability built in. The trigger causes the updated content to be captured by the LDA handler. The handler then uses this event to build a message that is entered into a schedule table for transmission on the network. The scheduler has both synchronous and asynchronous message-passing capabilities. While the message is propagated at the time of the event, an entry is retained in the schedule table until an acknowledgment is received. If no acknowledgment is received in a reasonable time, for example because of network problems or a node being down, the scheduler wakes up and tries to send the message again at a later time. At the other nodes, the LDA receives the Java serialized DIF object in XML format and passes it to the local database application.
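The retain-until-acknowledged behavior of the scheduler can be sketched as follows. This is an illustration under stated assumptions, not the MD8 implementation: the Transport interface and RetryScheduler class are hypothetical, an in-memory map stands in for the schedule table, and timing is left out (a real scheduler would call retryPending from a timer).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical transport: returns true when the remote node acknowledges the message. */
interface Transport {
    boolean send(String targetNode, String payload);
}

final class RetryScheduler {
    /** One row of the in-memory stand-in for the schedule table. */
    private static final class Pending {
        final String targetNode;
        final String payload;

        Pending(String targetNode, String payload) {
            this.targetNode = targetNode;
            this.payload = payload;
        }
    }

    private final Map<Long, Pending> scheduleTable = new LinkedHashMap<>();
    private final Transport transport;
    private long nextId = 1;

    RetryScheduler(Transport transport) {
        this.transport = transport;
    }

    /** Records the message, then tries to deliver it at once; the row stays until acknowledged. */
    long submit(String targetNode, String payload) {
        long id = nextId++;
        Pending pending = new Pending(targetNode, payload);
        scheduleTable.put(id, pending);
        if (deliver(pending)) {
            scheduleTable.remove(id);
        }
        return id;
    }

    /** Called when the scheduler wakes up: resend every message that is still unacknowledged. */
    void retryPending() {
        Iterator<Pending> rows = scheduleTable.values().iterator();
        while (rows.hasNext()) {
            if (deliver(rows.next())) {
                rows.remove();
            }
        }
    }

    int pendingCount() {
        return scheduleTable.size();
    }

    /** A transport failure is treated the same as a missing acknowledgment. */
    private boolean deliver(Pending pending) {
        try {
            return transport.send(pending.targetNode, pending.payload);
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

Keeping the row until the acknowledgment arrives means a message can be delivered more than once, so a receiving node should treat a repeated DIF as an update rather than an error.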

Persistence Model for Saving Metadata

A persistence model for the DIF was built which is based on the model presented in the book Database Programming with JDBC and Java by George Reese.  This model is transaction based.  When a set of objects is created or modified, a lock is placed on each modified object and then the object is passed to a transaction class.  When the user is done with a transaction, the modifications are committed to the database in an atomic operation.

Incoming DIFs, packaged in XML, are received from the LDA or from users of metadata authoring tools. First the XML (or colon-delimited) document is parsed and validated against the Document Type Definition (DTD). This DTD specifies the rules, syntax and type of the DIF document. After the document is parsed, a Document Object Model (DOM) object is built. This DOM object is then used to build persistent objects that are mapped to the local database schema.

Another important part of the persistence model is the set of peer classes. Each persistent object has a peer. The business logic of each object is retained in the persistent class, and all the logic associated with saving the object into a particular type of data store is retained in the peer. Isolating the business logic from the logic that interfaces directly with the data store allows more extensibility and flexibility. Only the peer is tied to a particular data store architecture. Furthermore, the peers for RDBMS data stores use JDBC, which allows them to work with most database implementations.
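The split between a persistent class and its peer can be sketched as below. The names DifPeer, DifRecord and InMemoryDifPeer are hypothetical, and the in-memory peer is only a stand-in for a data store; a JDBC-backed peer would implement the same interface with SQL statements.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Peer: all logic tied to one kind of data store lives behind this interface. */
interface DifPeer {
    void save(DifRecord record);

    Optional<DifRecord> load(String entryId);
}

/** Persistent class: holds business logic only and delegates storage to a peer. */
final class DifRecord {
    private final String entryId;
    private String title;

    DifRecord(String entryId, String title) {
        this.entryId = entryId;
        this.title = title;
    }

    String entryId() { return entryId; }

    String title() { return title; }

    /** Example business rule: a title must not be blank. */
    void rename(String newTitle) {
        if (newTitle == null || newTitle.trim().isEmpty()) {
            throw new IllegalArgumentException("title must not be blank");
        }
        this.title = newTitle.trim();
    }

    void saveWith(DifPeer peer) {
        peer.save(this);
    }
}

/** In-memory stand-in for a data store, used here instead of a JDBC peer. */
final class InMemoryDifPeer implements DifPeer {
    private final Map<String, DifRecord> rows = new HashMap<>();

    @Override
    public void save(DifRecord record) {
        // Store a snapshot so later changes to the caller's object are not persisted implicitly.
        rows.put(record.entryId(), new DifRecord(record.entryId(), record.title()));
    }

    @Override
    public Optional<DifRecord> load(String entryId) {
        return Optional.ofNullable(rows.get(entryId));
    }
}
```

Swapping the data store then means supplying a different DifPeer; DifRecord and its business rules stay unchanged.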

Translator Between Schemas

When nodes do not maintain their metadata in the same schema, an added step is required to successfully exchange data: the metadata must be translated between schemas. Problems may arise from the fact that there is seldom a perfect one-to-one mapping between metadata fields. In the worst case, imprecise mapping between fields can result in content being lost. GCMD currently uses a tool called DOCMORPH that was developed to 'morph' documents between a set of metadata schemas, which allows DIF records to be presented in a number of metadata schemas. XML enables easier translation because the DTD defines and describes, in some detail, what each field contains. This makes it much easier to build automated tools to map between fields of two different DTDs. In fact, many such COTS tools already exist (such as Bluestone Software's Visual-XML) and should only improve with time. There could also be more consistency in metadata presentation, since the originating organization's XML style sheet could be used by all IDN nodes to present the XML metadata record.

Project Goals

The new system that is being developed will allow IDN partners to share their metadata among nodes. The barriers to sharing content will be removed to allow easier access to more current metadata. The automated exchange of metadata is an important step toward a system where data discovery is friendlier to the novice user. The amount of maintenance and support for this new system must be minimized. From initial installation of the software to periodic upgrades and everyday operations, the application must be low maintenance if it is to be successful. The IDN does not have the manpower to manage a network of nodes where the installation and daily operations are not straightforward and easy for each local node to perform. Each node must therefore be relatively self-sufficient. If the software is difficult to manage, partners will be difficult to find. An InstallShield installer for the application is an important goal. This would allow the entire application to be installed with a few clicks of the mouse. Java also supports auto upgrades so that when a new upgrade of the software is released by the GCMD, each individual node can easily upgrade the application without having to do a complete reinstall.

The scale of this project is ambitious for the IDN participants. To manage the inherent risks, we have divided the development into two phases. The first phase is to create a small pilot project of 3 or 4 homogeneous nodes. This is the simpler case where all the nodes employ DIF as their metadata schema. The project is currently in the design and implementation stage. We expect to release phase 1 later this year. The second phase will build the translators that enable the handling of heterogeneous metadata content.
 
 
 
 
 
 

Webmaster:  Monica Holland
Responsible NASA Official:  Lola Olsen
Last Updated: October 2008