D. Kendig, C. Gokey, R.T. Northcutt,
L. Olsen (Goddard Space Flight Center, Greenbelt, MD, USA); O.
Bukhres, S. Sikkuppparbathyam (Purdue University)
A challenging area in building a
directory of Earth science metadata is the exchange of metadata
content among partner organizations. The complex issues of
heterogeneous metadata schema, database schema, database implementation
and platforms are difficult to overcome when building a turn-key
solution. At the GCMD, we are building a distributed system
which is not limited to a particular database architecture.
This design will allow automated exchange of metadata content
among Earth science collaborators. The system will include
a local database agent for each partner that captures database
updates and broadcasts them to other cooperating nodes. While
the exchange of metadata and the validation of the content
is distributed over the network of partners, the controlled
vocabulary will be managed by one administrative
node. JDBC will be used as an abstract interface
to the RDBMS to build a more generic and portable
solution. RMI will be used as a lightweight object request
broker to exchange metadata content within the network.
The number of Earth science data sets available each year
to the public is increasing exponentially. The descriptions
of these data sets are distributed over the internet at a
variety of sites with a variety of search capabilities.
This distributed aspect makes it challenging to search for
metadata entries in an efficient way. Partner directories
are also interested in sharing content in order to improve
the visibility of metadata records within the scientific
communities that they serve. The distributed, heterogeneous
nature of the internet makes it difficult to exchange and share
content easily. What compounds the difficulty in
locating data sets and exchanging metadata records is the
fact that there are varying metadata formats. Although
progress is being made toward the ISO 19115 metadata standard,
a large number of legacy metadata repositories will remain.
Many repositories may also differ widely in how the metadata
are physically stored on their local sites. In addition
to differing metadata schema, the database implementation,
database schema and platform often vary greatly among metadata
archives. These heterogeneous aspects have made it difficult
to do distributed searches and to exchange metadata among
collaborating organizations.
There are two standard architectures for searching distributed
databases. A distributed search places a query to each of
the sites that are to be searched. Either all nodes in the
system must be searched or the user must choose which nodes
are to be queried. As the number of nodes becomes large,
it may not be obvious to the user which nodes to query and
searching all nodes is not practical. This architecture also
tends to be slow due to networking latencies between the user's
site and each of the sites that are queried. In addition,
the search results and their timeliness will only be
as good as the search engines that are available on each of
the sites. If even one node being searched is slow or
has limited search capabilities, the distributed query is
affected. There is the additional issue of dealing with
multiple listings for a record that may be registered with
more than one archive. The same metadata records may
be hosted at more than one site and thus appear multiple times
within a search result.
The other architecture is a centralized search, in which
a query is placed to a single central node. In this case,
data is passed from subordinate nodes to the central node
and consistency is encouraged. While the centralized approach
allows for faster queries, it limits the autonomy of the subordinate
nodes and creates a single point of failure. The exchange
of metadata to the central node is an additional overhead
in this architecture.
The International Directory Network (IDN) chose to use a
distributed architecture for input but a centralized architecture
for searching. Distributed input allows metadata to be registered
with the local node and then mirrored to a centralized node
for search. This avoids the distributed search issues such
as network latency and redundant search results. The
IDN partners will be able to share metadata content among
themselves and control which content they choose to host at
their local site. This allows an individual node to
grow in content, yet retain its topical theme by filtering
which records are accepted at its site. The GCMD will continue
its role of assuring the quality and consistency of metadata
content. The GCMD administers a set of valid search terms
and reviews metadata content for consistency. Having
a well-defined set of search terms that are universally applied
to the metadata results in a more consistent, directed search
with a narrower and more focused selection of metadata records.
Scientific agencies and researchers are encouraged to share
metadata with their partners to populate the network of directories.
Additional advantages of sharing metadata content are improved
quality of the metadata, increased cooperation within agencies
and among researchers and improved search results. The current
exchange of metadata among IDN partners tends to be manual
and labor intensive. No automated mechanism is used to monitor
IDN databases for content updates and notify other sites that
may have an interest in these updates. Administrators rarely
have time to audit every database modification and
propagate the modifications to others.
To integrate metadata directories and to allow metadata to
flow between directories and clusters of directories, an automated
solution needed to be found. The solution that is currently
being implemented will federate metadata directories by capturing
metadata updates, additions and deletions and automatically
propagating the modifications throughout the network of directories.
The database administrator is freed from the mundane duties
of tracking and broadcasting content updates while errors
are eliminated through automation. The translation of metadata
records between schemas is also needed. A translation component
will perform on-the-fly conversions between varying metadata
schema when content is exchanged between members using differing
metadata schema. The level of automation is high and
the ownership of the metadata content is preserved within
the system. Careful consideration was given to the issue of metadata
ownership. A site is prevented from modifying a metadata
record that is owned by another
node. The local node also has the autonomy to decide which
IDN content it chooses to include at its site. If a local
node does not want to host a metadata record, it may choose
not to allow the record into its system, but it will not
be able to remove the record from the entire system. This prevents
the catastrophic case of one node accidentally deleting another
node's content from the entire network. When a successful
solution is built, it will result in better populated directories
that are faster, require less complex searches, and have the
most up-to-date content.
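The ownership rules described above can be illustrated with a minimal in-memory sketch. The class `DirectoryNode`, its methods and the node codes are hypothetical names chosen for illustration; they are not taken from the IDN software. The sketch assumes that a removal is broadcast only when the removing node owns the record.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory node; class and method names are hypothetical. */
final class DirectoryNode {
    private final String nodeCode;
    // entry ID -> code of the node that owns the record
    private final Map<String, String> ownerByEntry = new HashMap<>();

    DirectoryNode(String nodeCode) {
        this.nodeCode = nodeCode;
    }

    /** Host a record (local or received); hosting does not transfer ownership. */
    void host(String entryId, String ownerCode) {
        ownerByEntry.put(entryId, ownerCode);
    }

    /** Only the owning node may modify a record. */
    boolean mayModify(String entryId) {
        return nodeCode.equals(ownerByEntry.get(entryId));
    }

    /**
     * Stop hosting a record locally. Returns true only if the removal should
     * also be broadcast, which is assumed to apply only to records this node owns.
     */
    boolean remove(String entryId) {
        String owner = ownerByEntry.remove(entryId);
        return nodeCode.equals(owner);
    }

    boolean hosts(String entryId) {
        return ownerByEntry.containsKey(entryId);
    }
}
```

A node that drops a record it does not own only stops hosting it; the `false` return value signals that nothing is propagated to the rest of the network.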
A three-tier distributed database system is being built for
the IDN that will allow data sharing among partners within
a network of nodes. The first tier is the client, which consists
of a lightweight user interface. The second tier is
an object-oriented logic component built around Java servlets,
and the third tier is an information store such as an RDBMS.
A distributed architecture was chosen to avoid the
single-point-of-failure risk of a centralized architecture.
This architecture is also more compatible with a design requirement
that allows for a node to retain ownership and autonomy of
its content.
Within the network of nodes it is essential that each metadata
entry has its own unique identifier. Each local
node must already maintain a unique entry ID for each metadata
record within its local autonomous database. While this guarantees
a unique identifier within a specific site, to extend the
uniqueness across the network of nodes, each node is assigned
a unique three-digit code. This code is used as a prefix
to the local unique identifier. The compound ID is called
the Global Entry ID (GE_ID) and is unique over the network
of all nodes. The intent is to make this global identifier
synonymous with the data set in order to allow researchers
the ability to use it as a citation within journals and publications.
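As a minimal sketch of how such a compound identifier might be composed and taken apart again, consider the helper below. The class name `GlobalEntryId` and the `-` separator are assumptions; the text specifies only that a three-digit node code prefixes the local identifier.

```java
import java.util.Locale;

/** Sketch of a Global Entry ID: three-digit node code + local entry ID. */
final class GlobalEntryId {
    // The separator is an assumption; the text does not say how the parts are joined.
    private static final char SEPARATOR = '-';

    private GlobalEntryId() {}

    static String compose(int nodeCode, String localEntryId) {
        if (nodeCode < 0 || nodeCode > 999) {
            throw new IllegalArgumentException("node code must fit in three digits");
        }
        if (localEntryId == null || localEntryId.isEmpty()) {
            throw new IllegalArgumentException("local entry ID must not be empty");
        }
        // %03d pads the node code with leading zeros to exactly three digits.
        return String.format(Locale.ROOT, "%03d%c%s", nodeCode, SEPARATOR, localEntryId);
    }

    static int nodeCodeOf(String geId) {
        requireWellFormed(geId);
        return Integer.parseInt(geId.substring(0, 3));
    }

    static String localEntryIdOf(String geId) {
        requireWellFormed(geId);
        return geId.substring(4);
    }

    private static void requireWellFormed(String geId) {
        boolean ok = geId != null && geId.length() > 4 && geId.charAt(3) == SEPARATOR;
        for (int i = 0; ok && i < 3; i++) {
            ok = geId.charAt(i) >= '0' && geId.charAt(i) <= '9';
        }
        if (!ok) {
            throw new IllegalArgumentException("not a Global Entry ID: " + geId);
        }
    }
}
```

With these assumptions, `GlobalEntryId.compose(7, "MODIS_L2")` yields `007-MODIS_L2`, where `MODIS_L2` stands in for whatever entry ID the local database already uses.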
GCMD's Version 8 (MD8) is being designed and implemented
in Java. There are a number of reasons that Java was chosen.
Paramount among these is its portability. The operating environment
of the nodes on the network is heterogeneous, and Java allows
for a 'write once, run anywhere' application. Java's
object-oriented features are essential to a design that is
robust, extensible and allows a node's local database implementation
to be encapsulated. The persistence model and the Local
Database Agent (LDA) are both good examples of components
that take advantage of OO features to abstract away some of
the local database complexities and therefore are not locked
into a particular database implementation.
Java DataBase Connectivity (JDBC) strengthens the portability
even further by creating a generic API to the local database.
This creates an even more flexible application that is not
reliant on a particular database implementation. Another Java
feature that is utilized is Remote Method Invocation (RMI).
This supplies the backbone of the connectivity between the
nodes. Using serialization to pass objects with RMI allows
for a simple object request broker (ORB) without requiring
cooperating partners to purchase a commercial,
heavyweight ORB. Additional features of Java such as
built-in security, garbage collection, Swing components, stability
and wide acceptance all contribute to a quick development
schedule with a robust design.
The metadata content that is passed among nodes is in GCMD
Directory Interchange Format (DIF) schema and stored as an
XML document. XML support is advantageous for a number
of reasons. XML allows for a more open, extensible system,
so that others may use the DTD or style sheets to build their
own utilities. The DTD will allow for the fundamental
validation of the metadata content. It will verify that
all the required fields are present and that the order and
type of the fields are correct. XML has a number
of support utilities already available in Java that the developer
is not required to build. A freely available SAX-based Java
parser will take the XML document, parse
it, compare it against the definitions and constraints defined
in the DTD and finally build the Document Object Model (DOM)
object. This DOM is an application object that is a
tree representation of the XML document that was just parsed.
The developer may then use a freely available library of methods
to access and manipulate the DOM object in order to create
a persistent DIF object.
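The parse, validate and build-DOM sequence described above might look roughly like the following sketch. It uses the JAXP `DocumentBuilder` from the Java standard library, which is not necessarily the parser the project used. The inline DTD is a toy with two fields and is not the real DIF DTD; the class name `DifXml` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DifXml {
    // Toy DTD, NOT the real DIF DTD: two required fields in a fixed order.
    static final String TOY_DOCTYPE =
        "<!DOCTYPE DIF [\n"
      + "  <!ELEMENT DIF (Entry_ID, Entry_Title)>\n"
      + "  <!ELEMENT Entry_ID (#PCDATA)>\n"
      + "  <!ELEMENT Entry_Title (#PCDATA)>\n"
      + "]>\n";

    private DifXml() {}

    /** Parses and validates; throws IllegalArgumentException if the DTD is violated. */
    static Document parse(String difElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            // By default validation errors are only reported; rethrow them instead.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(TOY_DOCTYPE + difElement)));
        } catch (Exception e) {
            throw new IllegalArgumentException("DIF document rejected: " + e.getMessage(), e);
        }
    }

    /** Reads the text of the first element with the given tag from the DOM tree. */
    static String text(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).item(0).getTextContent();
    }
}
```

A document that omits a required field, or lists the fields in the wrong order, is rejected before any persistent object is built from it.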
One of the key components is the Local Database Agent (LDA).
It resides on each node and will encapsulate the local database.
The LDA serves to hide some of the implementation complexities
by providing a standard API for receiving incoming content
from other nodes, monitoring local database content and broadcasting
new or modified local content. It is designed to be extensible
so that it may also work with text based storage mechanisms.
Java's portability and OO features allow for extensibility and reuse
as some components may need to be slightly customized for
a local node. The LDA is intended to be a plug-and-play type
application that acts as a mediator between the local database
implementation and the network of related nodes. It will have
a messaging component that will listen for incoming content
from other nodes and an announcer component that broadcasts
local modifications to the network of nodes. Java's RMI is
used to communicate among nodes and serves as a lightweight
object request broker for exchanging serialized DIF
objects between LDAs. The RMI server will be the broker
for three types of objects: the DIF, a personnel record and
a list of controlled vocabulary.
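RMI passes non-remote arguments between nodes by Java serialization, so a DIF exchanged this way must be a `Serializable` object and the receiving LDA must expose a `Remote` interface. The sketch below shows one possible shape. `DifRecord`, `DifReceiver` and `WireFormat` are hypothetical names, and the round trip runs locally through byte arrays rather than through an RMI registry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Hypothetical payload: a DIF carried as an XML string plus its entry ID. */
final class DifRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String entryId;
    final String xml;

    DifRecord(String entryId, String xml) {
        this.entryId = entryId;
        this.xml = xml;
    }
}

/** Hypothetical remote interface that a receiving LDA might export over RMI. */
interface DifReceiver extends Remote {
    void receive(DifRecord record) throws RemoteException;
}

/** Local stand-in for the marshalling RMI performs on method arguments. */
final class WireFormat {
    private WireFormat() {}

    static byte[] toBytes(DifRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static DifRecord fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (DifRecord) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The receiver always works on a copy of the sender's object, which is why each node can keep its own persistent representation of a shared record.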
The LDA also has a sub-component that monitors database activity.
When the main entry table in the local database is modified,
a trigger is activated. Most RDBMSs have this triggering
capability built in. The trigger causes the updated
content to be captured by the LDA handler. The handler
then uses this event to build a message that is entered into
a schedule table for transmission on the network. The
scheduler has both synchronous and asynchronous message-passing
capabilities. While the message is propagated at the time
of the event, an entry is retained in the scheduler table
until an acknowledgment is received. If no acknowledgment
is received in a reasonable time, for example because of
network problems or a node being down, the scheduler wakes
up and tries to send the message again at a later time. At
the other nodes, the LDA receives the serialized Java DIF
object, with its content in XML format, and passes it to the local database application.
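The retain-until-acknowledged behavior of the scheduler can be sketched as a small outbox. This is an illustration under assumptions: the class `Outbox`, the numeric message IDs and the fixed retry interval are hypothetical, the schedule table is modeled as an in-memory map, and time is passed in explicitly so that the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the schedule table; names are hypothetical. */
final class Outbox {
    private static final class Pending {
        final String payload;
        long lastSentMillis;

        Pending(String payload, long lastSentMillis) {
            this.payload = payload;
            this.lastSentMillis = lastSentMillis;
        }
    }

    private final Map<Long, Pending> pending = new LinkedHashMap<>();
    private final long retryAfterMillis;
    private long nextId = 1;

    Outbox(long retryAfterMillis) {
        this.retryAfterMillis = retryAfterMillis;
    }

    /** Records a message as sent at time 'now' and keeps it until acknowledged. */
    long send(String payload, long now) {
        long id = nextId++;
        pending.put(id, new Pending(payload, now));
        return id;
    }

    /** An acknowledgment arrived: the entry can leave the schedule table. */
    void acknowledge(long id) {
        pending.remove(id);
    }

    /** IDs that are still unacknowledged after the retry interval; marks them resent. */
    List<Long> retryDue(long now) {
        List<Long> due = new ArrayList<>();
        for (Map.Entry<Long, Pending> entry : pending.entrySet()) {
            if (now - entry.getValue().lastSentMillis >= retryAfterMillis) {
                entry.getValue().lastSentMillis = now;
                due.add(entry.getKey());
            }
        }
        return due;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

A real scheduler would call something like `retryDue` from a timer and retransmit each returned message; persisting the entries in a database table, as the text describes, lets them survive a restart of the node.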
A persistence model for the DIF was built which is based
on the model presented in the book Database Programming
with JDBC and Java by George Reese. This model is
transaction based. When a set of objects is created
or modified, a lock is placed on each modified object and
then the object is passed to a transaction class. When
the user is done with a transaction, the modifications are
committed to the database in an atomic operation.
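The lock-then-commit flow can be illustrated with a toy in-memory store. This is not the API from the cited book; `Store` and `Transaction` are hypothetical names, the sketch is single-threaded, and a map update stands in for the database commit that a JDBC-backed version would perform.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy store: staged changes become visible only on commit. Not thread-safe. */
final class Store {
    private final Map<String, String> committed = new HashMap<>();
    private final Set<String> locked = new HashSet<>();

    String get(String id) {
        return committed.get(id);
    }

    Transaction begin() {
        return new Transaction();
    }

    final class Transaction {
        private final Map<String, String> staged = new HashMap<>();
        private boolean open = true;

        /** Locks the object for this transaction and stages the new value. */
        void modify(String id, String newValue) {
            requireOpen();
            // locked.add returns false when another transaction holds the lock.
            if (!staged.containsKey(id) && !locked.add(id)) {
                throw new IllegalStateException("locked by another transaction: " + id);
            }
            staged.put(id, newValue);
        }

        /** Applies all staged changes together and releases the locks. */
        void commit() {
            requireOpen();
            committed.putAll(staged);
            release();
        }

        /** Discards all staged changes and releases the locks. */
        void rollback() {
            requireOpen();
            release();
        }

        private void release() {
            locked.removeAll(staged.keySet());
            staged.clear();
            open = false;
        }

        private void requireOpen() {
            if (!open) {
                throw new IllegalStateException("transaction already finished");
            }
        }
    }
}
```

Until `commit` is called, readers of the store see only the previously committed values, and a second transaction that tries to modify the same object is refused.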
Incoming DIFs, packaged in XML, are received from the LDA
or from users that are using metadata authoring tools.
First the XML (or colon-delimited) document is parsed and
validated against the Document Type Definition (DTD).
The DTD specifies the rules, syntax and field types of the DIF document.
After the document is parsed, a Document Object Model
(DOM) object is built. This DOM object is then used to build
persistent objects that are mapped to the local database schema.
Another important sub-component of the persistence model
is the set of peer classes. Each persistent object has a peer.
The business logic of each object is retained in the persistent
class, and all the logic associated with saving the object
into a particular type of data store is retained in the peer.
Isolating the business logic from the logic that interfaces
directly with the data store allows more extensibility and
flexibility. Only the peer is tied to a particular data store
architecture. Furthermore, the peers for RDBMS data
stores use JDBC, which allows them
to work properly with almost all database implementations.
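The split between a persistent class and its peer might be sketched as follows. All names here (`DifPeer`, `InMemoryDifPeer`, `Dif`) are hypothetical, and the map-backed peer merely stands in for a JDBC peer, which would implement the same interface with SQL statements.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Storage-specific logic lives behind this interface (hypothetical name). */
interface DifPeer {
    void save(String entryId, String title);

    Optional<String> loadTitle(String entryId);
}

/** Stand-in peer backed by a map; a JDBC peer would issue SQL here instead. */
final class InMemoryDifPeer implements DifPeer {
    private final Map<String, String> rows = new HashMap<>();

    @Override
    public void save(String entryId, String title) {
        rows.put(entryId, title);
    }

    @Override
    public Optional<String> loadTitle(String entryId) {
        return Optional.ofNullable(rows.get(entryId));
    }
}

/** Persistent object: holds the business rules, knows nothing about the store. */
final class Dif {
    private final String entryId;
    private String title;
    private final DifPeer peer;

    Dif(String entryId, String title, DifPeer peer) {
        this.entryId = entryId;
        this.title = title;
        this.peer = peer;
    }

    /** Business rule (illustrative): titles are trimmed and must not be empty. */
    void rename(String newTitle) {
        if (newTitle == null || newTitle.trim().isEmpty()) {
            throw new IllegalArgumentException("title must not be empty");
        }
        title = newTitle.trim();
    }

    /** Delegates all storage work to the peer. */
    void store() {
        peer.save(entryId, title);
    }
}
```

Swapping the data store then means supplying a different `DifPeer` implementation; the `Dif` class and its rules stay unchanged.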
When nodes do not maintain their metadata in the same schema,
an added step is required to successfully exchange data. The
metadata must be translated between schema. Problems
may arise from the fact that there is seldom a perfect
one-to-one mapping between metadata fields. In the worst
case, imprecise mapping between fields
can result in content being lost. GCMD currently uses a tool
called DOCMORPH that was developed to 'morph' documents between
a set of metadata schema, which allows DIF records to be presented
in a number of metadata schema. XML enables easier translation
because the DTD defines and describes in some detail what
each field contains. This makes it much easier to build automated
tools to map between fields of two different DTDs. In
fact, many such COTS tools already exist (such as Bluestone Software's
Visual-XML) and should only improve with time. There could
also be more consistency in metadata presentation, since the
originating organization's XML style sheet could be used by
all IDN nodes to present the XML metadata record.
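A field-by-field translation, including the case where a field has no counterpart in the target schema, can be sketched as below. The class `SchemaTranslator` is hypothetical and is not DOCMORPH; the mapping table is supplied by the caller. Reporting unmapped fields, rather than silently dropping them, makes potential content loss visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Renames fields according to a mapping table; names are hypothetical. */
final class SchemaTranslator {
    /** Translated record plus the source fields that had no target field. */
    static final class Result {
        final Map<String, String> record;
        final List<String> unmappedFields;

        Result(Map<String, String> record, List<String> unmappedFields) {
            this.record = record;
            this.unmappedFields = unmappedFields;
        }
    }

    private final Map<String, String> fieldMap; // source field -> target field

    SchemaTranslator(Map<String, String> fieldMap) {
        this.fieldMap = new LinkedHashMap<>(fieldMap);
    }

    Result translate(Map<String, String> source) {
        Map<String, String> translated = new LinkedHashMap<>();
        List<String> unmapped = new ArrayList<>();
        for (Map.Entry<String, String> field : source.entrySet()) {
            String target = fieldMap.get(field.getKey());
            if (target == null) {
                unmapped.add(field.getKey()); // content that would be lost
            } else {
                translated.put(target, field.getValue());
            }
        }
        return new Result(translated, unmapped);
    }
}
```

A receiving node could inspect `unmappedFields` to decide whether to accept the translated record, keep the original alongside it, or flag the record for review.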
The new system that is being developed will allow IDN partners
to share their metadata among nodes. The barriers to
sharing content will be removed to allow easier access to
more current metadata. The automated exchange of metadata
is an important step to achieve a system where data discovery
is more friendly to the novice user. The amount of maintenance
and support for this new system must be minimized. From
initial installation of the software to periodic upgrades
and everyday operations, the application must be low maintenance
if it is to be successful. The IDN does not have the
manpower to manage a network of nodes where the installation
and daily operations are not straightforward and easy for
each local node to perform. Each node must therefore
be relatively self-sufficient. If the software is difficult
to manage, partners will be difficult to find.
An InstallShield installer for the application is an important
goal. This would allow the entire application to be installed
with a few clicks of the mouse. Java also supports automatic
upgrades, so that when a new upgrade of the software is released
by the GCMD, each individual node can easily upgrade the application
without having to do a complete reinstall.
The scale of this project is ambitious for the IDN participants.
To manage the inherent risks, we have divided the development
into two phases. The first phase is to create a small
pilot project of 3 or 4 homogeneous nodes. This is the
simpler case where all the nodes are employing DIF as their
metadata schema. The project is currently in
the design and implementation stage. We expect to release
phase 1 later this year. The second phase will build
the translators that will enable the handling of heterogeneous
metadata content.