 
   
 
Tutorial: GCP Data Sources
Written by Martin Senger   
Sunday, 17 February 2008


The GCP concept of data sources is crucial for accessing and sharing various data in various applications. This tutorial explains in detail how to create and use data sources.

 

In January 2007, the DataSource interface changed slightly, maturing from version 1.0.0 to 2.0.0. The changes are not dramatic, but they require some changes in existing data source implementations. Please check the new API and read the changes.
In April 2008 and in 2009, there were a few more changes in PantheonBase, mostly in the DataConsumer interface and in the ontology. The DataSource interface remained almost untouched. Please check the new API and see the ChangeLog.
A data source represents any source of GCP data: it can be a database, a collection of Biomoby services, a flat file, or anything else. Each implementation of a data source is a simple abstraction that gives access both to metadata about the data source and to the data it holds.

The basic rule is:

Any data resource - before it can be used within or by GCP - must be represented by a Java implementation of a DataSource interface.
Note that the rule above mentions only a Java implementation and not, for example, Web Services. This is because the implementation itself can get data by calling one or many Web Services internally, but this is hidden from the users of the DataSource.

Let's introduce first the Java interface itself.

Java API for data sources


The gory details are in the Java API. Here are a few hints on where to look first and how the API parts are connected. Code examples showing how to use the API are in the next chapter.
 

org.generationcp.core.datasource.DataSource

This is the center of the GCP data source concept. Every data provider has to implement this interface in order to have her data recognized by GCP applications.

 

          

Very often, one data source will be implemented by several different classes (implementations). Typically, one implementation will use direct access (JDBC or an equivalent, e.g. Hibernate) to the underlying database and will be used locally, on machines that have such access. For example, a web application (producing a web interface/pages) running inside Tomcat will get most of its data using this kind of data source implementation (see the left picture).

Another typical implementation will use Web Services. As with any Web Service, there will be two parts: a client and a service. The service part will run on the machine where the databases are, and it behaves like the web application described above: it accesses data using the local implementation of the data source. The client part is again a Java class implementing the DataSource interface, but instead of calling JDBC directly, it uses the Web Service protocol to get data from the service part (see the right picture).

The Data Source is the only mandatory API to implement. It has a few other parts; their details are described in the API. See the next section for how to use them.

 

How to discover and instantiate data sources


The discovery and creation of Data Source instances is the responsibility of an implementation, and it can be done in many different ways. Two ways, however, are recommended, and there are classes to support them. This section describes both of them with examples. The examples shown below are available from the Pantheon/Ceres/projects/PantheonBase module.

 

This is optional but highly recommended (your project will be easier to maintain if it follows these recommendations).
Both ways are based on the Java-native Service Provider Interface (SPI) discovery mechanism. To facilitate this mechanism, the Apache Commons Discovery classes are used.

 

Discovering directly Data Sources


Put the names of all classes that implement the org.generationcp.core.datasource.DataSource interface into a file named after the interface itself, and put this file in this directory structure:
META-INF
    services
        org.generationcp.core.datasource.DataSource
Then create a jar file keeping this structure, and put it on your CLASSPATH.

The PantheonBase module has this structure in src/etc/spi/datasources, and the jar file for the sample DataSource can be created by the spi target of the Ant build.xml file found in PantheonBase:

ant spi
Finally, here is the code that discovers and instantiates the data sources listed in the org.generationcp.core.datasource.DataSource file:
import org.generationcp.core.datasource.DataSource;
import org.apache.commons.discovery.tools.Service;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
...
List dataSources = new ArrayList();
Enumeration spe = Service.providers (DataSource.class);

while (spe.hasMoreElements()) {
    dataSources.add (spe.nextElement());
}

// all available DataSource instances are now in 'dataSources'
But don't use it. There is a better way: a DataSourceRegistry, a class representing a registry that maintains a list of the data sources that are available for the current application.

The DataSourceRegistry should be accessed through a singleton instance obtained by calling instance(). This guarantees the same registry in all parts of your application without explicitly passing it between them. Here is how you should get the available data sources:

import org.generationcp.core.datasource.DataSourceRegistry;
import java.util.List;

...

List dataSources = DataSourceRegistry.instance().getDataSources();

 

  There are a few points to remember:

  • In order to be instantiated this way, the Data Source implementation classes need to have a no-argument constructor.
  • There is no error message if something goes wrong with data source discovery (e.g. if you forget to create the jar file with class names). The DataSourceRegistry does, however, report the loaded data sources in a log (at the info message level).
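
Both points follow from how SPI-style discovery works: classes are instantiated by name, with no arguments, and names that cannot be loaded are simply skipped. The following is a minimal sketch of that step only. It does not use Apache Commons Discovery or the GCP classes; NamedSource, PlantImages and AnimalImages are hypothetical stand-ins, and the silent skipping is an assumption made to mirror the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

class SpiInstantiationSketch {

    /** Hypothetical stand-in for a data source interface (not the GCP API). */
    public interface NamedSource {
        String getName();
    }

    /** Sample provider with a no-argument constructor. */
    public static class PlantImages implements NamedSource {
        public PlantImages() { }
        public String getName() { return "Plant Images"; }
    }

    /** Another sample provider with a no-argument constructor. */
    public static class AnimalImages implements NamedSource {
        public AnimalImages() { }
        public String getName() { return "Animal Images"; }
    }

    /**
     * Instantiates every class named in the text of a provider file
     * (one class name per line, '#' starts a comment).
     */
    static List<NamedSource> load(String configText) {
        List<NamedSource> found = new ArrayList<>();
        for (String line : configText.split("\n")) {
            int hash = line.indexOf('#');
            String className = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (className.isEmpty()) {
                continue;
            }
            try {
                // Only a no-argument constructor can be called here,
                // because nothing tells us which arguments to pass.
                Class<?> c = Class.forName(className);
                found.add((NamedSource) c.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                // Assumption: unusable entries are skipped without an error.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String config = "# sample provider file\n"
                + PlantImages.class.getName() + "\n"
                + AnimalImages.class.getName() + "\n"
                + "no.such.Provider\n";
        for (NamedSource source : load(config)) {
            System.out.println("Loaded data source: " + source.getName());
        }
    }
}
```

A misspelled class name in the provider file therefore produces a shorter list rather than an exception, which is why checking the log is worthwhile.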

Sometimes, however, this way of discovering data sources is not well suited (as noted below). In that case, go for factories...

Discovering Data Sources via their factories


Sometimes various data sources have almost identical implementations and differ only in a few parameters that would normally be given to them in a constructor. But the DataSourceRegistry approach to DataSource management described above permits only a no-argument (empty) constructor.

For example (taken again from the PantheonBase), a data source fetching documents, or images, from the web needs to specify a URL from which to retrieve the documents, but otherwise it will be identical with other similar data sources. How can you register this kind of configuration constraint for a DataSource?

In such cases, you can use a DataSourceFactory. The same SPI discovery mechanism now discovers factories (instead of data sources). Once instantiated, each factory reads a configuration file in which it finds which data sources to create and what values to pass to their constructors (so data sources can now have regular constructors with a full set of parameters). Such a factory then returns all its data sources using the getDataSources() method.
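
The factory idea can be sketched as follows. This is only an illustration under stated assumptions: Source, WebDocumentSource and WebDocumentSourceFactory are hypothetical stand-ins rather than the PantheonBase classes, the property keys and URLs are invented for the example, and the configuration is built in memory where a real factory would read a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class FactorySketch {

    /** Hypothetical stand-in for a data source (not the GCP DataSource interface). */
    interface Source {
        String getName();
        String getUrl();
    }

    /** A data source that needs constructor parameters. */
    static final class WebDocumentSource implements Source {
        private final String name;
        private final String url;

        WebDocumentSource(String name, String url) {
            this.name = name;
            this.url = url;
        }

        public String getName() { return name; }
        public String getUrl() { return url; }
    }

    /** The factory keeps a no-argument constructor so it can be discovered by class name. */
    static final class WebDocumentSourceFactory {
        private final Properties config;

        WebDocumentSourceFactory() {
            this(defaultConfig());
        }

        WebDocumentSourceFactory(Properties config) {
            this.config = config;
        }

        /** Creates one data source per numbered entry found in the configuration. */
        List<Source> getDataSources() {
            List<Source> result = new ArrayList<>();
            for (int i = 1; ; i++) {
                String name = config.getProperty("source." + i + ".name");
                String url = config.getProperty("source." + i + ".url");
                if (name == null || url == null) {
                    break; // no more entries
                }
                result.add(new WebDocumentSource(name, url));
            }
            return result;
        }

        /** Placeholder configuration; a real factory would load this from a file. */
        private static Properties defaultConfig() {
            Properties p = new Properties();
            p.setProperty("source.1.name", "Plant Images 1");
            p.setProperty("source.1.url", "http://example.org/plants1");
            p.setProperty("source.2.name", "Plant Images 2");
            p.setProperty("source.2.url", "http://example.org/plants2");
            return p;
        }
    }

    public static void main(String[] args) {
        for (Source s : new WebDocumentSourceFactory().getDataSources()) {
            System.out.println(s.getName() + " -> " + s.getUrl());
        }
    }
}
```

Adding another similar data source then means adding two lines to the configuration, with no new class.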

Again, put the names of all classes that implement the org.generationcp.core.datasource.DataSourceFactory interface into a file named after the interface itself, and put this file in this directory structure:
META-INF
    services
        org.generationcp.core.datasource.DataSourceFactory
Then create a jar file keeping this structure, and put it on your CLASSPATH.

The good thing is that the DataSourceRegistry can be used with factories, as well. Here is what it does:
  1. First, it discovers all factories, and asks them for all their data sources.
  2. Then, it discovers individual data sources and adds them to those obtained from factories.

Therefore, the code is still the same:

import org.generationcp.core.datasource.DataSourceRegistry;
import java.util.List;

...

List dataSources = DataSourceRegistry.instance().getDataSources();
The code is taken from src/main/org/generationcp/core/samples/MainList.java. You can run this example by calling (on Windows, use backslashes):
ant build/run/run-list
The log file indicates that four data sources were loaded (three came via factories, and one directly). Note that the whole discovery and instantiation process took 46 milliseconds (the elapsed time in milliseconds is the number after the timestamp):
2006-06-15 20:59:47,330 0    [main] INFO  DataSourceRegistry - Loaded data source: Plant Images 1
2006-06-15 20:59:47,334 4 [main] INFO DataSourceRegistry - Loaded data source: Plant Images 2
2006-06-15 20:59:47,361 31 [main] INFO DataSourceRegistry - Loaded data source: Animal Images
2006-06-15 20:59:47,376 46 [main] INFO DataSourceRegistry - Loaded data source: Taxonomy @ EBI
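
The two-step behaviour of the registry (factories first, then individually registered data sources) and its singleton access can be sketched as below. This is not the DataSourceRegistry source code: Source, SourceFactory and Registry are hypothetical stand-ins, and the two discover methods return hard-coded values where the real registry would use SPI discovery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RegistrySketch {

    /** Hypothetical stand-in for a data source. */
    interface Source {
        String getName();
    }

    /** Hypothetical stand-in for a data source factory. */
    interface SourceFactory {
        List<Source> getDataSources();
    }

    static final class Registry {
        private static Registry instance;
        private final List<Source> sources = new ArrayList<>();

        private Registry(List<SourceFactory> factories, List<Source> direct) {
            // Step 1: ask every discovered factory for its data sources.
            for (SourceFactory factory : factories) {
                sources.addAll(factory.getDataSources());
            }
            // Step 2: add the data sources that were discovered directly.
            sources.addAll(direct);
        }

        /** Returns the single shared registry, creating it on first use. */
        static synchronized Registry instance() {
            if (instance == null) {
                instance = new Registry(discoverFactories(), discoverSources());
            }
            return instance;
        }

        List<Source> getDataSources() {
            return Collections.unmodifiableList(sources);
        }
    }

    static Source named(String name) {
        return () -> name;
    }

    /** Hard-coded stand-in for the discovery of factories. */
    static List<SourceFactory> discoverFactories() {
        SourceFactory images = () -> Arrays.asList(
                named("Plant Images 1"), named("Plant Images 2"), named("Animal Images"));
        return Collections.singletonList(images);
    }

    /** Hard-coded stand-in for the discovery of individual data sources. */
    static List<Source> discoverSources() {
        return Collections.singletonList(named("Taxonomy @ EBI"));
    }

    public static void main(String[] args) {
        for (Source s : Registry.instance().getDataSources()) {
            System.out.println("Loaded data source: " + s.getName());
        }
    }
}
```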
Eclipse tip
If you are planning to use Pantheon within the Equinox component based framework, we suggest you use the Equinox specific version of the DataSourceRegistry, which is available in the Ceres/plugins/org.generationcp.pantheon.eclipse plugin. This plug-in defines extension points for DataSource and DataSourceFactory. For more information on this, please refer to the Tutorial: Developing with Pantheon Eclipse.



How to use data sources


The Detailed Way

In this section, the detailed steps to specify a DataSource are described. The next section describes a shortcut for specifying data sources that use only the standard GCP Demeter models.

In order to show how to use data sources to find what data they provide and then to get data, we will be using an example that can be found in src/main/org/generationcp/core/samples/srs. There is a sample data source DataSourceTaxonomy that fetches data from an SRS server running at EBI.

In its constructor, it first defines what data types it provides, and what search-able attributes can be used to find such data. This is only an example - and that's why it uses its own list of names for data types and attributes (from the file SamplesConstants).

For the real GCP data sources, you should generally use the special static final String constants for data type and data type attribute identification (ontology) from the org.generationcp.ontology interface library. These are structured in the following way. Assuming that ClassName is the name of a particular GCP data type interface class (e.g. SimpleIdentifier) and ATTRIBUTE is the name of some specific attribute of that data type (e.g. the NAME of a SimpleIdentifier), then:

  • ClassNameBaseConstants.DATATYPE_ID: is the Java constant for the unique identifier of the data type
  • ClassNameBaseConstants.DATATYPE_NAME: is the Java constant for the human readable name of that data type
  • ClassNameBaseConstants.ATTRIBUTE_DATATYPE_ATTRIBUTE_ID: is the Java constant for the unique identifier of the data type attribute
  • ClassNameBaseConstants.ATTRIBUTE_DATATYPE_ATTRIBUTE_NAME: is the Java constant for the human readable name of the data type attribute

In addition to the Demeter class specific attribute constants available as defined above, some general purpose attributes are defined in the LSID.java class. These attributes are generally not class specific in their usage and semantics, so they are defined in this class instead.

public class DataSourceTaxonomy implements DataSource, SamplesConstants {

    ...

    DataType[] dataTypes;
    Map dataTypesMap;

    ...

    /**
     * Constructor
     */
    public DataSourceTaxonomy() {
        ...
        DataTypeAttribute[] attrs = new DataTypeAttribute[] {
            new DefaultDataTypeAttribute (ATTR_ID, "ID"),
            new DefaultDataTypeAttribute (ATTR_PID, "Parent ID"),
            new DefaultDataTypeAttribute (ATTR_RANK, "Rank"),
            new DefaultDataTypeAttribute (ATTR_TAXON, "Scientific name"),
            new DefaultDataTypeAttribute (ATTR_SPECIES, "Species"),
        };

        DefaultDataType taxId = new DefaultDataType (DT_TAXONOMY_ID, "Taxonomy ID");
        taxId.setSearchableAttributes (attrs);

        DefaultDataType taxAll = new DefaultDataType (DT_TAXONOMY, "Taxonomy");
        taxAll.setSearchableAttributes (attrs);

        dataTypes = new DataType[] { taxId, taxAll };

        // maps for easier access to recognized data types and
        // searchable attributes
        dataTypesMap = new HashMap();
        for (int i = 0; i < dataTypes.length; i++) {
            dataTypesMap.put (dataTypes[i].getUniqueIdentifier(), dataTypes[i]);
        }

        attributeSet = new HashSet();
        attributeSet.add (ATTR_ID);
        attributeSet.add (ATTR_PID);
        attributeSet.add (ATTR_RANK);
        attributeSet.add (ATTR_TAXON);
        attributeSet.add (ATTR_SPECIES);
        ...
    }

    ...

}
Note that this taxonomic data source can return two data types: the full taxonomy record (DT_TAXONOMY) and just the taxonomy record's identifier (DT_TAXONOMY_ID). Both can be filtered using the same set of searchable attributes. The data source indicates this by implementing the following methods:
public DataType[] getDataTypes() {
    return dataTypes;
}

public DataType getDataType (String dataTypeIdentifier) {
    return (DataType)dataTypesMap.get (dataTypeIdentifier);
}
The most important method is find(): it searches for the real data and returns them. First of all, it checks that the caller asks for a data type this data source can provide:
// can I provide the given data type?
DataType dtype = getDataType (dataTypeIdentifier);

if (dtype == null)
    throw new IllegalArgumentException ("Illegal data type '" + dataTypeIdentifier +
        "' for data source '" + getUniqueIdentifier() + "'");
Then it evaluates given filters and converts them into an appropriate query language. The Taxonomy example uses an SRS Query Language (which is not that interesting for this tutorial - you may find it in the code yourself).

What is interesting is the SearchFilter. Search filters are used to specify search criteria when a data source is asked for data. In the API, there is an example of two search filters that together define a question: "Find [publication] records published (date) after 2004 AND containing text 'GCP' in the title OR in keywords OR in abstract."
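
To make the AND/OR structure of that question concrete, here is a small sketch. It deliberately does not use the GCP SearchFilter class: Filter is a hypothetical stand-in, and the sketch assumes that one filter applies a single test to several attributes (any of them may match) while separate filters must all match. The attribute names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

class SearchFilterSketch {

    /** Hypothetical filter: the test must hold for at least one of the attributes (OR). */
    static final class Filter {
        final List<String> attributes;
        final BiPredicate<String, String> test; // (attribute value, filter value)
        final String value;

        Filter(List<String> attributes, BiPredicate<String, String> test, String value) {
            this.attributes = attributes;
            this.test = test;
            this.value = value;
        }

        boolean matches(Map<String, String> record) {
            for (String attribute : attributes) {
                String actual = record.get(attribute);
                if (actual != null && test.test(actual, value)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** A record is a hit only if every filter matches it (AND). */
    static boolean matchesAll(Map<String, String> record, List<Filter> filters) {
        for (Filter filter : filters) {
            if (!filter.matches(record)) {
                return false;
            }
        }
        return true;
    }

    /** "Published after 2004 AND 'GCP' in the title OR keywords OR abstract." */
    static List<Filter> exampleFilters() {
        Filter after2004 = new Filter(
                Arrays.asList("date"),
                (actual, wanted) -> Integer.parseInt(actual) > Integer.parseInt(wanted),
                "2004");
        Filter mentionsGcp = new Filter(
                Arrays.asList("title", "keywords", "abstract"),
                (actual, wanted) -> actual.contains(wanted),
                "GCP");
        return Arrays.asList(after2004, mentionsGcp);
    }

    static Map<String, String> record(String date, String title, String keywords, String abstractText) {
        Map<String, String> r = new HashMap<>();
        r.put("date", date);
        r.put("title", title);
        r.put("keywords", keywords);
        r.put("abstract", abstractText);
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> paper = record("2006", "Data sources", "GCP, rice", "An overview.");
        System.out.println(matchesAll(paper, exampleFilters()));
    }
}
```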

 

Example


The Taxonomy example takes the following approach (which does not save memory but may save time): it always reads all IDs of the records matching the search and then, if asked for the full taxonomy records, starts a background thread to fill the returned list. It does not wait for the list to be filled, however, and returns it at once (half empty, or completely empty). Only when the caller asks for data from the list does the list itself wait until the data are there. This is done by extending java.util.AbstractSequentialList.

This approach is possible because the DataSource specification allows empty elements: a data source may return a list in which some elements are null. The caller should just ignore these elements. Their existence indicates that there are more data matching the search filters, but for some reason they cannot be returned.
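
A much simplified sketch of such a list is shown below. It is not the code of the Taxonomy example: BackgroundFilledList is a hypothetical class that waits for the whole background load on first access, whereas the original fills its list progressively. slowFetch() is a placeholder for the real retrieval, and the null element stands for a record that could not be retrieved.

```java
import java.util.AbstractSequentialList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.ListIterator;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class BackgroundListSketch {

    /** A list that is returned at once and blocks only when its data are accessed. */
    static final class BackgroundFilledList<T> extends AbstractSequentialList<T> {
        private final CompletableFuture<List<T>> pending;

        BackgroundFilledList(Supplier<List<T>> loader) {
            // Starts loading on a background thread and returns immediately.
            this.pending = CompletableFuture.supplyAsync(loader);
        }

        /** Waits until the background load has finished. */
        private List<T> loaded() {
            return Collections.unmodifiableList(pending.join());
        }

        @Override
        public ListIterator<T> listIterator(int index) {
            return loaded().listIterator(index);
        }

        @Override
        public int size() {
            return loaded().size();
        }
    }

    /** Counts null elements, i.e. records that matched but could not be returned. */
    static int countMissing(List<?> records) {
        int missing = 0;
        for (Object record : records) {
            if (record == null) {
                missing++;
            }
        }
        return missing;
    }

    /** Placeholder for a slow remote query. */
    static List<String> slowFetch() {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Arrays.asList("347", null, "4528");
    }

    public static void main(String[] args) {
        List<String> result = new BackgroundFilledList<String>(BackgroundListSketch::slowFetch);
        // The constructor has already returned; the first access below waits for the data.
        System.out.println("records: " + result.size());
        System.out.println("lost/unretrieved: " + countMissing(result));
    }
}
```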

 

This is how src/etc/org/generationcp/core/samples/srs/MainTaxonomy.java uses the find method on all available data sources:
for (Iterator it = dataSources.iterator(); it.hasNext(); ) {
    DataSource ds = (DataSource)it.next();
    List result;
    if (onlyIdsWanted) {
        result = ds.find (DT_TAXONOMY_ID, filters, null, null);
    } else {
        result = ds.find (DT_TAXONOMY, filters, null, null);
    }

    int errorsCount = 0;

    for (Iterator it2 = result.iterator(); it2.hasNext(); ) {
        Object obj = it2.next();
        if (obj == null)
            errorsCount++;
        else
            System.out.println (obj.toString());
    }

    if (errorsCount > 0)
        System.out.println ("WARNING: " + errorsCount + " lost/unretrieved records");

}
You can use all features of this program by typing:
ant build/run/run-taxonomy -help
For example, to retrieve a list of taxonomy IDs for species whose names contain oryza, you can call (on Windows, replace slashes with backslashes and, importantly, change single quotes to double quotes):
build/run/run-taxonomy -f org.generationcp.samples.attributes:taxon '*oryza*' -lid
347
2342
4527
4528
4529
...
344459
348818
356849
360094

(Well, they may not be all - the example returns only the first one hundred. Change the code if you want to use it for real.)

To get the full taxonomic records for the same names, restricted to the rank species (check the SRS server page for details on what 'rank' means), call the following (note that it combines two search criteria):

build/run/run-taxonomy \
-f1 org.generationcp.samples.attributes:rank species \
-f2 org.generationcp.samples.attributes:taxon '*oryza*'
ID : 347
PARENT ID : 338
RANK : species
GC ID : 11
SCIENTIFIC NAME : Xanthomonas oryzae
SYNONYM : "Xanthomonas oryzae" (Uyeda and Ishiyama 1926) Dowson 1943
SYNONYM : Pseudomonas oryzae
SYNONYM : "Pseudomonas oryzae" Uyeda and Ishiyama in Ishiyama 1926
SYNONYM : Xanthomonas oryzae (ex Ishiyama 1922) Swings et al. 1990 emend. van den Mooter and Swings 1990
//
ID : 2342
PARENT ID : 36866
RANK : species
GC ID : 11
SCIENTIFIC NAME : primary endosymbiont of Sitophilus oryzae
SYNONYM : Sitophilus oryzae endosymbiont
SYNONYM : Sitophilus oryzae principal endosymbiont
//
ID : 4528
PARENT ID : 4527
RANK : species
GC ID : 1
MGC ID : 1
SCIENTIFIC NAME : Oryza longistaminata
SYNONYM : Oryza longistaminata A.Chev. & Roehr.
COMMON NAME : long-staminate rice
COMMON NAME : red rice
//
...

AbstractDataSource

To help a little bit with some of the housekeeping of DataSource implementations, an AbstractDataSource class is available. This provides for naming, storage of meta-data and management of the list of associated DataType definitions for a DataSource implementing class.

package myDataSourcePackage ;
import org.generationcp.core.datasource.AbstractDataSource ;
...
public class MyDataSource extends AbstractDataSource {
    public MyDataSource(String uid, String name) {
        super(uid, name) ;
        // optional metadata may be added
        // second arg may be any object type, not just a String
        this.addMetaData("ICIS URL", "http://www.icis.cgiar.org" ) ;
        // You can add any data types you wish
        DataType dt =
            new DefaultDataType(DT_TAXONOMY,DT_TAXONOMY,java.lang.String.class) ;
        addDataType(dt) ;
    }

    // You need to implement find() because AbstractDataSource doesn't...
    // but all the other DataSource methods are taken care of
    public List find (
            String dataTypeIdentifier,
            SearchFilter[] filters,
            String[] includedAttributesIdentifiers,
            Map options) throws IllegalArgumentException, GCPException {
        // do something interesting here...
    }
}

Another Level of GCP DataSource Implementation

The previous sections described details on how to specify a DataSource from scratch. At times, this may become a bit tedious. Some additional DataSource classes are available that are envisioned to ease the burden of GCP compliant DataSource development.

The trick is to implement your DataSource adapter as a subclass of GCPDataSource, which extends AbstractDataSource by adding a GCP compliant DataType registration function. You can use this function to register, by Class, the GCP data types you will serve, assuming that you have already imported them into your source file (see the FAQs on registering Data Types and Searchable Attributes). The following example does this for a DataSource serving Germplasm:

package org.cropinfo.icis ;
import org.generationcp.ceres.datasource.GCPDataSource ;
import org.generationcp.model.data.germplasm.Germplasm ;

...

public class ICISDataSource extends GCPDataSource {
    public ICISDataSource(String uid, String name) {

        super(uid, name) ;

        // register the Germplasm DataType with its standard searchable attributes
        // Note: the Demeter package (org.generationcp.demeter) now contains
        // autogenerated DATATYPE_ID identifiers (among other things) for all
        // GCP domain model classes. Such constants should be used here to register the datatype.

        this.registerDataType(new DefaultDataType(
                Germplasm.DATATYPE_ID,
                Germplasm.DATATYPE_NAME)
        ) ;
        // optional metadata may be added
        // second arg may be any object type, not just a String
        this.addMetaData("ICIS URL", "http://www.icis.cgiar.org" ) ;

        // You can still add additional ad hoc non-GCP standard data types
        DataType dt
            = new DefaultDataType(DT_TAXONOMY,DT_TAXONOMY,java.lang.String.class) ;
        addDataType(dt) ;
    }

    // Again, you need to implement find() because GCPDataSource doesn't...
    // but all the other DataSource methods are taken care of
    public List find (
            String dataTypeIdentifier,
            SearchFilter[] filters,
            String[] includedAttributesIdentifiers,
            Map options) throws IllegalArgumentException, GCPException {

        // GCPDataSource has an argument validation function that
        // uses information about registered DataTypes to check
        // whether or not the find() parameters are generally valid (by DATATYPE_ID, etc.)
        if (!validSearchParameters( dataTypeIdentifier, filters, includedAttributesIdentifiers))
            throw new IllegalArgumentException("Invalid find() arguments to ICISDataSource?") ;

        // do something interesting here,
        // returning Germplasm objects
        // in response to SearchFilters specified using
        // standard Germplasm searchable attributes
    }
}

That is all, folks! You now simply need to implement the find method for this adapter that expects to see (or does not worry if it sees) all the standard searchableAttributes of Germplasm, as defined in the GCP Germplasm model. The GCPDataSource superclass takes care of the housekeeping methods of DataSource for you (by extending from AbstractDataSource of course...) and also specifies all the standard searchable attributes through some behind-the-scenes knowledge about the GCP domain model. If you need to see those searchable attributes for use within, say, the find method, then you can actually access the DataType and its searchable attributes in the following simple manner:

DataType germplasmDataType = this.getDataType(Germplasm.class.getName()) ;

DataTypeAttribute[] attrs = germplasmDataType.getSearchableAttributes();

This works because the GCPDataSource class uses the Germplasm Class name as the unique identifier to identify the autogenerated DataType from Ceres corresponding to that class, that is duly initialized with all its searchable DataTypeAttributes. Thus, you just register all the standard GCP domain model interface classes you intend to serve in find as DataTypes using registerDataType method in your constructor, then focus on the implementation details of the find method, using GCP domain model interface compliant implementation classes (e.g. from Ceres).

 

Performance Issues and Design Considerations with the DataSource find() Method

Once you obtain the result data from your database (or from other sources, such as several combined calls to BioMoby Web Services), you return them as a java.util.List.

Remember that java.util.List is a Java interface. This means that it may be implemented in diverse ways, as long as the semantics of the interface's methods are maintained. The brute force solution is to pull all the data over (from the database or using web services over the web) and load a monolithic ArrayList() to return the data to the program using the DataSource.

Think again! What happens if the user's query matches thousands of entries, or more, in the database? Such a query may easily fill the whole memory. Also, sometimes the caller may not need or use all the data at once, or may initially only wish to know how many data items were matched. So consider retrieving data only when it is really needed (i.e. "just-in-time" or "lazy" data retrieval).

A wise strategy for List implementation is to provide sensible proxy implementations for its methods. Consider the following ideas:

  • Generally speaking, return a List implementation that is still, in some sense, knowledgeable about its database context. Even if it has to reconnect every time the user accesses a List method, this is an efficient trade-off against space requirements, which are more likely to be exceeded with a brute force implementation of find.

  • Implement most of the List methods as customized database queries. For example, size() could execute a SQL SELECT COUNT(*) WHERE ... to ascertain how many entries the find query parameters will hit and also, whether or not the list isEmpty(). contains() could also be a targeted SQL search on the database.

  • The List subList() method should be implemented to return a subset "view" of all the results matched by the DataSource find() parameters, since perhaps the user is only interested in the first few hits(?) or the application might provide for some sort of a "paging" through data (in small bite sized chunks).

  • Also provide an intelligent implementation for the Java Iterator returned by the List iterator() method.

And so on... The general paradigm should be clear.
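
As one possible illustration of these ideas, the sketch below answers size() from a count query and fetches rows page by page only when get() is called. It is a sketch under stated assumptions, not part of the GCP API: Backend, LazyResultList and CountingBackend are hypothetical names, and the back end is simulated in memory where a real one would run SQL queries or Web Service calls.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LazyResultListSketch {

    /** Hypothetical back end standing in for database queries or Web Service calls. */
    interface Backend<T> {
        int count();                          // e.g. a SELECT COUNT(*) query
        List<T> fetch(int offset, int limit); // e.g. a paged SELECT query
    }

    /** Read-only list that loads pages of results on demand. */
    static final class LazyResultList<T> extends AbstractList<T> {
        private final Backend<T> backend;
        private final int pageSize;
        private final Map<Integer, List<T>> pages = new HashMap<>();
        private int size = -1; // unknown until first asked

        LazyResultList(Backend<T> backend, int pageSize) {
            this.backend = backend;
            this.pageSize = pageSize;
        }

        @Override
        public int size() {
            if (size < 0) {
                size = backend.count(); // no rows are transferred for this
            }
            return size;
        }

        @Override
        public T get(int index) {
            if (index < 0 || index >= size()) {
                throw new IndexOutOfBoundsException("Index: " + index);
            }
            int page = index / pageSize;
            List<T> rows = pages.get(page);
            if (rows == null) {
                rows = backend.fetch(page * pageSize, pageSize);
                pages.put(page, rows); // simple unbounded cache, fine for a sketch
            }
            return rows.get(index % pageSize);
        }
    }

    /** In-memory back end that counts how often rows are fetched. */
    static final class CountingBackend implements Backend<Integer> {
        final int total;
        int fetchCalls = 0;

        CountingBackend(int total) {
            this.total = total;
        }

        public int count() {
            return total;
        }

        public List<Integer> fetch(int offset, int limit) {
            fetchCalls++;
            List<Integer> rows = new ArrayList<>();
            for (int i = offset; i < Math.min(offset + limit, total); i++) {
                rows.add(i);
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        CountingBackend backend = new CountingBackend(1000);
        List<Integer> hits = new LazyResultList<Integer>(backend, 50);
        System.out.println("hits: " + hits.size());
        System.out.println("first ten: " + hits.subList(0, 10));
        System.out.println("fetch calls: " + backend.fetchCalls);
    }
}
```

Because AbstractList implements subList() and iterator() on top of get(), the "view" and paging behaviour comes almost for free; a production version would also bound the page cache.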

Where to get data sources API and classes

The basic interfaces and implementation classes of data sources are available in the GCP Maven repository. A tutorial on how to use Maven in GCP is also available. Use the following definition (you may need to change the version number as new versions become available):
<groupId>org.generationcp.pantheon_base</groupId>
<artifactId>pantheon-base-all</artifactId>
<version>2.0.0</version>
The primary source for the full project (with all source code) is the subversion repository at CropForge: https://svn.cropforge.org/svn/pantheon/Ceres/projects/PantheonBase. Once you have checked it out, create the jar files and javadoc API pages by calling Ant:
ant bootstrap
ant jars
ant docs
The jar files, created in the build/lib directory, are listed below (the jar files in the Maven repository follow the same distribution):

  • pantheon-base-all.jar: has everything
  • pantheon-base-core.jar: has only the few classes that are needed by both data sources and data consumers (there is a Tutorial on Data consumers)
  • pantheon-base-dataconsumer.jar: has only data consumers and transformers
  • pantheon-base-datasource.jar: has only data sources

Some of the other DataSource implementations noted above (i.e. GCPDataSource, AsynchronousDataSource and GCPAsynchronousDataSource), plus supporting classes, are available in full in the Ceres/plugins/org.generationcp.ceres.datasource.util module in CropForge, and also in the GCP Maven Repository as:

<groupId>org.generationcp.ceres.datasource</groupId>
<artifactId>ceres-datasource-util</artifactId>
<version>1.0</version>
 
Eclipse tip
If you are planning to use Pantheon within the Equinox component based framework, we suggest you use plug-in versions of these jars. The latest plug-ins for Pantheon are available on the GCP update site. For help in using the update site, please follow the instructions on the update site. For general guidelines on how to use Pantheon within Equinox, there is a Tutorial: Developing with Pantheon Eclipse.
 
Last Updated ( Monday, 16 November 2009 )