Tutorial: GCP Data Sources
Written by Martin Senger
Sunday, 17 February 2008
The GCP concept of data sources is crucial for accessing and sharing various data in various applications. This tutorial explains in detail how to create and use data sources.
In January 2007, the DataSource interface was slightly changed, maturing from version 1.0.0 to 2.0.0. The changes are not dramatic, but they require some changes in existing data source implementations. Please check the new API and read the changes.
In April 2008 and in 2009, there were a few more changes in PantheonBase, mostly in the DataConsumer interface and in the ontology. The DataSource interface remained almost untouched. Please check the new API and see the ChangeLog.
A data source represents any source of GCP data: it can be a database, a collection of Biomoby services, a flat file, or anything else. Each implementation of a data source is a simple abstraction that gives access both to the metadata about the data source and to the data it holds.
The basic rule is: any data resource, before it can be used within or by GCP, must be represented by a Java implementation of the DataSource interface.
Let's first introduce the Java interface itself.
Java API for data sources
The gory details are in the Java API. Here are a few hints on where to look first and how the API parts are connected. Code examples showing how to use the API are in the next chapter.
org.generationcp.core.datasource.DataSource
This is the center of the GCP data source concept. Every data provider has to implement this interface in order to have their data recognized by GCP applications.
Very often, one data source will be implemented by several different classes (implementations). Typically, one implementation will use direct access (JDBC or an equivalent, e.g. Hibernate) to the underlying database, and will be used locally, on the machines that have such access. For example, a web application (producing a web interface/pages) running inside Tomcat will get most of its data using this kind of data source implementation (see the left picture).
Another typical implementation will use Web Services. As with any Web Services, there will be two parts: a client part and a service part. The service part will run on the machine where the databases are. It is like the web application described in the previous paragraph: it will access data using the local implementation of this data source. The client part will again be a Java class implementing the DataSource interface, but instead of calling JDBC directly, it will use the Web Service protocol to get data from the service part (see the right picture).
The DataSource interface is the only mandatory API to implement. It has a few other parts, whose details are described in the API. See the next section on how to use them.
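The two kinds of implementation described above can be sketched as follows. This is only an illustration under stated assumptions: RecordSource is a hypothetical, much simplified stand-in for the real DataSource interface, an in-memory list stands in for JDBC access, and a plain function stands in for the Web Service protocol.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the DataSource interface.
interface RecordSource {
    List<String> find(String text);
}

// Local implementation: an in-memory list stands in for JDBC access.
final class LocalRecordSource implements RecordSource {
    private final List<String> rows;

    LocalRecordSource(List<String> rows) {
        this.rows = new ArrayList<>(rows);
    }

    @Override
    public List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (row.contains(text)) {
                hits.add(row);
            }
        }
        return hits;
    }
}

// Client implementation: forwards each call over a transport function,
// which stands in for the Web Service protocol (an assumption of this sketch).
final class RemoteRecordSource implements RecordSource {
    private final Function<String, List<String>> transport;

    RemoteRecordSource(Function<String, List<String>> transport) {
        this.transport = transport;
    }

    @Override
    public List<String> find(String text) {
        return transport.apply(text);
    }
}

final class TwoImplementationsDemo {
    static List<String> run() {
        // The service side answers requests with the local implementation.
        RecordSource local = new LocalRecordSource(Arrays.asList("Oryza sativa", "Zea mays"));
        // The client side implements the same interface but delegates remotely.
        RecordSource client = new RemoteRecordSource(local::find);
        return client.find("Oryza");
    }
}
```

The caller sees the same interface in both cases, which is the point of having several implementations of one data source.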
How to discover and instantiate data sources
The discovery and creation of instances of data sources is the responsibility of an implementation, and it can be done in many different ways. Two ways, however, are recommended, and there are classes to support them. This section describes both of them with examples. The examples shown below are available from the Pantheon/Ceres/projects/PantheonBase module.
This is optional but highly recommended (your project will be easier to maintain if it follows these recommendations).
Discovering Data Sources directly
Put the names of all classes that implement the org.generationcp.core.datasource.DataSource interface into a file named after the interface itself, and put this file in this directory structure:
    META-INF
Then create a jar file keeping this structure, and put it on your CLASSPATH. The PantheonBase module has this structure in src/etc/spi/datasources, and the jar file for the sample DataSource can be created by the spi target of the Ant build.xml file found in PantheonBase:
    ant spi
Finally, here is the code to discover and instantiate the data sources listed in the org.generationcp.core.datasource.DataSource file:
    import org.generationcp.core.datasource.DataSource;
But don't use it. There is a better way: the DataSourceRegistry, a class representing a registry that maintains a list of the data sources available to the current application. The DataSourceRegistry should be accessed through a singleton instance obtained by calling instance(). This guarantees the same registry in all parts of your application without explicitly passing it between them. Here is how you should get the available data sources:
    import org.generationcp.core.datasource.DataSourceRegistry;
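The idea of a singleton registry filled by SPI discovery can be sketched with the JDK's own java.util.ServiceLoader (Java 6 and later), which reads provider files named after an interface. This is not the PantheonBase implementation: DiscoverableSource and SourceRegistry are hypothetical names used only for illustration, and only the instance() accessor mirrors what the tutorial describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for the DataSource interface.
interface DiscoverableSource {
    String getName();
}

// Hypothetical registry: one shared instance, filled once by SPI discovery.
final class SourceRegistry {
    private static final SourceRegistry INSTANCE = new SourceRegistry();

    private final List<DiscoverableSource> sources = new ArrayList<>();

    private SourceRegistry() {
        // ServiceLoader looks for provider files named after the interface
        // on the classpath and instantiates every class listed there.
        // Each listed class needs a public no-argument constructor.
        for (DiscoverableSource source : ServiceLoader.load(DiscoverableSource.class)) {
            sources.add(source);
        }
    }

    // Every caller gets the same registry without passing it around.
    static SourceRegistry instance() {
        return INSTANCE;
    }

    List<DiscoverableSource> getSources() {
        return Collections.unmodifiableList(sources);
    }
}
```

If no provider file is on the classpath, the registry is simply empty; adding a jar with such a file makes its classes appear without any code change.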
There are a few points to remember:
Sometimes, however, this way of discovering data sources is not well suited (as noted below). In that case, go for factories...
Discovering Data Sources via their factories
Sometimes various data sources have almost identical implementations and differ only in a few parameters that would normally be given to them in a constructor. But the DataSourceRegistry approach to DataSource management described above permits only a no-argument (empty) constructor. For example (taken again from PantheonBase), a data source fetching documents or images from the web needs to specify a URL from which to retrieve the documents, but otherwise it will be identical to other similar data sources. How can you register this kind of configuration constraint for a DataSource?
In such cases, you can use a DataSourceFactory. The same SPI discovery mechanism now discovers factories (instead of data sources). Once instantiated, each factory reads a configuration file where it finds which data sources to create and what values to pass to their constructors (now data sources can have regular constructors with a full set of parameters). The factory then returns all its data sources through the getDataSources() method.
Again, put the names of all classes that implement the org.generationcp.core.datasource.DataSourceFactory interface into a file named after the interface itself, and put this file in this directory structure:
    META-INF
Then create a jar file keeping this structure, and put it on your CLASSPATH.
The good thing is that the DataSourceRegistry can be used with factories as well. Here is what it does:
Therefore, the code is still the same:
    import org.generationcp.core.datasource.DataSourceRegistry;
The code is taken from src/main/org/generationcp/core/samples/MainList.java. You can run this example by calling (on Windows, use backslashes):
    ant build/run/run-list
The log file indicates that four data sources were loaded (three came via factories, and one directly). Note that the whole discovery and instantiation process took 46 milliseconds:
    2006-06-15 20:59:47,330 0 [main] INFO DataSourceRegistry - Loaded data source: Plant Images 1
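The factory idea from this section can be sketched as follows. It is a minimal illustration, not the PantheonBase code: UrlSource and UrlSourceFactory are hypothetical classes, the configuration is an inline string standing in for a configuration file, and only the method name getDataSources() is taken from the tutorial.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical data source that needs constructor arguments.
final class UrlSource {
    private final String name;
    private final String url;

    UrlSource(String name, String url) {
        this.name = name;
        this.url = url;
    }

    String getName() { return name; }
    String getUrl() { return url; }
}

// Hypothetical factory: it has a no-argument constructor, so it can be
// discovered like a plain data source, and it creates configured sources.
final class UrlSourceFactory {
    // Stands in for a configuration file (an assumption of this sketch).
    private static final String CONFIG =
        "images1=http://example.org/images/one\n"
        + "images2=http://example.org/images/two\n";

    UrlSourceFactory() { }

    List<UrlSource> getDataSources() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(CONFIG));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<UrlSource> sources = new ArrayList<>();
        // Sort the keys so that the order of the result is deterministic.
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            sources.add(new UrlSource(name, props.getProperty(name)));
        }
        return sources;
    }
}
```

One factory class thus yields as many data sources as the configuration lists, each with its own constructor values.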
Eclipse tip
If you are planning to use Pantheon within the Equinox component-based framework, we suggest you use the Equinox-specific version of the DataSourceRegistry, which is available in the Ceres/plugins/org.generationcp.pantheon.eclipse plug-in. This plug-in defines extension points for DataSource and DataSourceFactory. For more information, please refer to the Tutorial: Developing with Pantheon Eclipse.
How to use data sources
The Detailed Way
This section describes the detailed steps to specify a DataSource. The next section describes a shortcut for specifying data sources that only use the standard GCP Demeter models.
To show how to use data sources to find what data they provide and then to get the data, we will use an example that can be found in src/main/org/generationcp/core/samples/srs. There is a sample data source, DataSourceTaxonomy, that fetches data from an SRS server running at EBI. In its constructor, it first defines what data types it provides and what searchable attributes can be used to find such data. This is only an example, which is why it uses its own list of names for data types and attributes (from the file SamplesConstants). For real GCP data sources, you should generally use the special static final String constants for data type and data type attribute identification (ontology) from the org.generationcp.ontology interface library. These are structured in the following way. Assuming that ClassName is the name of a particular GCP data type interface class (e.g. SimpleIdentifier) and ATTRIBUTE is the name of some specific attribute of that data type (e.g. the NAME of a SimpleIdentifier), then:
In addition to the Demeter class-specific attribute constants defined above, some general-purpose attributes are defined in the LSID.java class. These attributes are generally not class-specific in their usage and semantics, so they are defined in this class instead.
    public class DataSourceTaxonomy implements DataSource, SamplesConstants {
Note that this taxonomic data source can return two data types: the full taxonomy record (DT_TAXONOMY) and just the taxonomy record's identifiers (DT_TAXONOMY_ID). Both can be filtered using the same set of searchable attributes. The data source indicates this by implementing the following methods:
public DataType[] getDataTypes() {
The most important method is find(): it searches for the real data and returns them. First of all, it checks that the caller asks for a data type it can provide:
    // can I provide the given data type?
Then it evaluates the given filters and converts them into an appropriate query language. The Taxonomy example uses the SRS Query Language (which is not that interesting for this tutorial; you can find it in the code yourself). What is interesting is the SearchFilter. Search filters are used to specify search criteria when a data source is asked for data. In the API, there is an example of two search filters that together define a question: "Find [publication] records published (date) after 2004 AND containing text 'GCP' in the title OR in keywords OR in abstract."
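The boolean structure of that question can be illustrated with java.util.function.Predicate. This sketch does not use the GCP SearchFilter class, whose real API is documented in the Java API pages; Publication and PublicationFilters are hypothetical types introduced only to show how one criterion is AND-ed with a group of OR-ed criteria.

```java
import java.util.function.Predicate;

// Hypothetical record type used only for this illustration.
final class Publication {
    final int year;
    final String title;
    final String keywords;
    final String abstractText;

    Publication(int year, String title, String keywords, String abstractText) {
        this.year = year;
        this.title = title;
        this.keywords = keywords;
        this.abstractText = abstractText;
    }
}

final class PublicationFilters {
    // First criterion: published after the given year.
    static Predicate<Publication> publishedAfter(int year) {
        return p -> p.year > year;
    }

    // Second criterion: text found in the title OR keywords OR abstract.
    static Predicate<Publication> mentions(String text) {
        return p -> p.title.contains(text)
                || p.keywords.contains(text)
                || p.abstractText.contains(text);
    }

    // Both criteria combined with AND, as in the question quoted above.
    static Predicate<Publication> publishedAfter2004MentioningGcp() {
        return publishedAfter(2004).and(mentions("GCP"));
    }
}
```

A data source would not evaluate such a predicate in memory; as described above, it translates the filters into the query language of its back end.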
Example
The Taxonomy example takes the following approach (which does not save memory but may save time): it always reads all IDs of the records matching the search and then, if asked for the full taxonomy records, it starts a background thread to fill the returned list. It does not wait for the list to be filled and returns it at once (half empty, or completely empty). Only when the caller asks for data from the list does the list itself wait until the data are there. This is done by extending java.util.AbstractSequentialList.
This approach is possible because the DataSource specification allows empty elements: a returned list may contain elements that are null. The caller should just ignore these elements. Their existence indicates that there are more data matching the search filters, but for some reason they cannot be returned.
This is how src/etc/org/generationcp/core/samples/srs/MainTaxonomy.java uses the find method on all available data sources:
    for (Iterator it = dataSources.iterator(); it.hasNext(); ) {
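Returning to the waiting list described at the start of this example, the following is a minimal sketch of a list that is returned at once and only blocks when the caller reads from it. It is not the code of the Taxonomy sample: WaitingList is a hypothetical class, it uses java.util.concurrent.CompletableFuture (Java 8) instead of a hand-written thread, and it simplifies by waiting for the whole load to finish rather than for individual elements.

```java
import java.util.AbstractSequentialList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.ListIterator;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical list that is filled in the background.
final class WaitingList<E> extends AbstractSequentialList<E> {
    private final CompletableFuture<List<E>> pending;

    WaitingList(Supplier<List<E>> loader) {
        // Start loading at once; the constructor itself does not block.
        this.pending = CompletableFuture.supplyAsync(loader);
    }

    @Override
    public ListIterator<E> listIterator(int index) {
        // join() blocks until the background load has finished.
        return Collections.unmodifiableList(pending.join()).listIterator(index);
    }

    @Override
    public int size() {
        return pending.join().size();
    }
}

final class WaitingListDemo {
    // Shows a caller that ignores null elements, as the specification allows.
    static List<String> nonNullRecords() {
        List<String> records =
            new WaitingList<String>(() -> Arrays.asList("rec-1", null, "rec-3"));
        List<String> usable = new ArrayList<>();
        for (String record : records) {
            if (record != null) {
                usable.add(record);
            }
        }
        return usable;
    }
}
```

Because get(), iterator() and the other read methods of AbstractSequentialList are built on listIterator(int) and size(), overriding these two is enough to make every read wait for the data.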
You can list all features of the MainTaxonomy program by typing:
    ant build/run/run-taxonomy -help
For example, to retrieve a list of taxonomy IDs for species whose names contain oryza, you can call (on Windows, replace slashes with backslashes and, importantly, change single quotes to double quotes):
    build/run/run-taxonomy -f org.generationcp.samples.attributes:taxon '*oryza*' -lid
(Well, that may not be all of them: the example returns only the first one hundred. Change the code if you want to use it for real.)
To get the full taxonomic records for the same search AND for the rank species (check the SRS server page for details on what 'rank' means), call (note that it combines two search criteria):
    build/run/run-taxonomy \
    van den Mooter and Swings 1990
AbstractDataSource
To help a little with some of the housekeeping of DataSource implementations, an AbstractDataSource class is available. It provides naming, storage of metadata, and management of the list of associated DataType definitions for a class implementing DataSource.
    package myDataSourcePackage ;
Another Level of GCP DataSource Implementation
The previous sections described in detail how to specify a DataSource from scratch. At times, this may become a bit tedious. Some additional DataSource classes are available that are intended to ease the burden of developing GCP-compliant data sources. The trick is to implement your DataSource adapter as a subclass of GCPDataSource, which extends AbstractDataSource by adding a GCP-compliant DataType registration function. You can use it to register the list of GCP data types you will use by Class, assuming that you have already imported them into your source file. (See the FAQs on registering Data Types and Searchable Attributes.) The following example does this for a DataSource serving Germplasm:
    package org.cropinfo.icis ;
That is all, folks!
You now simply need to implement the find method for this adapter, which expects to see (or does not worry if it sees) all the standard searchableAttributes of Germplasm, as defined in the GCP Germplasm model. The GCPDataSource superclass takes care of the housekeeping methods of DataSource for you (by extending AbstractDataSource, of course) and also specifies all the standard searchable attributes through some behind-the-scenes knowledge of the GCP domain model. If you need those searchable attributes within, say, the find method, you can access the DataType and its searchable attributes in the following simple manner:
    DataType germplasmDataType = this.getDataType(Germplasm.class.getName()) ;
This works because the GCPDataSource class uses the Germplasm class name as the unique identifier of the autogenerated DataType from Ceres corresponding to that class, which is duly initialized with all its searchable DataTypeAttributes. Thus, you just register all the standard GCP domain model interface classes you intend to serve in find as DataTypes, using the registerDataType method in your constructor, and then focus on the implementation details of the find method, using GCP domain model interface-compliant implementation classes (e.g. from Ceres).
Performance Issues and Design Considerations with the DataSource find() Method
Once you obtain the result data from your database (or from other sources, such as several combined calls to BioMoby Web Services), you return them as a java.util.List. Remember that java.util.List is a Java interface. This means that it may be implemented in diverse ways, as long as the semantics of the interface's methods are maintained.
The brute-force solution is to pull all the data over (from the database or using web services over the web) and load them into a monolithic ArrayList that is returned to the program using the DataSource. Think again! What happens if the user's query matches thousands of entries or more in the database? Such a query may easily fill the whole memory. Also, sometimes the caller may not need or use all the data at once, or may initially only wish to know how many data items matched, so you can consider retrieving data only when they are really needed (i.e. "just-in-time" or "lazy" data retrieval).
A wise strategy for a List implementation is to provide sensible proxy implementations for its methods. Consider the following ideas:
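As one illustration of such a proxy, the sketch below answers size() from a separate count call and fetches rows page by page only when get() is called. It is a sketch under stated assumptions, not part of the GCP API: LazyPagedList and PageLoader are hypothetical names, the count supplier stands in for something like a COUNT query, and the class is not thread-safe.

```java
import java.util.AbstractList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical callback standing in for a paged database query.
interface PageLoader<E> {
    List<E> load(int offset, int limit);
}

// Hypothetical lazy list: nothing is fetched until it is asked for.
final class LazyPagedList<E> extends AbstractList<E> {
    private final IntSupplier counter;
    private final PageLoader<E> loader;
    private final int pageSize;
    private final Map<Integer, List<E>> pages = new HashMap<>();
    private int size = -1; // -1 means "not counted yet"

    LazyPagedList(IntSupplier counter, PageLoader<E> loader, int pageSize) {
        this.counter = counter;
        this.loader = loader;
        this.pageSize = pageSize;
    }

    @Override
    public int size() {
        // Counting the hits does not require fetching any rows.
        if (size < 0) {
            size = counter.getAsInt();
        }
        return size;
    }

    @Override
    public E get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        int page = index / pageSize;
        // Fetch the page containing this index only on first access.
        List<E> rows = pages.computeIfAbsent(
            page, p -> loader.load(p * pageSize, pageSize));
        return rows.get(index % pageSize);
    }
}
```

A caller that only needs the number of hits triggers no row fetches at all, and a caller that reads a few elements pays only for the pages it touches. A production version would also bound the page cache to limit memory use.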
And so on... The general paradigm should be clear.
Where to get data sources API and classes
The basic interfaces and implementation classes of data sources are available in the GCP Maven repository. A tutorial on how to use Maven in GCP is also available. Use the following definition (you may need to change the version number as new versions become available):
    <groupId>org.generationcp.pantheon_base</groupId>
The primary source for the full project (with all source code) is the Subversion repository at CropForge: https://svn.cropforge.org/svn/pantheon/Ceres/projects/PantheonBase. Once you have checked it out, create the jar files and javadoc API pages by calling Ant:
    ant bootstrap
The jar files, created in the build/lib directory, are (the same distribution as described in this table applies to the jar files in the Maven repository):
Some of the other DataSource implementations noted above (i.e. GCPDataSource, AsynchronousDataSource and GCPAsynchronousDataSource), plus supporting classes, are fully available in the Ceres/plugins/org.generationcp.ceres.datasource.util module in CropForge, and also in the GCP Maven Repository as:
    <groupId>org.generationcp.ceres.datasource</groupId>
Eclipse tip
If you are planning to use Pantheon within the Equinox component-based framework, we suggest you use the plug-in versions of these jars. The latest plug-ins for Pantheon are available on the GCP update site. For help in using the update site, please follow the instructions on the update site. For general guidelines on how to use Pantheon within Equinox, there is a Tutorial: Developing with Pantheon Eclipse.
Last Updated ( Monday, 16 November 2009 )


