Tutorial: GCP Data Transformers
Written by Martin Senger   
Sunday, 17 February 2008


A Data Transformer is a special kind of Data Consumer. It shares the consumer's main functionality: it accepts data, does something with it, and reports what data it can accept. It extends the consumer in the following ways:
  1. It can do several different transformations with the same kind of input data.
  2. It can take additional parameters that control how a transformation is carried out.
One of the pillars of the data transformer is a new class, TransformationDef. It identifies a transformation (its ID, name, description and so on) and defines its acceptable input consumable type and a list of possible output consumable types. This means that the same transformation can turn one input type into several output types (one at a time), depending on the user-specified transformation parameters.

For example, a sequence analysis data transformer can transform a biological sequence to various formats by applying various analysis tools (transformations) on the same input sequence.

This tutorial will show how to write a simple (but fully equipped, providing several transformations) data transformer. But first things first...
 

Java API for data transformer


The gory details are in the Java API. Be aware, however, that a rich class AbstractDataTransformer, from which many data transformers (if not all) will inherit, has many important and useful methods available only for its sub-classes. These protected methods are not visible in the public API - therefore developers of data transformers are encouraged to dive into the source code (it is documented inside, don't worry).
 

How to discover and instantiate data transformers


The good news is that it is completely the same as with data consumers and data sources. The bad news is that, because of the good news, I am not going to repeat it here.

Read the data source tutorial and mentally replace every "source" there with "transformer".

Example: How to create a data transformer

The fully functional source for this example is part of the PantheonBase project.
We are going to develop a data transformer that operates on strings and does very simple things. After all, what matters is not what it does but how it becomes part of the Pantheon architecture.

The TextDataTransformer inherits from the AbstractDataTransformer which gives it all the "decoration" needed:
public class TextDataTransformer
extends AbstractDataTransformer {

public TextDataTransformer() {
super ("org.generationcp.samples.text.Strings."+
(++instanceCount),
"Sample Text Transformer",
null,
null);
init();
}
A data transformer has two main parts: the declarations and the business logic. First, usually in the init method, it declares which transformations are available and what data and parameters they accept and produce:

// what transformations I support
TransformationDef transTokenizer =
new TransformationDef (TID_TOKENIZER, "Text Tokenizer",
"It splits given text to tokens by given delimiter.")
.setInputType (String.class.getName())
.addOutputType (List.class.getName())
.addParameter
(new TransformationParameterDef (TEXT_PAR_DELIM, char.class)
.setDescription ("A character representing a token delimiter"))
.addParameter
(new TransformationParameterDef (TEXT_PAR_QUOTE, Boolean.class)
.setDescription ("Quoted parts are not tokenized"))
;

TransformationDef transStats =
new TransformationDef (TID_STATS,
"Text Statistics",
"It computes various statistics about the given text.")
.setInputType (String.class.getName())
.addOutputType
(Map.class.getName()) // return stats as properties
.addOutputType
(List.class.getName()) // return three elements as in UNIX's wc(1)
.addParameter
(new TransformationParameterDef (TEXT_PAR_AS_WC,
boolean.class).setDescription ("Result as by UNIX command 'wc'"))
;

// register supported transformations
setTransformationDefs (transTokenizer, transStats);
Note that the methods that add parameters and set input and output types return the transformation definition itself, so the calls can be chained, which perhaps also makes the code more readable (note the dots at the beginning of some lines).
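The chaining works because each setter returns the object it was called on. The following is only a minimal sketch of that pattern, not the Pantheon source: the class Def and the identifier "sample.statistics" are hypothetical stand-ins for TransformationDef and its ID, and types and parameters are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentDefSketch {

    // Hypothetical stand-in for TransformationDef; NOT the real Pantheon class.
    static class Def {
        final String id;
        String inputType;
        final List<String> outputTypes = new ArrayList<String>();
        final List<String> parameterNames = new ArrayList<String>();

        Def(String id) {
            this.id = id;
        }

        // Each mutator returns 'this', which is what makes chaining possible.
        Def setInputType(String type) {
            this.inputType = type;
            return this;
        }

        Def addOutputType(String type) {
            outputTypes.add(type);
            return this;
        }

        Def addParameter(String name) {
            parameterNames.add(name);
            return this;
        }
    }

    // Builds a definition with one input type and two possible output types.
    static Def example() {
        return new Def("sample.statistics")
            .setInputType(String.class.getName())
            .addOutputType(java.util.Map.class.getName())
            .addOutputType(List.class.getName())
            .addParameter("wc");
    }

    public static void main(String[] args) {
        Def def = example();
        System.out.println(def.id + ": " + def.inputType + " -> " + def.outputTypes);
    }
}
```

Because the calls mutate and return the same object, the order of the chained calls only matters where the lists preserve insertion order (here, the output types).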

In the code above, you may be missing the class ConsumableType and the ontology terms Context and Category. They are there (they must be, because any transformation defines its input and output types in terms of ConsumableTypes, and any consumable type defines itself by context and category ontology terms), but they are hidden in the convenience methods. For example, the setInputType method (if used with only one parameter, as above) assumes the default category Category.CLASS_NAME and the default context Context.ANALYSIS.

The transformation identifiers (the constants TID_TOKENIZER and TID_STATS above) are defined, in this case, in the class SamplesConstants but they could be taken from another ontology, or directly from somewhere else (the Soaplab data transformer, for example, can get them directly from the Soaplab metadata).

Once you have the transformations defined, you register them with the AbstractDataTransformer, as in the last line of the code above.

The PantheonBase samples are accompanied by a command-line test program, MainText, that instantiates the TextDataTransformer and shows its transformations (the same ones that were just defined above):
PantheonBase$ build/run/run-any-client org.generationcp.core.samples.text.MainText -l
Transformation definitions
--------------------------
TransformationDef@1bf52a5[
TransformationDef[ID=org.generationcp.samples.datatransformers.string.tokenizer,
Name=Text Tokenizer,Description=It splits given text to tokens by given delimiter.,
Classification={}]
Input type=ConsumableType[Context=Analysis,
Category=class/name,Value=java.lang.String,asCollection=true]
Output types={ConsumableType[Context=Analysis,Category=class/name,
Value=java.util.List,asCollection=true]}
Parameters={TransformationParameterDef[Name=delimiter,Type=char,Mandatory=false],
TransformationParameterDef[Name=quote,Type=java.lang.Boolean,Mandatory=false]}
]
TransformationDef@171732b[
TransformationDef[ID=org.generationcp.samples.datatransformers.string.statistics,
Name=Text Statistics,Description=It computes various statistics about the given text.,
Classification={}]
Input type=ConsumableType[Context=Analysis,Category=class/name,Value=java.lang.String,asCollection=true]
Output types={ConsumableType[Context=Analysis,Category=class/name,Value=java.util.Map,asCollection=true],
ConsumableType[Context=Analysis,Category=class/name,Value=java.util.List,asCollection=true]}
Parameters={TransformationParameterDef[Name=wc,Type=boolean,Mandatory=false]}
]
One point worth noting: for the TID_STATS transformation, the TextDataTransformer defines two possible output types, a Map and a List. Which one is returned depends on the boolean parameter TEXT_PAR_AS_WC (as you can see in the code shown later).
 

Methods consume and transform


The main method of any data transformer, as with data consumers, is the method consume. Here it has an extended signature that lets the caller specify which transformation to use and pass parameters for that transformation:
public String consume (PropertyChannel propertyChannel,
String dataPropertyName,
String transformationId,
Map parameters)
throws IllegalArgumentException, TransformationParameterException,
GCPException

This method has a lot to do before the real transformation starts. It has to check that the input is consistent with the definitions declared by this data transformer, and it has to deal with the fact that any data transformer (indeed, any data consumer) must accept data either as a list (Java class List) or as individual instances. If the input is a list, the transformation must be called either once on the whole list or once for each element of the list. The method also needs to deal with the property channel (getting data from it and putting the transformation results back into it).

This is a lot of work, and most of it is the same for most data transformers. Therefore, the AbstractDataTransformer itself implements the method, does all the boring work just mentioned, and then calls a protected method transform (once or several times, depending on the input type; more about that in a minute) that does the business logic. The transform method has a simplified signature, and your data transformer should override it (unless you override the whole consume method, of course):

protected Object transform (Object input,
TransformationDef transDef,
Map<String,Object> parameters)
throws IllegalArgumentException, TransformationParameterException,
GCPException {
return null;
}
The input contains data to be transformed, transDef defines what transformation to use, and parameters indicates how to transform the data.

It raises a TransformationParameterException if the parameters are wrong and an IllegalArgumentException if something is wrong with the other arguments. In most cases you do not need to worry about these here, because all the possible checks have usually been done already. If you need any special checks (e.g. that a parameter must be in a given range), put them into the parameter definition itself rather than here. Use the GCPException if you run into problems during the transformation itself (no network access, bad database connection, etc.).

Finally, you can see above that the default implementation returns null, which you definitely do not want. Your transform method should return the transformed output data.
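The division of labour between consume and transform is a template-method arrangement. The sketch below illustrates the idea only; it is not the Pantheon implementation. BaseTransformer, supports and UpperCaser are hypothetical names, the property channel is left out entirely, and the input is passed in directly.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

public class TemplateMethodSketch {

    // Hypothetical, much-simplified stand-in for AbstractDataTransformer.
    static abstract class BaseTransformer {

        // Public entry point: performs the generic checks, then delegates.
        public final Object consume(Object input, String transformationId,
                                    Map<String, Object> parameters) {
            if (input == null) {
                throw new IllegalArgumentException("No input data given");
            }
            if (!supports(transformationId)) {
                throw new IllegalArgumentException(
                    "Unknown transformation: " + transformationId);
            }
            // Subclasses never have to deal with a null parameter map.
            Map<String, Object> safe = (parameters == null)
                ? Collections.<String, Object>emptyMap()
                : parameters;
            return transform(input, transformationId, safe);
        }

        // Tells whether this transformer declared the given transformation.
        protected abstract boolean supports(String transformationId);

        // The business logic; subclasses override only this.
        protected abstract Object transform(Object input, String transformationId,
                                            Map<String, Object> parameters);
    }

    // A tiny subclass with a single transformation that upper-cases a string.
    static class UpperCaser extends BaseTransformer {
        @Override
        protected boolean supports(String transformationId) {
            return "upper".equals(transformationId);
        }

        @Override
        protected Object transform(Object input, String transformationId,
                                   Map<String, Object> parameters) {
            return ((String) input).toUpperCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        System.out.println(new UpperCaser().consume("pantheon", "upper", null));
    }
}
```

The subclass contains no checking code at all, which mirrors the point made above: by the time transform runs, the generic validation has already happened.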

Here are the contents of the transform method for our example, the business logic of the sample TextDataTransformer:

@Override 
protected Object transform (Object input,
TransformationDef transDef,
Map<String,Object> parameters)
throws IllegalArgumentException,TransformationParameterException,
GCPException {

String transId = transDef.getUniqueIdentifier();
if (TID_TOKENIZER.equals (transId)) {
return trTokenizer (input, parameters);
} else if (TID_STATS.equals (transId)) {
return trStats (input, parameters);
} else {
// should not come here
return input;
}
}

protected Object trTokenizer (Object input,
Map<String,Object> parameters) {

String value = (String)input;
StrMatcher delimiter
= StrMatcher.charMatcher (getChar (TEXT_PAR_DELIM,
' ', parameters));
StrTokenizer tokenizer = null;
if (getBoolean (TEXT_PAR_QUOTE, false, parameters))
tokenizer =
new StrTokenizer (value, delimiter, StrMatcher.quoteMatcher());
else
tokenizer =
new StrTokenizer (value, delimiter);
return tokenizer.getTokenList();
}

protected Object trStats (Object input,
Map<String,Object> parameters) {

String value = (String)input;
if (getBoolean (TEXT_PAR_AS_WC, false, parameters)) {
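// three elements, in the spirit of wc(1): lines, words, characters
return Arrays.asList (
StringUtils.countMatches (value, "\n"), // number of newlines
StringUtils.countMatches (value, " "), // number of spaces (a rough word-boundary count)
value.length());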
} else {
Map<String,String> stats = new HashMap<String,String> ();
stats.put ("LENGTH", "" + value.length());
return stats;
}
}

And, indeed, that is all you have to do in your data transformer: declare your transformations, register them, and do the transformation itself.
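The listing above reads its parameters through the helper calls getChar and getBoolean, each taking a parameter name, a default value and the parameter map. Their real implementations are not shown in this tutorial. The following sketch shows one way such helpers could behave; falling back to the default for missing values and parsing string values (as a command-line client might pass them) are both assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class ParameterHelperSketch {

    // Returns the named parameter as a boolean, or the default if it is absent.
    static boolean getBoolean(String name, boolean defaultValue,
                              Map<String, Object> parameters) {
        Object value = (parameters == null) ? null : parameters.get(name);
        if (value == null) {
            return defaultValue;
        }
        if (value instanceof Boolean) {
            return ((Boolean) value).booleanValue();
        }
        // Assumption: other values (e.g. strings) are parsed leniently.
        return Boolean.parseBoolean(value.toString());
    }

    // Returns the named parameter as a char, or the default if absent or empty.
    static char getChar(String name, char defaultValue,
                        Map<String, Object> parameters) {
        Object value = (parameters == null) ? null : parameters.get(name);
        if (value == null) {
            return defaultValue;
        }
        if (value instanceof Character) {
            return ((Character) value).charValue();
        }
        // Assumption: for a string value, its first character is used.
        String text = value.toString();
        return text.isEmpty() ? defaultValue : text.charAt(0);
    }

    public static void main(String[] args) {
        Map<String, Object> parameters = new HashMap<String, Object>();
        parameters.put("delimiter", ",");
        parameters.put("quote", Boolean.TRUE);

        System.out.println(getChar("delimiter", ' ', parameters));
        System.out.println(getBoolean("quote", false, parameters));
        System.out.println(getBoolean("wc", false, parameters));
    }
}
```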

There is only one small bit to explain:

Input and output policies


The data coming from a property channel may arrive as a List or as single instances. Depending on that and on the input policy, the method transform is called once or several times.

Similarly, for the output: should the resulting data be wrapped into a list, or not?

There are two (protected) classes (actually Java enums) defined in the AbstractDataTransformer enumerating possible policies:

protected enum InputPolicy  { ITERATE, ALWAYS_AS_LIST, AS_IT_IS }
protected enum OutputPolicy { ALWAYS_AS_LIST, AS_IT_IS }
The data transformer decides what policies to use. In our example:
setInputPolicy (InputPolicy.ITERATE);
setOutputPolicy (OutputPolicy.AS_IT_IS);
InputPolicy.ITERATE - the transform method is called once for each element of the input list, or just once if the input is not a list. This is the default input policy.

InputPolicy.ALWAYS_AS_LIST - the transform method is always called just once, and the data are always passed as a list (a single element is wrapped in a list).

InputPolicy.AS_IT_IS - the transform method is always called just once, and the data are passed, without any change, as they came from the property channel.

Note that setting this policy has almost nothing to do with the asCollection attribute of the input consumable data type. The attribute asCollection specifies whether a data transformer can accept only one element or several. But even one element can be delivered as a list (and that is controlled by the input policy).

The output policies are:
OutputPolicy.ALWAYS_AS_LIST - the data are wrapped in a list before being returned to the original caller, unless they are already of type List.

OutputPolicy.AS_IT_IS - leaves the data as they came from the transform method. This is the default output policy.

If a data transformer wishes to change or set the output policy depending on the actual resulting data, it can do so directly in the transform method (the output policy is consulted just after the transform method returns).
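The effect of the two policies can be sketched in a few lines. The code below is an illustration of the rules just described, not the code of AbstractDataTransformer: the enum constants are copied from the listing above, while the dispatch method and the Transform interface are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PolicySketch {

    enum InputPolicy { ITERATE, ALWAYS_AS_LIST, AS_IT_IS }
    enum OutputPolicy { ALWAYS_AS_LIST, AS_IT_IS }

    // Hypothetical stand-in for the protected transform method.
    interface Transform {
        Object apply(Object input);
    }

    // Applies the input policy, calls the transformation, applies the output policy.
    static Object dispatch(Object data, InputPolicy in, OutputPolicy out, Transform t) {
        Object result;
        switch (in) {
        case ITERATE:
            if (data instanceof List) {
                // One call per element; the results are collected in a list.
                List<Object> results = new ArrayList<Object>();
                for (Object element : (List<?>) data) {
                    results.add(t.apply(element));
                }
                result = results;
            } else {
                result = t.apply(data);
            }
            break;
        case ALWAYS_AS_LIST:
            // One call; a single element is wrapped in a list first.
            result = t.apply(data instanceof List
                             ? data : Collections.singletonList(data));
            break;
        default: // AS_IT_IS: one call, data passed unchanged
            result = t.apply(data);
        }
        // The output policy is consulted only after the transformation returned.
        if (out == OutputPolicy.ALWAYS_AS_LIST && !(result instanceof List)) {
            result = Collections.singletonList(result);
        }
        return result;
    }

    public static void main(String[] args) {
        Transform length = input -> input.toString().length();
        List<String> names = Arrays.asList("pantheon base", "Genomedium", "Koios");

        // ITERATE: three calls, so the result is a list of three lengths.
        System.out.println(dispatch(names, InputPolicy.ITERATE,
                                    OutputPolicy.AS_IT_IS, length));
        // A single value with the ALWAYS_AS_LIST output policy is wrapped.
        System.out.println(dispatch("Koios", InputPolicy.AS_IT_IS,
                                    OutputPolicy.ALWAYS_AS_LIST, length));
    }
}
```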

Using the command-line test program you can see how our TextDataTransformer behaves. The examples shown use the transformation TID_STATS, which also changes the type of its output depending on the parameter TEXT_PAR_AS_WC, represented here by the command-line option -wc.

Run it first without the -wc option (input data are specified by the argument -inp). The result is of type Map:
PantheonBase$ build/run/run-any-client org.generationcp.core.samples.text.MainText -tid 
org.generationcp.samples.datatransformers.string.statistics -inp 'pantheon base'
Calling transformation 'Text Statistics'
----------------------------------------
{LENGTH=13}
With the -wc option, the result is of type List:
PantheonBase$ build/run/run-any-client org.generationcp.core.samples.text.MainText -tid 
org.generationcp.samples.datatransformers.string.statistics -inp 'pantheon base' -wc
Calling transformation 'Text Statistics'
----------------------------------------
[0, 1, 13]
Now, what happens if the input is a list? We can simulate this situation by using bars in the -inp argument:
PantheonBase$ build/run/run-any-client org.generationcp.core.samples.text.MainText -tid 
org.generationcp.samples.datatransformers.string.statistics -inp 'pantheon base|Genomedium|Koios'
Calling transformation 'Text Statistics'
----------------------------------------
[{LENGTH=13}, {LENGTH=10}, {LENGTH=5}]
And we can see that the result is a list, with elements of type Map. This is because of the input policy ITERATE. When the input is iterated, the transform method is called more than once (in our case three times), and the result must obviously be a list (otherwise the results of the individual calls would be mixed up).

And this is what happens when the transformation itself also produces a list:
PantheonBase$ build/run/run-any-client org.generationcp.core.samples.text.MainText -tid 
org.generationcp.samples.datatransformers.string.statistics -inp 'pantheon base|Genomedium|Koios' -wc
Calling transformation 'Text Statistics'
----------------------------------------
[[0, 1, 13], [0, 0, 10], [0, 0, 5]]
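The two list-input runs above can be reproduced with plain Java, without Pantheon or any third-party library. This is only a sketch under assumptions: the class StatsRunSketch and its methods are hypothetical, the bar-separated argument is split by hand to imitate the ITERATE input policy, and the first wc-style element is taken to be a newline count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsRunSketch {

    // Counts how many times the character c occurs in the text.
    static int count(String text, char c) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    // Mirrors the idea of trStats: a three-element list when asWc is set,
    // otherwise a map with a single LENGTH entry.
    static Object stats(String value, boolean asWc) {
        if (asWc) {
            return Arrays.asList(count(value, '\n'),   // newlines (assumed intent)
                                 count(value, ' '),    // spaces
                                 value.length());      // characters
        }
        Map<String, String> stats = new HashMap<String, String>();
        stats.put("LENGTH", String.valueOf(value.length()));
        return stats;
    }

    // Imitates the ITERATE input policy on a bar-separated -inp argument:
    // one call per element, results collected in a list.
    static List<Object> run(String inp, boolean asWc) {
        List<Object> results = new ArrayList<Object>();
        for (String element : inp.split("\\|")) {
            results.add(stats(element, asWc));
        }
        return results;
    }

    public static void main(String[] args) {
        String inp = "pantheon base|Genomedium|Koios";
        System.out.println(run(inp, false)); // [{LENGTH=13}, {LENGTH=10}, {LENGTH=5}]
        System.out.println(run(inp, true));  // [[0, 1, 13], [0, 0, 10], [0, 0, 5]]
    }
}
```

Note that the sketch always returns a list, so it models only the list-input case; for a single input the real program prints the bare Map or List, as shown in the first two runs.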

Data Consumer or Data transformer?


Because the data transformer is an extended data consumer, they share some features. When is it more appropriate to develop a data consumer, and when a data transformer? Here are a few hints:
  • When you need to provide more types of outputs (for the same input), implement a data transformer. Remember that the data consumer's method canConsume:
    ConsumableType canConsume (ConsumableType consumableType);
    binds together just one input consumable type with one output consumable type (there may be more such pairs, but they are always pairs). A similar method in the data transformer:
    TransformationDef[] canTransform (ConsumableType type) 
    binds together an input consumable type with potentially more output types (because each returned transformation definition can have more output types).

    Of course, even in a data transformer, you can use the canConsume method. But what should such a method return if a data transformer has more than one possible output? The AbstractDataTransformer takes care of it (when you register your transformations by calling the method addTransformationDef and/or setTransformationDefs): if the output types are ambiguous, canConsume returns an UNKNOWN_BUT_EXISTING_CONSUMABLE_TYPE consumable type (as defined in the class ConsumableType).
  • When you need to pass additional parameters to the transformation, you definitely need to implement a data transformer.
     
  • What happened to the original consume method from the data consumer when I implemented the new one in a data transformer? Well, it is still there, implemented by the AbstractDataTransformer - so you do not need to do it yourself - but its implementation only checks the parameters for correctness and raises an exception:
    public String consume (PropertyChannel propertyChannel,
    String dataPropertyName,
    ConsumableType consumableType)
    throws IllegalArgumentException,
    GCPException {
    checkConsume
    (propertyChannel, dataPropertyName, consumableType);
    throw new IllegalArgumentException
    ("This data transformer does not implement this simple consume() method. " +
    "Please use the other one where you can specify which transformation to use.");
    }
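The first hint above says that canConsume falls back to an "unknown but existing" type when the registered transformations offer more than one output type for an input. A minimal sketch of that decision could look as follows. It is not taken from AbstractDataTransformer: types are reduced to class-name strings, Def and sampleDefs are hypothetical, and returning null for an input that cannot be consumed at all is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CanConsumeSketch {

    // Placeholder for ConsumableType.UNKNOWN_BUT_EXISTING_CONSUMABLE_TYPE.
    static final String UNKNOWN = "UNKNOWN_BUT_EXISTING";

    // Hypothetical stand-in for TransformationDef: one input, several outputs.
    static class Def {
        final String inputType;
        final List<String> outputTypes;

        Def(String inputType, String... outputTypes) {
            this.inputType = inputType;
            this.outputTypes = Arrays.asList(outputTypes);
        }
    }

    // Returns the single possible output type, UNKNOWN if there are several,
    // or null (an assumption) if no registered transformation accepts the input.
    static String canConsume(String inputType, List<Def> defs) {
        Set<String> outputs = new LinkedHashSet<String>();
        for (Def def : defs) {
            if (def.inputType.equals(inputType)) {
                outputs.addAll(def.outputTypes);
            }
        }
        if (outputs.isEmpty()) {
            return null;
        }
        return (outputs.size() == 1) ? outputs.iterator().next() : UNKNOWN;
    }

    // The two transformations of the sample TextDataTransformer, simplified.
    static List<Def> sampleDefs() {
        List<Def> defs = new ArrayList<Def>();
        defs.add(new Def("java.lang.String", "java.util.List"));                  // tokenizer
        defs.add(new Def("java.lang.String", "java.util.Map", "java.util.List")); // statistics
        return defs;
    }

    public static void main(String[] args) {
        // Two distinct output types (List and Map) are possible for a String input.
        System.out.println(canConsume("java.lang.String", sampleDefs()));
    }
}
```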

 

Where to get data sources API and classes


Please check the tutorial for data sources; the places and instructions are identical.

 

Last Updated ( Monday, 16 November 2009 )