GeoTools : Dry Run at DataAccess Story


Over the years a number of efforts have tried to provide complex feature support, with varying levels of success. They led to the definition of the new GeoAPI interfaces and to an implementation that has been sitting in an unsupported module for some time now.

The GeoTools data access layer has moved from its own Feature/Type system to the GeoAPI SimpleFeature/Type interfaces.
GeoAPI SimpleFeature/Type extend the more generic Feature/Type interfaces and add the convenient assumption that the Feature structure is flat.

Now we need to be able to deal with complex features, which generally come from WFS instances such as the USGS Framework Web Feature Services, and to finally give GeoServer a transition path to serving complex features.

And we need to do it in a way that leverages all the knowledge invested over the years in the current GeoTools infrastructure and allows a smooth integration of Feature, both for new code built upon it and for existing code (inside the library and in client applications) that wants to transition to supporting complex Features.

I'm trying to figure out a way to introduce Feature smoothly, without detriment to all the work spent on SimpleFeature up to now. That is, enable complex feature support from now on without even needing to deprecate the SimpleFeature stuff, as that is working well and serving a bunch of use cases, while letting new developments use both the new complex-capable data stores and the existing simple-feature ones through the more generic Feature/FeatureType.

The big blocker is the lack of an appropriate data access API. All the GeoTools code that deals with features is set up in terms of SimpleFeature and its assumptions, and so is the current DataStore API. During the work on the community-schema modules, the main glitches found with the DataStore API were:

  • String is not enough to represent a FeatureType name; a qualified name is needed
  • TypeName and Feature name are not the same thing, though assuming they are is a common mistake for SimpleFeatures

Goals / API


At a Source level: identification should be handled with a qualified name rather than a "TypeName". We are getting too many collisions. For the GeoAPI Feature to be located correctly we need to get rid of the idea that TypeName == FeatureName (e.g., "topp:states" != "topp:states_Type"; they may even be in different namespaces!).
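
As a rough illustration of why a plain String is not enough, the sketch below keys a registry by qualified name. It uses the JDK's `javax.xml.namespace.QName` purely as a stand-in for whatever Name type the API ends up with, and the namespace URIs are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

class QualifiedNameSketch {
    // Hypothetical namespace URIs, for illustration only.
    static final String TOPP = "http://example.org/topp";
    static final String OTHER = "http://example.org/other";

    /** Registers names that a plain local-name String key could not tell apart. */
    static Map<QName, String> registry() {
        Map<QName, String> names = new HashMap<>();
        names.put(new QName(TOPP, "states"), "feature name");
        names.put(new QName(TOPP, "states_Type"), "type name");
        names.put(new QName(OTHER, "states"), "feature name in another namespace");
        return names;
    }

    public static void main(String[] args) {
        Map<QName, String> names = registry();
        // Keyed by the local name "states" alone, two of the entries would collide.
        System.out.println(names.size());
        System.out.println(names.get(new QName(TOPP, "states_Type")));
    }
}
```

The two "states" entries stay distinct only because the namespace is part of the key, and the type name is registered separately from the feature name.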

SimpleFeature/Type backwards compatibility, out of the box usage as the general case

We want current code to stay backwards compatible. This means the library won't break all the code written with the SimpleFeature assumptions. And we want all the format drivers written in simple terms to be usable as the more general Feature/Type out of the box.


There was a good debate; see the following email thread.

Basically, in terms of approaches to cover the above goals, we came up with the possibilities below.
The debate led to people asking for actual code examples, so the first three were spiked:

  1. Generics + DataStore superclass
  2. Generics + FeatureSource Hierarchy
  3. No generics
  4. Hybrid
    (no spike, see below)

Generics + DataStore superclass

Introduce a superclass for DataStore and parametrize FeatureSource, FeatureCollection, etc., based on the type and content each serves (<FeatureType, Feature> vs. <SimpleFeatureType, SimpleFeature>).

1) cleaner abstract API
2) does not incur naming conflicts (no need to find good names beyond the
ones already in use and a superclass for DataStore)
3) introduces only one more interface, a pull up of DataStore. This allows
for the separation of generic Feature capable datastores and
SimpleFeature-only capable ones.
4) will break some existing client code when upgrading to GeoTools 2.5 due to
the type erasure of generics. Yet, it is easily fixable with a regular
expression search and replace.
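
A minimal sketch of this first approach follows, under heavy assumptions: the feature types are empty placeholders, `DataAccess` is borrowed from the text as the name of the pulled-up superclass, and a String type name is kept only for brevity. It is not the spiked code.

```java
import java.util.Collections;
import java.util.List;

// Placeholder stand-ins for the GeoAPI types (assumption: heavily simplified).
interface FeatureType { String getName(); }
interface Feature { FeatureType getType(); }
interface SimpleFeatureType extends FeatureType {}
interface SimpleFeature extends Feature {}

// FeatureSource is parametrized on the type and content it serves.
interface FeatureSource<T extends FeatureType, F extends Feature> {
    T getSchema();
    List<F> getFeatures();
}

// The pulled-up superclass of DataStore: the only new interface in this approach.
interface DataAccess<T extends FeatureType, F extends Feature> {
    FeatureSource<T, F> getFeatureSource(String typeName);
}

// DataStore keeps its name and binds the parameters to the simple case.
interface DataStore extends DataAccess<SimpleFeatureType, SimpleFeature> {}

class GenericsSuperclassDemo {
    /** Client code written once against the generic API. */
    static <T extends FeatureType, F extends Feature> String describe(
            DataAccess<T, F> access, String typeName) {
        FeatureSource<T, F> source = access.getFeatureSource(typeName);
        return source.getSchema().getName() + ":" + source.getFeatures().size();
    }

    /** A trivial SimpleFeature-only store with no content. */
    static DataStore emptyStore() {
        return new DataStore() {
            @Override
            public FeatureSource<SimpleFeatureType, SimpleFeature> getFeatureSource(
                    final String typeName) {
                return new FeatureSource<SimpleFeatureType, SimpleFeature>() {
                    @Override public SimpleFeatureType getSchema() {
                        return new SimpleFeatureType() {
                            @Override public String getName() { return typeName; }
                        };
                    }
                    @Override public List<SimpleFeature> getFeatures() {
                        return Collections.emptyList();
                    }
                };
            }
        };
    }
}
```

Here `describe` accepts a simple-feature `DataStore` as well as any other `DataAccess`, which is the out-of-the-box reuse the goals ask for. The cost is that type parameters show up in every client declaration of `FeatureSource`.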

Generics + FeatureSource Hierarchy

Introduce a full hierarchy by pulling up parametrized versions of DataStore, FeatureSource and FeatureCollection, but keep the current interfaces with no generics (i.e., DataStore extends DataAccess<SimpleFeatureType, SimpleFeature>, FeatureSource extends Source<SimpleFeatureType, SimpleFeature>, etc.).

1) cleaner concrete API for the DataStore case (i.e., SimpleFeature)
2) leads to naming hell, since the good names are already taken
3) introduces a full layer of abstraction over DataStore,
FeatureSource/Store/Locking, FeatureReader/Writer, FeatureCollection
4) Breaks less/no existing code. Only requires the addition of Name vs.
String (i.e. DataStore.getNames():List<Name> vs
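
For comparison, the second approach could look roughly like this. Again the feature types are empty placeholders; `Source` and `DataAccess` are the names used in the text, and everything else is an assumption made for the example.

```java
import java.util.Collections;
import java.util.List;

// Placeholder stand-ins for the GeoAPI types (assumption: heavily simplified).
interface FeatureType {}
interface Feature {}
interface SimpleFeatureType extends FeatureType {}
interface SimpleFeature extends Feature {}

// New parametrized layer, pulled up above the existing interfaces.
interface Source<T extends FeatureType, F extends Feature> {
    List<F> getFeatures();
}
interface DataAccess<T extends FeatureType, F extends Feature> {
    Source<T, F> getFeatureSource(String typeName);
}

// Existing interfaces stay free of generics by binding the parameters once.
interface FeatureSource extends Source<SimpleFeatureType, SimpleFeature> {}
interface DataStore extends DataAccess<SimpleFeatureType, SimpleFeature> {
    @Override
    FeatureSource getFeatureSource(String typeName); // narrowed return type
}

class HierarchyDemo {
    /** Existing client code: no type parameters in sight. */
    static int count(DataStore store, String typeName) {
        FeatureSource source = store.getFeatureSource(typeName);
        List<SimpleFeature> features = source.getFeatures();
        return features.size();
    }

    /** A trivial store with no content, just to exercise the signatures. */
    static DataStore emptyStore() {
        return new DataStore() {
            @Override public FeatureSource getFeatureSource(String typeName) {
                return new FeatureSource() {
                    @Override public List<SimpleFeature> getFeatures() {
                        return Collections.emptyList();
                    }
                };
            }
        };
    }
}
```

Client code of `DataStore` reads exactly as before, but every generic interface needs a second name, which is where the naming trouble in point 2 comes from.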


No generics

Introduce a full hierarchy by pulling up non-parametrized versions of DataStore, FeatureSource and FeatureCollection, and keep the current interfaces as they are, relying on Java 5 return type narrowing (covariant return types) to specialize the return types for the SimpleFeature/Type case, and pay the cost of new overloaded methods in the subclasses.

ie, DataStore implementations get two new methods each, which are overloaded versions of the originals:

interface DataStore {
 /** @since 2.4 */
 public void createSchema(SimpleFeatureType featureType);
 /** @since 2.5 */
 public void createSchema(FeatureType featureType);

 /** @since 2.4 */
 public void updateSchema(String typeName, SimpleFeatureType featureType);
 /** @since 2.5 */
 public void updateSchema(String typeName, FeatureType featureType);
}

interface FeatureStore {
 /** @since 2.5 */
 public Set<FeatureId> addFeatures(FeatureCollection collection);
 /** @since 2.4 */
 public Set<FeatureId> addFeatures(SimpleFeatureCollection collection);

 /** @since 2.4 */
 public void setFeatures(FeatureReader reader);
 /** @since 2.5 */
 public void setFeatures(Reader reader);
}
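
The self-contained sketch below shows both mechanisms this approach leans on: a narrowed return type and an overloaded method. The super interfaces `Source` and `DataAccess` and the `RecordingStore` class are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder stand-ins for the GeoAPI types (assumption: heavily simplified).
interface FeatureType { String getName(); }
interface SimpleFeatureType extends FeatureType {}

// Hypothetical non-parametrized super interfaces pulled up above the current ones.
interface Source { FeatureType getSchema(); }
interface DataAccess { void createSchema(FeatureType featureType); }

// The current interfaces narrow return types and add overloads for the simple case.
interface FeatureSource extends Source {
    @Override SimpleFeatureType getSchema(); // covariant return type (Java 5+)
}
interface DataStore extends DataAccess {
    void createSchema(SimpleFeatureType featureType); // an overload, not an override
}

// Records which overload was invoked.
class RecordingStore implements DataStore {
    final List<String> calls = new ArrayList<>();
    @Override public void createSchema(FeatureType featureType) {
        calls.add("generic:" + featureType.getName());
    }
    @Override public void createSchema(SimpleFeatureType featureType) {
        calls.add("simple:" + featureType.getName());
    }
}

class NoGenericsDemo {
    static List<String> run() {
        SimpleFeatureType roads = new SimpleFeatureType() {
            @Override public String getName() { return "roads"; }
        };
        FeatureType sameObject = roads;
        RecordingStore store = new RecordingStore();
        store.createSchema(roads);      // static type selects the SimpleFeatureType overload
        store.createSchema(sameObject); // same object, but the FeatureType overload runs
        return store.calls;
    }
}
```

`run()` returns `[simple:roads, generic:roads]`: the overload is chosen from the compile-time type of the argument, so implementations have to keep both methods consistent by hand. That is the cost this approach pays for avoiding generics.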


Hybrid

In this approach we try for a hybrid of (1) and (2) above.

  • Use a simple super class (one or two, it does not matter)
  • Use generics only for the Query objects (and any other method parameters we need)

Original idea: if we use one super class:

  • Data<FQ extends FeatureQuery>: is for generic Feature
  • DataStore extends Data<Query>: is for SimpleFeature

The idea is nice; the problem is that once you get to code you find yourself forced to parametrize, at least, Query, FeatureCollection and FeatureReader. The last two would ideally be present only in FeatureStore, but they are forced to propagate back to FeatureSource and DataStore to avoid type safety warnings.
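
A small sketch of where the hybrid starts to hurt, assuming only the query type is parametrized. `Data`, `FeatureQuery` and `Query` follow the names in the list above; the method signatures and the demo class are invented for illustration.

```java
import java.util.Collections;
import java.util.List;

// Placeholder stand-ins for the GeoAPI types (assumption: heavily simplified).
interface Feature {}
interface SimpleFeature extends Feature { Object getAttribute(int index); }

// Hypothetical query hierarchy: only the query type is parametrized.
interface FeatureQuery { String getTypeName(); }
class Query implements FeatureQuery {
    private final String typeName;
    Query(String typeName) { this.typeName = typeName; }
    @Override public String getTypeName() { return typeName; }
}

interface Data<FQ extends FeatureQuery> {
    List<Feature> getFeatures(FQ query); // the content type is NOT parametrized
}
interface DataStore extends Data<Query> {}

class HybridDemo {
    static Object firstAttribute(DataStore store, Query query) {
        Feature feature = store.getFeatures(query).get(0);
        // The cast is the price of leaving the content type unparametrized.
        return ((SimpleFeature) feature).getAttribute(0);
    }

    /** A store that serves a single one-attribute feature, for demonstration. */
    static DataStore singleFeatureStore(final Object value) {
        return new DataStore() {
            @Override public List<Feature> getFeatures(Query query) {
                SimpleFeature feature = new SimpleFeature() {
                    @Override public Object getAttribute(int index) { return value; }
                };
                return Collections.<Feature>singletonList(feature);
            }
        };
    }
}
```

Removing the cast in `firstAttribute` means parametrizing the returned collection as well, and from there the type parameters spread to the other interfaces.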



IMHO, Source needs to be queried by a Query, not only by a Filter.
Since we're trying to support a wider number of workflows, the data access API needs to become richer in functionality, not only in level of abstraction.
Some examples of functionality not yet addressed:

  • uDig needs paged access to content for TableView
  • CSW needs paged access, or asking only for hits
  • A very common need is to get the identifiers of the matching objects, not the objects themselves (this may concern Andrea for his versioning stuff, and concerns me about implementing spatial operations in uDig)

So those are just examples, from which I think we can get Source closer to the CSW Discovery service.
First step to support those scenarios would be to have Source.content(Query).
Then Query can be extended for the different specific Source specializations (like adding reprojectCRS for FeatureSource), but at the minimum it should support:

interface Query{
 ResultType getResultType();//HITS, IDENTIFIERS, CONTENT
 Filter getPredicate();
 List<SortBy> getSortOrder();
 List<Name> getResponseElements(); //which properties to retrieve
 int getCursorPosition(); //start index to retrieve, support for paging
 int getIteratorSize();   //how many objects to retrieve (page size)
}

interface QueryResponse{
 int getHits(); //number of matching records
 int getCursorPosition(); //start index of retrieved content (< hits)
 /**
  * If Query.resultType == HITS: empty
  * If Query.resultType == IDENTIFIERS: Collection<Identifier>
  * If Query.resultType == CONTENT: actual objects depending on concrete source type (i.e. Features, Metadata, etc)
  */
 Collection getRetrievedData();
}

interface Source{
 QueryResponse content(Query query);
}
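
To make the three result types and the paging fields concrete, here is a hedged in-memory sketch. It simplifies the interfaces above: a `java.util.function.Predicate` stands in for Filter, sorting and property selection are left out, and `MapSource` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

enum ResultType { HITS, IDENTIFIERS, CONTENT }

// Simplified Query: a Predicate stands in for Filter (assumption).
final class Query {
    final ResultType resultType;
    final Predicate<String> predicate;
    final int cursorPosition; // start index to retrieve
    final int iteratorSize;   // page size
    Query(ResultType resultType, Predicate<String> predicate,
          int cursorPosition, int iteratorSize) {
        this.resultType = resultType;
        this.predicate = predicate;
        this.cursorPosition = cursorPosition;
        this.iteratorSize = iteratorSize;
    }
}

final class QueryResponse {
    final int hits;
    final int cursorPosition;
    final Collection<String> retrievedData;
    QueryResponse(int hits, int cursorPosition, Collection<String> retrievedData) {
        this.hits = hits;
        this.cursorPosition = cursorPosition;
        this.retrievedData = retrievedData;
    }
}

/** Hypothetical in-memory source: record identifier -> record content. */
class MapSource {
    private final Map<String, String> records;
    MapSource(Map<String, String> records) { this.records = new LinkedHashMap<>(records); }

    QueryResponse content(Query query) {
        List<String> ids = new ArrayList<>();
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> entry : records.entrySet()) {
            if (query.predicate.test(entry.getValue())) {
                ids.add(entry.getKey());
                values.add(entry.getValue());
            }
        }
        int hits = ids.size();
        if (query.resultType == ResultType.HITS) {
            return new QueryResponse(hits, query.cursorPosition, Collections.<String>emptyList());
        }
        // Clamp the requested page to the matching range.
        int from = Math.max(0, Math.min(query.cursorPosition, hits));
        int to = (int) Math.min((long) from + Math.max(0, query.iteratorSize), hits);
        List<String> all = query.resultType == ResultType.IDENTIFIERS ? ids : values;
        return new QueryResponse(hits, from, new ArrayList<>(all.subList(from, to)));
    }
}
```

A real source would push the predicate, the count and the page window down to its back end instead of filtering in memory; the sketch only shows the shape of the contract.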
Posted by groldan at Dec 22, 2006 05:38

Good thinking Gabriel - Query looks reasonable, however the Open Web Services standard may be an easier place to start from (rather than CAT 2.0). I would really like to make sure we keep things as simple as possible.

Query (+1) but it scares me, cursor position and iterator size scream out to be a "page"
QueryResponse (-1) we have too much overlap here with Collection

  • QueryResponse.getHits() --> Collection.size()
  • QueryResponse.getCursorPosition() --> strange take on "iterator vs cursor"
  • QueryResponse.getRetrievedData() --> this

I would like to see how other libraries handle large collections, I would also rather take the concepts of cursor, iterator size and cursor position and come up with something that does not appear strange to the normal java developer.

One idea:

  • PagedCollection extends Collection
    • iterator() works like normal over all "paged" content
    • pageSet() sorted set of pages
  • Page extends Collection
    • iterator() limited to the contents on that page
    • page.prev() allows navigation between pages etc...

pageSet().size() takes care of the getHits()/getIteratorSize() calculation etc...
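
One possible reading of the idea, as a sketch only: the class below keeps the whole content in memory and uses an ordered `List` of sub-lists as a stand-in for the sorted set of `Page` objects, so the names and return types are assumptions rather than a proposed API.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class PagedCollection<T> extends AbstractCollection<T> {
    private final List<T> content;
    private final int pageSize;

    PagedCollection(List<T> content, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.content = new ArrayList<>(content);
        this.pageSize = pageSize;
    }

    /** Iterates over all content, ignoring page boundaries. */
    @Override public Iterator<T> iterator() {
        return Collections.unmodifiableList(content).iterator();
    }

    @Override public int size() {
        return content.size();
    }

    /** Pages in order; each page is a read-only view limited to its own slice. */
    List<List<T>> pageSet() {
        List<List<T>> pages = new ArrayList<>();
        for (int start = 0; start < content.size(); start += pageSize) {
            int end = Math.min(start + pageSize, content.size());
            pages.add(Collections.unmodifiableList(content.subList(start, end)));
        }
        return pages;
    }
}
```

With five elements and a page size of two, `pageSet()` holds three pages and the last one has a single element, while `size()` and `iterator()` still behave like any other Java collection.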

Posted by jive at Dec 22, 2006 15:13

In order to ask for only hits do the following:

Access.contents( filter ).size()

Direct, assume content is not fetched until needed, obvious place for implementors to optimize etc..
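
A sketch of how an implementor might honour that, assuming a hypothetical `Backend` that can count matches without shipping the content (think of a `COUNT` query). None of these names come from GeoTools.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical back end that can count matches without shipping the content.
interface Backend<T> {
    int count(Predicate<T> filter);
    List<T> fetch(Predicate<T> filter);
}

class LazyContents<T> extends AbstractCollection<T> {
    private final Backend<T> backend;
    private final Predicate<T> filter;
    private List<T> fetched; // stays null until someone iterates

    LazyContents(Backend<T> backend, Predicate<T> filter) {
        this.backend = backend;
        this.filter = filter;
    }

    /** Answers "hits" from the back end unless the content is already loaded. */
    @Override public int size() {
        return fetched != null ? fetched.size() : backend.count(filter);
    }

    /** Fetches the content on first iteration only. */
    @Override public Iterator<T> iterator() {
        if (fetched == null) {
            fetched = Collections.unmodifiableList(backend.fetch(filter));
        }
        return fetched.iterator();
    }
}

// In-memory back end that records how often content was actually fetched.
class ListBackend implements Backend<String> {
    private final List<String> rows;
    int fetchCalls = 0;
    ListBackend(List<String> rows) { this.rows = new ArrayList<>(rows); }

    @Override public int count(Predicate<String> filter) {
        int matches = 0;
        for (String row : rows) {
            if (filter.test(row)) matches++;
        }
        return matches;
    }

    @Override public List<String> fetch(Predicate<String> filter) {
        fetchCalls++;
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            if (filter.test(row)) result.add(row);
        }
        return result;
    }
}
```

Calling `size()` on a fresh `LazyContents` leaves `fetchCalls` at zero; the content is only pulled when the collection is iterated.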

Posted by jive at Dec 22, 2006 15:14

Martin finally gave some feedback - from the geoserver-devel list (so some of this feedback is GeoServer specific):

Andrea Aime wrote:
> > For AbstractGridCoverage2DReader, it would be nice to know exactly
> > what's not working for you and eventually fix it on trunk.

My issues are:

Configuration interface expects a File input source
I'm not sure where the restriction comes from, but we had to provide a dummy file
for every layer in order to get it working. GeoServer checks for file existence,
which makes it impossible to use if we don't give it a dummy file, even if that
file is totally ignored by our code. It makes configuration more tedious since
the client must browse through a large data directory for nothing. Note:
AbstractGridCoverage2DReader has an ImageInputStream attribute, which is not
applicable to a database connection. It should be split into a really abstract
superclass, with specialized subclasses working on ImageInputStream only
when wanted.

Needs a GridFormatFactorySpi instance for each layer
We were trapped in a "one GridFormatFactorySpi == one 'LAYER' parameter value"
relationship, partially because of the inability to pass a non-File input in the
configuration interface. This relationship is not applicable to a database where
a single GridFormat serves an arbitrary number of layers. For every row added to
our "Layer" table in the database, we had to create a GridFormatFactorySpi
instance. In practice, it means that the client cannot add any layer without
writing a new class and recompiling the code. I guess that we could have
avoided this constraint with some configuration interface work, but we thought
that the GeoServer configuration was scheduled for refactoring, so we preferred
to wait for this work to be done.

Attributes that should be part of decoding process
AbstractGridCoverage2DReader contains a lot of attributes that should be part of
the decoding process, not properties of a CoverageReader (e.g. crs, envelope,
coverage name, num overviews, raster2model, originalGridRange and more...). I
understand that they may be convenient during the decoding process, in which
case they should not be in the public API. Because those attributes are public
(protected actually), they look as if they had to be provided at
construction time. Actually, experience suggests that AbstractGridFormat does
not work well in GeoServer if we don't provide at least the CRS and the Envelope
near construction time. Of course we don't have this information at construction
time in practice.

Lack of encapsulation
AbstractGridCoverage2DReader and its friends (AbstractGridFormat, etc.) expose
all their internal workings, totally and without any control. The above-cited
attributes should be private so that the implementation can make sure that:

  • They are cleared when the input source changes.
  • They are consistent (in the current state, absolutely nothing guarantees
    that those attributes have the expected dimensions, chain consistently
    ("grid range" --> "raster2model" --> "envelope" --> "crs"), etc.).

The current AbstractGridCoverage2DReader implementation does not apply
encapsulation principles, which increases the risk of bugs and reduces our
ability to change the code in the future without breaking compatibility. As a
side note, we have heard of users who abandoned GeoTools because the API is
changing too much. The lack of encapsulation in classes like
AbstractGridCoverage2DReader increases our exposure to this kind of situation.
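
The kind of encapsulation asked for here might look like the sketch below. `EncapsulatedReader` is a hypothetical class with a placeholder "decoding" step; it is not a proposal for the actual AbstractGridCoverage2DReader.

```java
class EncapsulatedReader {
    private String input;
    // Derived state is private, so nothing outside can leave it stale or inconsistent.
    private String crs;
    private double[] envelope;

    void setInput(String input) {
        this.input = input;
        crs = null;       // cleared whenever the input source changes
        envelope = null;
    }

    String getCrs() {
        decodeIfNeeded();
        return crs;
    }

    double[] getEnvelope() {
        decodeIfNeeded();
        return envelope.clone(); // defensive copy keeps callers from editing internal state
    }

    private void decodeIfNeeded() {
        if (input == null) {
            throw new IllegalStateException("no input set");
        }
        if (crs != null) {
            return; // already decoded for the current input
        }
        // Placeholder "decoding": a real reader would parse headers or query a database.
        crs = "crs-of-" + input;
        envelope = new double[] {0, 0, input.length(), input.length()};
    }
}
```

Nothing has to be supplied at construction time, and the CRS and envelope are always derived together from the current input, so they cannot drift apart.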

Lack of javadoc
The Coverage I/O stuff has some mysteries and little javadoc explaining them. For
example, why does the "Crop" operation expect 2 envelopes? I would expect a Crop
to work with a single Envelope parameter value.

While reading some code, I feel like a geologist digging in the Earth's crust
and reading the Earth's history from the geologic layers. I had the feeling of
reading a little bit of the class's history by seeing what looks like patches
applied over patches. For example CoverageUtilities.prepareSourceForOperation(...)
had many redundancies in its "if ... else" statements, testing again stuff that
was already tested differently before, maybe because some conditions were added
at a later stage without revisiting the big picture. I cleaned up this little
method on trunk last week. My feeling (but I may be completely wrong) is that the
current AbstractGridCoverage2DReader is in a similar situation.

> For the GeoServer configuration interface, again, it would be
> interesting to know what you were looking for so that the next config
> and UI effort will take it into consideration.

The only thing we need is the connection parameters for a database, with
everything else removed. No file to specify, no CRS and no Envelope to provide, no
format to select, not even any layer to declare - all this stuff is in the database.

The way to declare a layer is a little bit "CoverageFormat" specific. For
example in the case of postgrid, the user just needs to push an "update this
layer" button. Postgrid will scan some known directories on the server, find any
new files it didn't know about before and update its database accordingly
(possibly asking the user some file-dependent questions).

So we need the ability to start from a blank sheet and put configuration options
that are totally different from what we would have for a classical 2D raster.

> > Anything reproducable that could be reported, analyzed and fixed?

As soon as the "WIDTH" and "HEIGHT" parameter values are smaller than the size
that the cropped image would have if it wasn't scaled, our image disappears. I'm
not totally sure that the bug isn't ours, which is why we are trying our code
through a mini-WMS before bothering the mailing list.

My intuition is that some code somewhere performs a division using integer
arithmetic or some other arithmetic that rounds the result to zero when we should
have a fraction between 0 and 1.
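
The suspected failure mode is easy to reproduce in isolation. The snippet below is only an illustration of integer division truncating a scale factor; the method names are made up and it is not taken from the coverage code.

```java
class ScaleFactorDemo {
    /** Integer arithmetic: any scale factor below 1 is truncated to 0. */
    static int integerScale(int requestedWidth, int croppedWidth) {
        return requestedWidth / croppedWidth;
    }

    /** Converting one operand to double first keeps the fraction. */
    static double fractionalScale(int requestedWidth, int croppedWidth) {
        return (double) requestedWidth / croppedWidth;
    }

    public static void main(String[] args) {
        System.out.println(integerScale(256, 512));    // prints 0
        System.out.println(fractionalScale(256, 512)); // prints 0.5
    }
}
```

A scale of 0 applied to an image would produce exactly the empty result described above, which is why a requested size smaller than the cropped size is the case worth checking.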

As far as projection is concerned, a quick look at some code suggests that many
projections are performed back and forth, maybe more than necessary (but I need
more investigation to be sure). The result is that we get a ProjectionException
for bounding boxes where some result should be possible. For example we are
unable to draw a Raster in Mercator projection over a world map in WGS84. I
understand that we can't project WGS84 coordinates to Mercator if we are close
to a pole, but the converse should be possible - currently it is not. Again I
need more investigation on this issue, and maybe to fix some code in the
referencing and coverage modules.

> A CoverageStore could be modelled against the WCS service interfaces the
> same way we modelled DataStore against the WFS datastore interfaces.
> Any interest in doing a common work in this direction, anyone?

I want to do that, it is on my schedule, but for now we are in emergency mode.
For now it is scheduled for March to May 2008. But I know that I promised a
coverage I/O review a very long time ago and have delayed it for years...



Posted by jive at Dec 17, 2007 11:30