Background to the ComplexDataStore project space, where you'll find information about the recently started project to incorporate complex feature support into GeoTools
- catalog.xml - an example of the kind of XML we want to be able to handle
This discussion is aimed at arriving at a mutually acceptable design for an evolutionary upgrade of geotools and geoserver - the ability to handle complex data types, including normalised database sources and "community" GML feature schemas. Many of the underlying abstractions have been anticipated in the code base, but are yet to be fully realised.
Community schemas can be thought of as:
- defined by a process involving more than one interested party
- published externally as a standard that a WFS must support
- comprising a range of related feature types suitable for business applications within the domain of the community
- supporting relationships between such feature types, as part of the business object definitions
This discussion is prompted by a specific, funded, enhancement requirement (see SEEGrid project scope) that has emerged out of several years' consideration of the issues with interoperability of data services (see Complex-Features Business Drivers - Use Cases) and practical experience of deploying WFS systems for a range of purposes.
Relevant geotools resources
The discussions below have been informed by the following documentation resources:
- Discussion on data source design http://vwfs.refractions.net/docs/DataStore_Performance.pdf
- Operations API discussion http://docs.codehaus.org/display/GEOTOOLS/Operations+API
- GeoTools Data access tutorial http://docs.codehaus.org/display/GEOTOOLS/Data%2Baccess%2Bbasic%2Btutorial
- fid-exp - brief explanation (Andrea can correct as needed): Fid-exp Branch
Still working up the database tables, but here are some examples of GML:
- https://www.seegrid.csiro.au/subversion/xmml/trunk/Examples/geochem/GA_1_samples.xml - this has relatively complex elements
- https://www.seegrid.csiro.au/subversion/xmml/trunk/Examples/geochem/GA_1_measurements.xml is more fully normalised with related features rather than inline values
The first strategy is to break the problem down into three aspects:
1) Features - how to map real data stores onto features with non-scalar properties
2) Serialisation - how to serialise features into externally defined schemas, including non-flat structures
3) Query - how to convert incoming queries into efficient queries against the back-end data structures.
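The three aspects above might be sketched as three separate responsibilities. The interface and method names below are invented for illustration only (none of them exist in geotools), and plain maps stand in for the Feature model:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the three aspects as separate responsibilities.
public class ThreeAspects {

    /** Aspect 1: map backend rows onto a feature with non-scalar properties. */
    interface FeatureMapper {
        Map<String, Object> toFeature(Map<String, Object> masterRow,
                                      List<Object> detailValues);
    }

    /** Aspect 2: serialise a mapped feature into an externally defined schema. */
    interface FeatureSerialiser {
        String serialise(Map<String, Object> feature);
    }

    /** Aspect 3: rewrite a public-schema query into back-end queries. */
    interface QueryTranslator {
        List<String> toBackendQueries(String publicSchemaFilter);
    }

    public static void main(String[] args) {
        // A trivial mapper: copy the master row, then attach a multi-valued
        // property whose values came from a detail table.
        FeatureMapper mapper = (master, details) -> {
            Map<String, Object> f = new LinkedHashMap<>(master);
            f.put("measurements", details);
            return f;
        };
        Map<String, Object> feature =
            mapper.toFeature(Map.of("fid", "sample.1"), List.of(4.2, 3.9));
        System.out.println(feature);
    }
}
```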
- geotools Feature implementation needs to be checked, and extended if required, to fully support the required feature model.
- plug-in architecture for mapping queries to certain data store patterns will allow developers to solve local problems
- (future work) The same plug-in architecture will allow developers to easily create DB schemas around existing query patterns, and simply reuse a query configuration.
- Serialisation will be driven by mapping of features into a public schema, using the configured set of mappings.
- This mapping will be generated through inspection of the external schema and creation of a mapping table. Initially this will be manually configured, but a UI configuration wizard could be created to implement this.
- Change requests to the WFS spec will be created to allow a WFS to advertise the specific query templates it can actually support against a feature type. (Absence of these would default to the current capabilities, where it is assumed that every operation can be used in any combination against any feature property.)
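As a rough illustration of the mapping-table idea above: at its simplest the manually configured table just relates storage columns to public-schema element paths. All column and element names below are made up, and a plain Map stands in for whatever configuration format is eventually chosen:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a manually configured mapping table: storage column
// names on the left, public-schema element paths on the right.
public class MappingTableSketch {
    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("SAMPLE_NO", "xmml:Specimen/gml:name");
        mapping.put("X_COORD",   "xmml:Specimen/xmml:location");

        // One row from the storage schema (illustrative values).
        Map<String, Object> row = Map.of("SAMPLE_NO", "GA-1", "X_COORD", 142.3);

        // Apply the mapping: each storage column lands at its target path.
        Map<String, Object> mapped = new LinkedHashMap<>();
        mapping.forEach((col, path) -> mapped.put(path, row.get(col)));
        System.out.println(mapped);
    }
}
```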
- Allow any property to be mapped to a detail table using a key (default would be fid) to create multi-value, multi-facet properties.
- Allow "joining" of any feature collections using existing data store abstraction (allow for example a shapefile to be joined to a database table)
- expressions to join and split storage schema elements to create output schema elements may be in the native syntax of the datastore - they are not exposed via WFS.
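A sketch of how the detail-table mapping for a multi-valued property might be declared, assuming fid as the default key column. The record, method, and table names are invented for illustration:

```java
// Hypothetical per-property detail-table mapping, with fid as the default key.
public class DetailMappingSketch {
    record DetailTableMapping(String property, String detailTable,
                              String keyColumn, String valueColumn) {
        // Convenience factory applying the default key column.
        static DetailTableMapping withDefaultKey(String property,
                                                 String detailTable,
                                                 String valueColumn) {
            return new DetailTableMapping(property, detailTable, "fid", valueColumn);
        }
        // The select that would fetch all values for this property.
        String selectSql() {
            return "SELECT " + keyColumn + ", " + valueColumn
                 + " FROM " + detailTable;
        }
    }

    public static void main(String[] args) {
        DetailTableMapping m = DetailTableMapping.withDefaultKey(
            "xmml:measurement", "MEASUREMENTS", "VALUE");
        System.out.println(m.selectSql());
    }
}
```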
- The query will need to collect all the matching FeatureIds (fids) and then collect all the related values in additional queries against the mapped property tables.
- per property, run one select query against the list of fids, then traverse the output results, inserting values into the FeatureSet one at a time,
- or create the set of properties on traversal, then insert the values into an in-memory feature set (this seems better than a query per fid - a lot of queries - inserting into one feature at a time)
- as a first cut, we'll allow the default query strategy (against only the scalar properties of the master table)
- extension by dropping in alternative query strategy classes. Map the default query strategy to a FeatureType when configuring?
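The fid-collection strategy above can be sketched as two phases, with in-memory stand-ins for the JDBC queries. The table, column, and fid names are illustrative only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Two-phase sketch: phase 1 collects the matching fids; phase 2 runs one
// query per mapped property against the whole fid list, and the results are
// folded into in-memory features one value at a time.
public class TwoPhaseQuery {
    public static void main(String[] args) {
        // Phase 1: fids matched by the filter against the master table.
        List<String> fids = List.of("f1", "f2");

        // Phase 2 stand-in for:
        //   SELECT fid, value FROM measurements WHERE fid IN ('f1', 'f2')
        List<Map.Entry<String, Double>> detailRows = List.of(
            Map.entry("f1", 4.2), Map.entry("f1", 3.9), Map.entry("f2", 1.1));

        // Traverse the result set once, inserting each value into its feature.
        Map<String, List<Double>> byFid = new LinkedHashMap<>();
        fids.forEach(fid -> byFid.put(fid, new ArrayList<>()));
        for (Map.Entry<String, Double> row : detailRows) {
            byFid.get(row.getKey()).add(row.getValue());
        }
        System.out.println(byFid);
    }
}
```

This is the "one select per property against a list of fids" option; the alternative (one query per fid) would issue `fids.size()` queries per property, which is why it looks worse above.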
Chris Holmes' initial comment:
I think we could define a 'virtual' datastore that would take two or more other datastores as its inputs. Then we could define virtual FeatureTypes, whose definitions would be the queries. In GeoServer we developed the notion of a View of a datastore, and I think this idea should be extended. With a view you could define attributes to hide or make mandatory, and you could give a definition filter. Any query to the featureType would pass through the definition query first. This was limited to a single FeatureType, however, so the idea needs to be extended - I think first to 'virtual' featureTypes in the same DataStore (get some attributes from one table and some from another), and once we've got that, figure out how to make a DataStore from two or more different datastores, and use the same mechanism to create the virtual FeatureTypes.
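A toy illustration of the virtual FeatureType idea, with maps standing in for two tables in the same DataStore; the table names, fids, and attribute names are invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A 'virtual' featureType drawing some attributes from one table and some
// from another, joined on fid.
public class VirtualFeatureTypeSketch {
    public static void main(String[] args) {
        // Two source tables, keyed by fid.
        Map<String, Map<String, Object>> samples = Map.of(
            "f1", Map.of("name", "GA-1"));
        Map<String, Map<String, Object>> locations = Map.of(
            "f1", Map.of("x", 142.3, "y", -35.1));

        // The virtual feature merges both rows for the same fid.
        String fid = "f1";
        Map<String, Object> virtual = new LinkedHashMap<>();
        virtual.putAll(samples.get(fid));
        virtual.putAll(locations.get(fid));
        System.out.println(virtual);
    }
}
```

A definition filter, as in the GeoServer View, would simply be applied before this merge when selecting which fids participate.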
Code base analysis:
The existing geotools JDBC "AttributeReader" stuff looks like a promising start.
NB - the AttributeReaders are really "RowSetReaders" or "ResultSetManagers", and maybe need to be refactored into ResultSetManagers plus AttributeReaders that operate on a single attribute but do lazy reads against the managed resultSets.
One could subclass to specialise attribute readers... but we really want to inherit this capability regardless of whether the geometry handler is SDO, SDE, etc.
So I think we need to be able to exploit the schema-driven attribute readers at the JDBC level...
Yes, it was designed with this very goal in mind.
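The suggested split might look roughly like this, with a list of maps standing in for a JDBC ResultSet. All class names here are hypothetical, not the existing geotools ones:

```java
import java.util.List;
import java.util.Map;

// Sketch of the refactor: a ResultSetManager owns cursor movement over the
// row set, while each AttributeReader lazily reads a single attribute from
// the managed current row only when asked.
public class ReaderSplitSketch {
    interface SingleAttributeReader {
        Object read(Map<String, Object> currentRow);
    }

    static class ResultSetManager {
        private final List<Map<String, Object>> rows; // stand-in for a ResultSet
        private int cursor = -1;
        ResultSetManager(List<Map<String, Object>> rows) { this.rows = rows; }
        boolean next() { return ++cursor < rows.size(); }
        Map<String, Object> current() { return rows.get(cursor); }
    }

    public static void main(String[] args) {
        ResultSetManager rs = new ResultSetManager(
            List.of(Map.of("name", "GA-1", "x", 142.3)));
        // Lazy: nothing is read until the reader is invoked on the current row.
        SingleAttributeReader nameReader = row -> row.get("name");
        while (rs.next()) {
            System.out.println(nameReader.read(rs.current()));
        }
    }
}
```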
NB - fid generation has been changed to an interface with different strategies, which is a nice improvement that should work well with these changes, as the joins will likely rely on fids.
Answers from Chris...
Q) Is the existing setup capable of handling multi-valued attributes? (I suspect not, because I don't see the ability to specify a join table for an attribute anywhere.)
A) It's capable 'in theory', but not in practice. Which is to say we
designed Features to be able to do this. I coded up a
MultiAttributeType, and there is a FeatureAttributeType as well. But
no datastores make use of them, so they have never been tested.
Q) Is there a data store capable of handling point geometry in multiple columns (x, y, z) - e.g. for MySQL, but also for the many legacy databases out there like this?
A) None currently do, but it's fairly trivial with JDBC. You could just subclass MySQLDataStore and override createGeometryReader to return a special LegacyGeometryReader that, instead of reading WKT, reads the two or three specified columns and creates the JTS geometry from those.
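A sketch of such a reader, assuming columns named X, Y, Z and using a plain record in place of a JTS Point; the class and method names are invented, not the real createGeometryReader contract:

```java
import java.util.Map;

// Sketch of a legacy geometry reader: build a point from x/y/z columns
// rather than parsing WKT. A record stands in for a JTS Point here.
public class LegacyGeometryReaderSketch {
    record Point(double x, double y, double z) {}

    // Read the three coordinate columns from the current row.
    static Point readLegacyPoint(Map<String, Object> row) {
        return new Point(
            ((Number) row.get("X")).doubleValue(),
            ((Number) row.get("Y")).doubleValue(),
            ((Number) row.get("Z")).doubleValue());
    }

    public static void main(String[] args) {
        Point p = readLegacyPoint(Map.of("X", 142.3, "Y", -35.1, "Z", 12.0));
        System.out.println(p);
    }
}
```

In the real subclass the row would come from the managed JDBC ResultSet, and the record would be replaced by a JTS geometry built via a GeometryFactory.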
Q) With the reader architecture, would one see related tables queried only when serialising?
- specifically, we would want to do this either at the post-query filter, to enforce filter clauses that we don't want to have to construct joins for, or at the initial filter. When we really want to filter on complex properties in the first cut, we would need to do a join, so we could retrieve the data in a denormalised set.
Abstracting a query strategy based on analysis of the query would be nice - i.e. named queries (WFS spec extensions required), or patterns referencing related features or the internal structures of complex properties, could trigger such query strategies.
A) We should be able to do the enforcement of filter clauses twice; indeed we do it now in GeoServer even though it's not so necessary. We just join the filters together, since they're going to the same datastore, but it wouldn't be too hard to hold the filter separately - the architecture leaves room for it. The filter would be part of the 'virtual' featureType, and you would retrieve a FeatureSource, which would have knowledge of the filter(s) to be used for its querying of its components. From those results it would be able to take a Query object, which may include a Filter, and return the proper FeatureResults (made available as a Reader).
Crude Configuration example for joins (within a single data store)
obviously this could be
need to have a consistent way of referencing fields in arbitrary feature types - and what if the joined data isn't a FeatureType in its own right - is this too heavy an overhead?
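For concreteness, a crude sketch of what such a join configuration might look like, assuming an invented XML vocabulary - no such format exists in geotools today, and every element, attribute, table, and datastore name below is made up:

```xml
<featureTypeMapping name="xmml:Specimen">
  <masterSource dataStore="oracleStore" table="SAMPLES" fidColumn="SAMPLE_NO"/>
  <attribute name="gml:name" column="SAMPLE_NAME"/>
  <!-- multi-valued property backed by a detail table, keyed on fid -->
  <attribute name="xmml:measurement" multiValued="true">
    <detailSource dataStore="oracleStore" table="MEASUREMENTS"
                  keyColumn="SAMPLE_NO" valueColumn="VALUE"/>
  </attribute>
  <!-- cross-datastore join, e.g. a shapefile joined to a database table -->
  <attribute name="gml:location">
    <detailSource dataStore="shapeStore" typeName="sample_points"
                  keyColumn="SAMPLE_NO" valueColumn="the_geom"/>
  </attribute>
</featureTypeMapping>
```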