This page documents my Google Summer of code project "Caching Data in uDig". You can read my projet proposal at http://rochecrg3.scg.ulaval.ca/crg/dev/soc07/udig.
Here I will try to keep track of ideas, development notes and work progress.
First I propose some user stories, in which I dumped first ideas. I borrowed ideas from Andrea's : Datastore caching
I will update these stories as work goes on, and I get a better idea of the needs.
User story #1 : Datastore caching
Datastore caching should allow to wrap transparently any other
Datastore (ie. the source Datastore) and leverage related subsequent
Datastore cache acts as buffer between the client and the actual
source Datastore. It may be persistent over client sessions.
The maximum capacity can be set by the user.
General purpose framework
Can we propose a general framework for arbitrary data caching ?
Raster and vector data models API are not unified in Geotools.
Data to be stored and queries
One issue is the way features are retrieved from Datastore : a
Datastore usually receives a Query, and features are not retrieved by
id. Slightly different queries will yield a common set of features.
What we will store are features, rather than query results.
But we need to track queries, and some kind of relationships with
result features. How do we want to remember previous queries ? What
relationships with features do we keep track of ?
First, we suggest we keep track of queries spatial bounds and keep a
spatial index of the cached features.
What type of queries the cache should leverage ?
- BBox Filter
- FID Filter
Other queries may require to keep other indices of features (but this
way it will be possible to implement specialized cache for performance
sensitive operations. This is an extension mechanism we should take
care to provide.)
Cache should delegate unhandled queries to source Datastore. Two step
query of the datasource may be used :
1. get fid list from datasource
2. get features by fid that the cache doesn't already hold.
We will have to check this is performance-wise, but at least this will
save bandwidth when working with remote datastore.
Feature is not Serializable.
We have to find some way to serialize data represented by features.
What if Datastore schema changes ?
What if cache get out of sync with source datastore ? How do we detect this ?
When the cache is full, it must make room for new data. We will have
to define an eviction strategy for feature cache and index (or
In the low bandwidth scenario, data may be best copied from the source server and distributed using CD-ROM or DVD-ROM. Clients would appreciate to download only updates from the server, so they will still be able to display up-to-date data.
We would like to set up a datastore that can connect to two (or more ?) source datastores which store the same features, but possibly different versions. For example, one of the datastore would be a local storage, with good performance access, but holding data frozen at time t. The other datastore would represent a remote server.
The multisource datastore would then retrieve features :
- from the local store, if the feature has not been modified
- from the remote store, if the feature has been modified.
It should be able to ask the remote store for what features have been modified, and update the local store.
Since we have to know if the original data has changed since time t, this service seems not be possible without some kind of data versioning system.
This could involve a sequential checkout (based on spatial extent) from a versioned remote datastore.
This is the scenario where a client has a local copy of a data source and wants to download updates from the original source. The local copy is used to initialize the cache, and client asks for updates to the remote source, and updates cache (or local copy).
See also :
- Geoserver http://docs.codehaus.org/display/GEOS/Versioning+WFS
In the low or no bandwidth scenario, users may want to edit locally the data, and commit modifications when possible.
Data cache using persistent storage will allow clients to carry on work when disconnected.
At commit time, the original data may have been modified by another client. This is a well-known issue of concurrent modifications
Scope of intended work, roadmap
The initial goal is to allow client|application to cache data when querying remote sources, and make disconnected mode possible. Handling asynchronous updtaes is a plus.
In this section, I try to detail deliverables and scope of intended work.
1. in-memory data cache, read-only
The most simple scenario.
I will propose an extensible framework for DataCache.
I will propose a test framework to assess DataCache functionalities.
I will test different implementation of DataCache against MemoryDataStore, the reference implementation of DataStores. Different implementations include testing different types of spatial indices.
2. client demo using cache
A demo would be nice to appreciate the actual benefits of DataCache.
First, I can think of a very simple client - use JMapPane, with two panes (cached, control) to visually compare rendering times ?
Second, I can try a demo using uDig (adding programmatically a cached DataStore to catalog).
3. persistent data cache (disk storage), read-only
Extension of 1 to disk-based data cache.
I will consider using Hsql DataStore as cache backend - but I don't know if plugin is mature enough to be used (ask Jody ?). Think of other solutions ?
I will also consider serializing features to byte arrays and using http://research.att.com/~marioh/spatialindex/ spatial index framework.
Different types of DataCache might be chained : typically
Cache must be initializable from its storage.
4. Write-through data cache
Simple handling of edits, where cache behaves as if there was no cache on edits.
5. Write-back data cache, assuming DataStore handles nicely version conflicts, ie the cache does not take care about version issues and commit conflicts
6. - OPTIONAL - sequential updates
We assume client has some way to identify the version of the copy of the source it currently has.
Versioning is beyond the scope of the intended work, because it may take us real far.
But we propose to try some experiments
This could involve setting up a web service (server-side), which :
- keeps track of versions signature (ie feature tree that make up a version)
- is able to look up the original source for modification since given revision, and send updates to client.
I can also test ideas using Geoserver versioned WFS (based on Andrea postgis versioned DataStore)
7. - OPTIONAL - propose uDig implementation of connected/disconnected mode
I would be glad to have time to achieve this, but can't figure out what is the required work and how much time it will need.
Next things I am planning to work on, unordered for the time being :
- FeatureCache : move eviction mechanisms to QueryTracker
- document FeatureCache
- have an in-depth look at Transaction and how it works
- implement cache listens to DataStore for FeatureEvents
- post project javadoc somewhere
- update UML diagram to reflect design changes
- using threads idea adress
The sequence to query the cache I have actually implemented is
User has to wait for all these operations to terminate, whereas he could get at least a first anwer from the cache. Using threads, we can imagine this other sequence :
- Tracking queries within a quadtree idea adress
Most of the cache behavior should be under control of the query tracker, which knows the known parts of the universe and may record how often each part of the universe is accessed.
We can't remember every query, but we can idealize or normalize them by expanding each to a grid. Typically, a tile will be a quadrant of a quadtree, and we approximate a query by the minimum containing quadrant.
We can attach to each quadrant :
- access statistics
- possibly a list of features
On eviction, we pick a leaf quadrant, ie a quadrant that has no child, in the tree and make it invalid. Parent quadrants should be marked invalid or partial. When four quadrants in a quadrant are invalid, they should be removed from that quadrant, which reverts to a leaf. A leaf is always marked partial, unless it has the lowest allowed level.
In order to find the leaf to evict, we traverse the tree by level, looking for a quadrant that matches the criterium (for example, least recently used). If found, we iterate from that quadrant. If not found, ie all quadrants at given level are equals according to the used criterium, we do an exhaustive lookup of the lower level, and so on. If we have to exhaustively traverse many levels, this will be time consuming, and we may revert to random eviction after a given number of levels traversed with no matches.
On look-up, given an input bbox query, we first search for the minimum containing quadrant. Then we test if that quadrant is valid and entire. If not valid, we'll have to fetch this quadrant from the source store. If not entire, we search for invalid inner quadrants. To minimize the number of quadrants to fetch, each quadrant should yield less than a maximum number of quadrants, possibly depending on the level of the quadrant.
Development notes and progress
Some notes of the work being done through this Summer of Code project, and report on progress.
- working on QueryTracker
- tried to have InMemoryDataCache pass AbstractDataStoreTest, then MemoryDataStoreTest (because MemoryDataStore does not pass AbstractDataStoreTest). can't pass testFeatureEvents.
- change in design : feature cache should be an extension of FeatureStore rather than DataStore. So I designed a new interface FeatureCache that extends FeatureStore. A DataCache will then be no more than a collection of FeatureCache, one per FeatureType.
- wrote FilterSplitter : a FilterVisitor that splits a filter into two filters, isolating spatial restrictions from other attribute restrictions.
- tried another R-tree framework : http://research.att.com/~marioh/spatialindex ; this framework provides visitors and query strategies to query the tree, which is great, but tree seems to be slower than org.geotools.index.RTree
- committed first implementation of FeatureCache : InMemoryFeatureCache, using LRU eviction. Does work, but not that fast.
- submitted a presentation proposal for FOSS4G
- planned to continue work on a demo using JMapPane, but task is delayed until I have a better implementation of FeatureCache
- move eviction mechanisms to QueryTracker
- try to understrand better how transactions are working. I will need this when going from read-only cache to write-thru.
- update UML diagram to reflect design changes
Weeks 3 & 4
- updated wiki page
- looked at JMapPane and tried a first demo ; well, this was a try
- off to a seminar, so not much work done.
- worked on : in-memory read-only data cache (continued)
- created module unsupported/caching under trunk, and configured my project build
- documented first code
- drawn UML class diagram of draft implementation
- tried a FeatureMarshaller that will handler DefaultFeature
- Reading : caching (JBoss Cache, Pojo Cache), OGC Reference Model, versionning (GIT, looked at Andrea's postgis versioned DataStore)
- tried to detail scope/extent of the intended work of this Summer fo Code project
- found how to run , added this section to pom.xml :
- think of an eviction strategy for flushing part of a spatial index. Idea would be to walk along the tree to find suitable nodes we can get rid of. But org.geotools.index.RTree implementation does not ease this. Tree could be parsed by recursively splitting top bound box into quadrants.
- talk with Andrea about versioning and his idea of storing features by quadrants, ie tiles.
- post project javadoc somewhere
- worked on : in-memory read-only data cache
- readings : caching, spatial indexing
- draft implementation of classes - see attached UML diagram
- created wiki page with user stories
- implement cache size limit
- handle better multiple types in a DataStore - for now I assumed there was one and only one FeatureType in the DataStore