GeoTools : Shapefile Index Support

From's Dave's email: (see Random Data Access)

You can access an individual row in a shapefile in O(1) time if you
know the shapeid (ie. row number).

a) open up the .shx file for random access. Its a constant size per
row, so you can grab the row you want immediately. This will tell you
the location (in the .shp file) of where the spatial information is.

b) open up the .dbf file for random access. This is also a constant
size per row (the header will tell you this size), so you can easily
access the row you want.

ESRI has a technical specification of the shapefile available online.
There are some subtle issues for shapefile especially if they've been
edited, but I've never seen an .shx lie...

By bounding box is a bit more complex. For no spatial index (this
minimizes the reading from the .shp and .dbf files):

a) open up .shx for sequential reading
b) open up the .shp for random reading
c) open up the .dbf for random reading

d) for each row in the .shx
grab the BBox from the start of the spatial object in the .shp
If it intersects your search BBOX
read the rest of the spatial object from the .shp
read the appropriate .dbf record

With a spatial index, means you replace (d) above with:

d) use the spatial index to find spatial objects that intersect the search BBOX
read the spatial object from the .shp
read the appropriate .dbf record

NOTE: it possible that the spatial index will be larger than the
.shp file (this will always be true for points and two-point
linestrings). Its certainly likely that the index will not fit in

This being said, the UMN WMS mapserver uses a Quad-Tree based index
(.qix) for spatially indexing shapefile. Its very quick! I'm not
sure how well it will perform in a Java environment...

You should also look at the PostGIS LWGEOM indexing - its based on
FLOAT4 coordinates instead of FLOAT8s. Basically, when I convert the
FLOAT8 bbox to FLOAT4, I expand it slightly so the FLOAT8 is contained
inside the FLOAT4 box. (this is easy to do by manipulating the IEEE
binary representation of the float4)

This makes the index slightly less selective, but for practical
purposes its unnoticeable. There are some issues in the RTree index
formation when dealing with numbers so close to the edge of FLOAT4
precision - there's a complete discussion in the postgis developers
mailing list in march/april.

The advantage of this is that your index is half the size and your
disk cache is twice as like to contain the index pages!

If you wanted to index the columns of the .dbf, you'll probably be
able to get decent performance with a simple b-tree implementation.
You can estimate performance by looking at PostGIS performance with
indexes built on the columns you're querying.