A blog about life, technology & databases
The Shapefile 2.0 manifesto
Geographic Information Systems (GIS) are by their nature data driven. The data comes in a wide variety of raster and vector formats. Rasters hold raw, continuous data recorded straight from the real world. An example is satellite or aerial imagery, which is commonly held in an open format with broad support, such as GeoTIFF or GeoJPEG.
Vector formats hold refined, discrete data, which has been manually traced or otherwise derived from other data sources. Examples include building outlines, contours, road routes, pipe networks, land parcels and locations. Vector data is usually traced or derived, at great expense, from raster data to encode business information – as a result it’s usually highly valuable.
Unfortunately, there are many GIS vector file formats, and most are proprietary. They can only be used to their full potential in their native software. Three of the biggest are AutoCAD DXF, MapInfo TAB and ArcGIS Personal Geodatabase. One vector format is unique – both an open standard, and in wide use: Shapefile.
Shapefile is publicly documented in the ESRI Shapefile Technical Description by ESRI Inc., its creator. Any GIS software worth its salt can read and write the format, so it’s become the least common denominator. It is the format for storing and exchanging vector data between teams, departments, businesses and government. In my opinion this makes Shapefile the best thing ever to happen to GIS; without it the GIS market would be a fraction of its current size.
Despite its popularity, Shapefile does have some serious limitations, mainly due to its DBF heritage:
- A shapefile is limited to 4 GB (corrected from 2 GB) or 4 billion/len(record) records (corrected from 65535), where len(record) is the greater of either the average feature length in bytes, or the length of a DBF record.
- Records are limited to 65536 bytes (corrected from 1000) or between 257 & 2038 fields (corrected from 32).
- Field names are limited to 10 characters (corrected from 8), character fields can hold up to 254 bytes.
- Unicode is not widely supported (corrected from “not supported”).
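To put the record limit in concrete terms, here is a minimal sketch of the arithmetic in Python. It assumes the 4 GB ceiling quoted in the list; the function name and the example record sizes are invented purely for illustration.

```python
# Rough arithmetic only: assumes the 4 GB ceiling quoted in the list above.
FOUR_GB = 4 * 1024**3


def max_records(avg_feature_bytes, dbf_record_bytes, limit_bytes=FOUR_GB):
    """Estimate how many records fit before either file reaches the limit."""
    # Whichever of the two per-record sizes is larger fills its file first.
    record_len = max(avg_feature_bytes, dbf_record_bytes)
    return limit_bytes // record_len


# Hypothetical layer: 120-byte geometries with 400-byte attribute rows.
print(max_records(120, 400))
```

With small records the ceiling runs to millions of features, which is consistent with the large shapefiles that people report using in practice.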
Currently the only real alternative for data exchange is Geography Markup Language (GML), as defined by the Open Geospatial Consortium (OGC). An XML dialect, GML has none of the limitations of Shapefile, which is why Ordnance Survey use GML to supply MasterMap, a highly detailed vector map of Great Britain. Support for GML in software is growing, but it’s unsuitable as a storage format.
Viewing and editing vector data requires support for random access by attribute and by spatial extent. As an XML dialect, GML cannot do this: to find one record, the entire file must be parsed from beginning to end. GML is almost always converted to another format, or loaded into a spatial database, before it is used.
A spatial database is a database with data types and functions able to handle geospatial data. For the major databases there is Oracle Spatial, PostgreSQL PostGIS, SQL Server Spatial, MySQL Spatial and DB2 Spatial Extender. All are based on Simple Features for SQL, an open standard, meaning spatial data can be queried and updated with SQL like any other data type.
I believe that a portable, standalone spatial database would make a very good successor to Shapefile. Such a format would drive the GIS market forward, increasing usage of GIS by making it easier to edit, publish and share GIS data. A portable spatial database would negate the need for the import, view, edit, export cycle that GML imposes.
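The cost of sequential access can be sketched with Python's standard XML parser. The document below is a simplified stand-in, not real GML (no namespaces or geometry), and the element names are hypothetical; the point is only that reaching a feature means reading everything before it.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, GML-like sample; real GML uses namespaces and geometry elements.
SAMPLE = b"""<FeatureCollection>
  <Road id="r1"><name>High Street</name></Road>
  <Road id="r2"><name>Mill Lane</name></Road>
  <Road id="r3"><name>Station Road</name></Road>
</FeatureCollection>"""


def find_feature(stream, feature_id):
    """Scan from the start of the document until the wanted feature appears.

    Returns (name, number_of_features_scanned).
    """
    scanned = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "Road":
            continue
        scanned += 1
        if elem.get("id") == feature_id:
            return elem.findtext("name"), scanned
        elem.clear()  # release features we have already passed
    return None, scanned


print(find_feature(io.BytesIO(SAMPLE), "r3"))
```

There is no index to jump to, so the work grows with the position of the feature in the file, and a lookup that fails has read the whole document.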
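As a rough illustration of the idea, the sketch below uses plain SQLite from Python's standard library. It is not Spatialite, SDF or any existing format: the table layout, the bounding box columns and the function names are all assumptions made for the example, and geometry is simply kept as WKT text.

```python
import sqlite3


def build_demo(path=":memory:"):
    """Create a single-file (or in-memory) database holding two parcels."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE parcels (
               id INTEGER PRIMARY KEY, owner TEXT, wkt TEXT,
               min_x REAL, min_y REAL, max_x REAL, max_y REAL)"""
    )
    # An ordinary index on the bounding box stands in for a real spatial index.
    db.execute("CREATE INDEX parcels_bbox ON parcels (min_x, max_x, min_y, max_y)")
    rows = [
        (1, "Smith", "POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))", 0, 0, 10, 10),
        (2, "Jones", "POLYGON((20 20, 30 20, 30 30, 20 30, 20 20))", 20, 20, 30, 30),
    ]
    db.executemany("INSERT INTO parcels VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    db.commit()
    return db


def parcels_in_extent(db, x0, y0, x1, y1):
    """Return (id, owner) for parcels whose bounding box overlaps the extent."""
    cur = db.execute(
        "SELECT id, owner FROM parcels "
        "WHERE max_x >= ? AND min_x <= ? AND max_y >= ? AND min_y <= ? "
        "ORDER BY id",
        (x0, x1, y0, y1),
    )
    return cur.fetchall()


db = build_demo()
print(parcels_in_extent(db, 5, 5, 15, 15))
```

Passing a file path instead of `:memory:` gives a single file that can be copied between users and queried in place, with no import or export step.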
At the moment I see 3 contenders for the crown:
- File Geodatabase is a format from ESRI, natively supported by ArcGIS. ESRI proclaim it “Allow[s] users to easily exchange geodatabases.” That is true only if both users are running ESRI’s ArcGIS software. File Geodatabase is a proprietary format, despite promises by ESRI when it was launched.
- Spatial Data Format (SDF) is a format from Autodesk, which supports it natively. Support is included as part of their Feature Data Objects library, released as Open Source. SDF is based on the popular SQLite embedded database engine.
- Spatialite is another format based on SQLite, by Alessandro Furieri. Spatialite is still in its infancy; its first release was 11 months ago.
Unfortunately none of these looks like it will become a clear winner any time soon. Each is supported by only one application currently. If ESRI releases the specification for File Geodatabase, I expect it will quickly gain widespread support due to their position as market leader. As open source applications such as QGIS gain Spatialite support, it could slowly achieve dominance in a grass roots fashion. SDF seems to be going nowhere.
So ESRI, please publish the details of File Geodatabase. At its launch, during the 2006 ESRI User Conference, you promised that File Geodatabase would be an interoperable format. You promised to release a software library, so we could read and write them without ArcGIS. Neither has happened. So File Geodatabase is just another closed format, another pretender to the throne that’s achieved only 1% of its true potential.
Publish File Geodatabase, or we’ll take the Shapefile crown by force.
Update 27 Mar 2009: Corrected Shapefile limits, based on Xbase file structure rather than dBASE software specifications.
This entry was posted by Alex Willmer on 1 March 2009 at 1:13 pm, and is filed under arcgis, database, open source, rants, standards.
about 1 year ago
One of the reasons GML was developed was to get away from the idea of ad hoc file exchange as the dominant means of sharing geospatial information. Thus GML is not intended to be a database, but rather a means of encoding the content of geospatial transactions that update geospatial or other databases. I don’t think we need yet another database encoding. Unlike ShapeFiles which have no feature model and are in no way self describing, GML provides multiple mechanisms for data self description including:
* Use of an explicit object model in which features are named elements, the element name expressing the type (semantic) of the feature. We would expect to see in our GML.
* Use of user defined application schemas that describe the features/objects being transferred. One would create a vocabulary that contained the specific feature types of interest such as CongressionalDistrict, CensusTract2000, CensusTract1990 etc. One can verify the structure of a given instance by validating it (through any validating XML parser) against the schema. The schema with which a given data instance is to comply is explicitly specified in the instance itself.
* All relevant data is explicitly linked together. For example a CongressionalDistrict can reference the County element (County feature) from a county property or just provide a name as is done in the Shape file.
* Both geometry-valued and non-geometry valued properties are part of the same feature object (element), hence there is no ambiguity as in Shape files where we do not know a priori which .dbf file goes with a given .shp file.
* The CRS is explicitly specified, for each geometry in an XML feature, via the srsName attribute, whereas it is known only by association of file types (e.g. a .prj file is in the same ZIP package as the .shp file) in the case of Shape files.
Since we already have lots of spatial databases – not sure we need yet another one.
Cheers
Ron
about 1 year ago
File-based geodatabase wouldn’t be bad, but isn’t it a multi-file format? That’s always been my main complaint with SHP, well apart from the proprietary spatial indices and character width limits.
My hope is with the SQLite format. Using proper relational design, you get the majority of the benefits that Ron talks about above, built-in spatial indices, etc. Adoption of SQLite is reasonably far advanced in the open source world; OGR and FDO can both access spatial features in SQLite, and I believe that OGR (at least) can read the Spatialite schema.
about 1 year ago
Hi Ron, Thank you for commenting, I’ll be sure to follow your writings.
Putting an end to ad hoc file sharing, and moving to online collaboration is an attractive prospect. It’s working very well for OpenStreetMap. In response to your points:
- I would say that it’s people, rather than a markup language, that give data semantic meaning. I don’t see how a spatial database containing a table with a geometry column called ‘census_tract’ and a date column has less meaning than a GML document with geometries having a name=’CensusTract2000’ attribute. It’s unfortunate that common practice is to name geometry columns the_geom and shape.
- A proper spatial database holds multiple tables, with referential integrity between them.
- This ambiguity is a quirk of the Shapefile format, not of spatial databases in general.
- Per feature SRS is not a capability I’ve needed so far, though it’s good to know that GML supports it.
It would be great if WFS or some other web centric online editing system were the norm, but in my world the software vendors choose not to support this. They barely support reading GML, and loading it requires special loading software, at extra cost. I work with too many users that aren’t willing to send their precious data over a public network, or organisationally aren’t ready to hook their internal GIS to an external editing system.
So ad hoc file exchange is the order of the day, and if I’m going to do that I want to make it as hassle free as possible. Of course, everything I just said also makes a good argument for finding better vendors and better users.
about 1 year ago
Nice article, +1 for Spatialite here.
BTW, another serious limitation for shapefile is not being a topological structure.
about 1 year ago
What is the limit of 65535 records for a Shapefile based on? On a regular basis Shapefiles are delivered to us having up to 1M records. No problem there. I do agree that the format is kinda “retro”…
about 1 year ago
Great post and I am glad to see that this post is getting some attention. This has been an area of contention for me for years.
If you look at what the neogeographers have done, they have abandoned all of the traditional formats and gone to spatial feeds via KML (which actually forced the OGC’s hand into adopting it as a best practice format) or geoRSS.
Yes, I fully realize that the source data is stored in an underlying system but maybe this is where we head.
To me GML has become too academic, the spec is now so large that printed it is a 4″ binder – a little out of touch with reality – sorry Ron – just my opinion.
about 1 year ago
Interesting post. Certainly Shapefiles are dominant: users that opt into the format usage tracking stats collection that we have in the FME product (www.safe.com) over the past several years are overwhelmingly using shape files as either a source or a destination (and surprisingly, the #1 overall translation that seems to be done by FME is Shape to Shape, which highlights to us that while moving data between formats may be important, transforming data between data models must be even more important). However, while I’m among the first to agree that the DBF portion of the shapefile is an interesting lesson in software archaeology, I’d like to clarify that from our experience you can have 4GB shapefiles, not 2GB, though it is highly probable that most/much software may incorrectly handle this case (it has to do with whether signed or unsigned integers were used in the offset calculating math when the shapefile was produced, or else read). We also see many many many shapefiles with more than 65k records, so I don’t think that limitation is there. As well, though officially it would seem only upper case attribute names are allowed, we also have seen mixed case attrs work with ArcGIS, so the format has evolved in that sense. And you definitely can store unicode as well as other encoded attribute data in shape files: there are 2 ways that this information ends up being remembered and these are well supported by ArcGIS. So the format definitely is capable in many ways.
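The signed versus unsigned point can be sketched in a few lines of Python. This assumes, following the ESRI technical description, that record offsets are stored as 32-bit big-endian counts of 16-bit words; the function names are hypothetical and the code only models the arithmetic, not any particular reader.

```python
import struct

INT32_MAX = 2**31 - 1    # largest signed 32-bit value, just under 2 GB
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value, just under 4 GB


def stored_offset(byte_offset):
    """Pack a byte offset as a big-endian 32-bit count of 16-bit words."""
    return struct.pack(">i", byte_offset // 2)


def reader_can_reach(byte_offset, signed_math):
    """True if a reader holding byte offsets in 32 bits can address the record."""
    limit = INT32_MAX if signed_math else UINT32_MAX
    return byte_offset <= limit


three_gb = 3 * 1024**3
print(len(stored_offset(three_gb)))              # the word count packs into 4 bytes
print(reader_can_reach(three_gb, signed_math=True))
print(reader_can_reach(three_gb, signed_math=False))
```

Under these assumptions the file itself can describe a record 3 GB in, but software that converts the offset to bytes in a signed 32-bit integer cannot reach it.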
However, all that in mind, it still would be nice to have a public API for “FILE” Geodatabase — I’ll go on the record as saying that Safe Software will be among the first to implement a reader/writer against this once it comes out if it is a C++ API (other languages will introduce a challenge and potential cross-platform issues but one we’d work to overcome). Perhaps at the Dev Summit in a couple weeks time there will be some news.
In any case, there is no denying that a nice self contained file format that is self describing and “operational” (that is, practical to use natively) is also attractive. I agree that SDF isn’t really picking up any steam lately, and it seems to me that the Open Source folks are rallying around the spatial-in-SQLite work (http://trac.osgeo.org/fdo/wiki/FDORfc16) — see also Jason’s blog about it here: http://www.jasonbirch.com/nodes/2008/12/02/229/sqlite-spatial-files-in-fme-2009-through-the-magic-of-fdo/ — and on top of this I heard some very very positive things about the performance of this format at a conference as well (formats being the type of topic that I tend to discuss, often by myself, at social gatherings). So we’ll see if this can get some traction.
And of course GML plays a role too, and has many attractive aspects to it, but it is not targeted at being operational, and so doesn’t quite hit the same niche as something like shapefiles does.
Interesting days indeed.
Dale
about 1 year ago
So, throwing one’s toys out the pram really does get attention, thank you all for responding. Corrections first:
1. The DBF format, as documented at http://www.clicketyclick.dk/databases/xbase/format/dbase_spec.html is where I got the 65,536 figure. This was based on my understanding that:
– 1 DBF record is required for each feature.
– The DBF of a shapefile is dBASE III
– A shapefile must have exactly 1 DBF.
I need to check these assumptions, perhaps by looking at the shapelib code.
2.
3. The maximum size of the DBF is 2 GB, according to the XBase FAQ, but as noted by Dale, Shapefiles are defined by real world usage.
4. DBF has a field type C, denoting character data. Whilst ArcGIS could write UTF 8 or even UCS to this field, there is no method to record the chosen encoding (unlike GML). Prior knowledge, heuristics or some external method would be required, to choose the correct encoding when reading.
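One way to check the assumptions in point 1 is to look at the field widths in the DBF header itself. The sketch below decodes the first 12 bytes of a dBASE III style header as described in the Xbase format notes linked above; the sample bytes are fabricated for the example rather than taken from a real shapefile, and the function name is hypothetical.

```python
import struct


def read_dbf_header(raw):
    """Decode the leading fields of a dBASE III style DBF header."""
    # version, last update (YY, MM, DD), record count (uint32),
    # header length (uint16), record length (uint16), all little-endian.
    _version, _yy, _mm, _dd, n_records, header_len, record_len = struct.unpack(
        "<4BIHH", raw[:12]
    )
    return {
        "records": n_records,
        "record_bytes": record_len,
        # 32-byte prefix, one 32-byte descriptor per field, 1 terminator byte.
        "fields": (header_len - 33) // 32,
    }


# Fabricated header: 1,000,000 records, 2 fields, 265-byte records.
sample = struct.pack("<4BIHH", 3, 109, 3, 27, 1_000_000, 33 + 32 * 2, 265) + bytes(20)
print(read_dbf_header(sample))
```

The record count occupies 32 bits while the record length occupies 16, which fits a 65536-byte cap on record size rather than a cap of 65,536 records.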
Dale, I read Jason’s postings on SQLite+FDO/OGR. I need to get straight in my head how many variants of spatial-sqlite there are. Autodesk SDF and Spatialite are distinct, is Jason’s format a third?
Alex
about 1 year ago
Hi,
It is the case that shapefiles will support the storage of an encoding along with the data — this somewhat old technical article from ESRI http://support.esri.com/index.cfm?fa=knowledgebase.techArticles.articleShow&d=21106 contains the details of how this is done (we’ve supported this for a couple of years now in FME). There was a facility in DBF files to store a codepage, and this is supplemented with an external .cpg file to handle encodings that the DBF spec writers didn’t foresee.
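A reader might use the .cpg sidecar roughly as follows. This is a sketch under the assumption that the file holds either a codec name such as "UTF-8" or a bare Windows code page number such as "1252"; real files vary, and the function name and the fallback encoding are choices made for the example.

```python
import codecs


def encoding_from_cpg(cpg_text, default="latin-1"):
    """Turn the text of a .cpg sidecar file into a Python codec name."""
    name = cpg_text.strip()
    if not name:
        return default
    # Bare code page numbers such as "1252" map to Python's "cp1252" codecs.
    candidates = ["cp" + name, name] if name.isdigit() else [name]
    for candidate in candidates:
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


# Decode a character field using whatever the sidecar declares.
raw_field = b"Z\xc3\xbcrich"
print(raw_field.decode(encoding_from_cpg("UTF-8\n")))
```

Falling back to a default when the declaration is missing or unrecognised is the heuristic step: without the sidecar or the header codepage, the reader is guessing.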
With respect to the Spatialite/FDO&OGR/SDF way of doing spatial inside of SQLite, there are at least 3 competing methods out there that have fragmented this universe. I’ve discussed it with Jason Birch some time ago, basically saying that until one of the SQLite spatial methods becomes a clear winner we were a bit averse to implementing something in FME. The SDF is established already (and we’ve supported it for a long time) but comes in at a level below the standard SQLite SQL parsing. From what I understand, it uses the lower level SQLite APIs to manage paging and file structure, but doesn’t produce something easily used by the SQLite SQL parser. The two new ones to watch (from what I know, Jason’s comments would be welcomed) are the “Spatialite” and “FDO/OGR” extensions to SQLite for spatial. And what I’ve heard from a few sources is that the FDO/OGR method really ‘cooks with gas’. If this latter method becomes part of the standard FDO distribution, I suspect Safe will support it in FME.
Dale
about 1 year ago
Very interesting post Alex, and it ties in nicely to an article entitled “The Difficulty of Data Sharing” in the Spring edition of our “Field Views” newsletter which I have included below. We are also interested to see what ESRI do with the File GeoDatabase format and the potential to release an ArcGIS-free API or specification…
====
The Difficulty of Data Sharing
We frequently facilitate data sharing between the different IT systems and departments of our client organisations. It often seems that no single format or transfer mechanism has adequately solved the problem of geospatial data sharing such that it has become the prevailing way to accomplish this task.
Understandably, the data format we most often encounter is the ESRI Shapefile. Despite some limitations, the Shapefile is well established, time-tested and can be used with all major GIS products.
However, the geospatial data we deal with today is becoming larger, more complex and is often required to be available in completely new ways. In time, there may be an opportunity for new data sharing mechanisms to replace the ESRI Shapefile as the most commonly used data format.
The Open Geospatial Consortium (OGC) specify a number of standards which ease interoperability issues when sharing data and provide novel ways of publishing and consuming data via the internet.
Other formats like KML (see Field Views I) have the weight of Google behind them and have rapidly come to be an accepted part of the geospatial community.
Geospatial support is also a hot topic for most major database vendors. Allying modern database systems with traditional geospatial tools represents a powerful combination. For example, by allowing many users to query and even edit the same data concurrently.
Ultimately, if Shapefile loses its top spot in the geospatial data sharing field, predicting its successor is difficult. It may not be the best overall solution which prevails but an underdog which takes a larger market share. Remember Betamax vs. VHS!
====
about 1 year ago
Two more limitations of shapefiles: no curves, no text. And they seem to get corrupted easily, and are a bear to fix. Oh, and it would be nice to store suggested symbology in the file, if you wish.
I wouldn’t vote for file geodatabase as a de facto sharing standard. Too much overhead for passing a single layer back and forth. If there were a way of encapsulating a single geodatabase-ese feature class, that would be great. ESRI has a geodatabase XML encoding that could do that, but if you’re going to go that route, may as well do GML or KML. What you want is one that can be easily shared (i.e. single file) *and* efficiently used internally without importing (i.e., binary), and I haven’t seen the ideal candidate for that yet.
One other important use of this ideal format is archival. I want to store my data in a form that I can come back and look at it (with the latest software) 15 years from now. Most grass-roots open source solutions won’t meet that standard.
What about a binary encoding of GML, with a built-in index?
about 1 year ago
Another thought would be the WKB format. Not the greatest in general, but can be used in a variety of structures, such as an MS Access database. We have been using it for a while in MS SQL Server 2K5 and exporting via SQL to MS Access for delivery to clients. Works great;-)
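For readers who have not met WKB (Well-Known Binary), a 2D point can be encoded and decoded with nothing more than Python's struct module. The sketch handles only little-endian points and the function names are illustrative; a real exchange would also need lines, polygons and a note of the coordinate system.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB."""
    # 1 = little-endian byte order, 1 = geometry type Point, then two doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(blob):
    """Decode a little-endian WKB point back to an (x, y) tuple."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", blob)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian points")
    return x, y


blob = point_to_wkb(-1.5, 53.25)
print(len(blob), wkb_to_point(blob))
```

Because the result is an opaque run of bytes, it can sit in a binary column of almost any database, which is what makes it usable in something like MS Access.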
about 1 year ago
Hi all!
I agree in the need for a common standardized GIS format but I think you need to sort some things out first.
Are we talking about a format for transferring data, or a common format for storage? The postings above are not really clear on that point. The transfer format can be as simple as you wish, if you can afford to lose system specific properties on the way. Shape files are a good example of the first kind, as the format lacks mapstyles, coordinate info and topology. In fact, that cuts out a big portion of the GIS identity of the data but it’s beside the point.
Good contenders for a common storage format are of course Oracle Spatial and the ESRI flavors of geodatabases. But neither are suitable for common use, as they require very costly licensing and software.
And I seriously doubt that a database is a good container for transferring data between GISes, operating systems, language regions and so on. In my experience, databases are so dependent on their environment that transferring them becomes all but impossible.
But for the sake of argument, let’s assume that it is a file based transfer format we are talking about.
In my mind, a few specifications would be in place first. Apart from the obvious requirements, the format should:
1 be entirely open source and not proprietary. This is a defining requirement.
2 store all kinds of geometries in the same file and not separate information on different kinds of geometry.
3 fully store projection/coordinate system information with the dataset.
4 be able to store all kinds of data types as attributes, and not limited to a number of characters or bytes.
5 store map style attributes such as color, line styles, point symbols, text attributes.
6 store object topology.
7 be designed for true 3D representation (that is, store coordinate triplets even if Z is 0 or level)
There is probably more to think of but this is my top 7 list.
Regards, Mats.E
about 1 year ago
Why not GINA??
An interesting but confused discussion. There is a lot more to an exchange format than features and attributes these days. Apart from topology, there are lots of associated relations such as domains and other extensions that are particular to each software package.
Even shapefiles have 3D extensions that most applications cannot read, and ArcView 3 cannot edit without extensions. There is nothing stopping extensions to the shapefile format to include curves and text but nobody has done it. Why not? Look at the spec, there are plenty of opportunities for geometry, just as the dbf tables have been extended from the original format.
I cannot see any chance of the file geodatabase becoming an exchange format unless everyone has a copy of ArcGIS, including all the other vendors! I’m not holding my breath waiting. I don’t have any special knowledge, other than using geodatabases. I think that it’s likely that ESRI staff found that it would not work as an exchange format, it would be too difficult to upgrade for future releases and the amount of code required to build a reliable database is non-trivial. I don’t think that it’s an anti-competitive move. Surely ESRI want another open shapefile success as much as anyone. It’s a pain exchanging any databased GIS data between any vendor.
But I have seen an open exchange format that satisfies all of Mats Elfstrom’s wishlist. It was invented in the early 90s, was called GINA and is now owned by Autocad. How strange that it is not even mentioned. What has happened to it? The NZ land database was first built using GeoVision so we all had a lot of practice in NZ using it. Here is a link to a historic description. Maybe it has been extended since the 90s.
http://www.linz.govt.nz/docs/dcdb/append3.html
about 1 year ago
Great thread! Here’s what the ArcGIS development team is thinking.
An API to the File Geodatabase is something that is needed. At ESRI, we are working on a low-level (non ArcObjects-based) API for the file geodatabase. This will be delivered as a part of the next major release of ArcGIS, version 9.4.
While the shapefile continues to be a great standard for simple data sets, it does have limitations which restrict its applicability. These include: size limitations due to 32-bit file offset pointers, the dBase representation of feature properties, and limits for modeling complex feature relationships and integrity rules. The geodatabase information model and the file geodatabase were created to address these issues. In addition to eliminating the scaling issues of shapefiles, a file geodatabase can carry more semantic information. A geodatabase acts as a container for a complete data model, containing multiple entity types (feature and object classes). A file geodatabase can model the full ArcGIS information model: entity relationships, 3D objects, annotation, dimensions, topology, networks, raster data, point clouds, etc. And, like the shapefile, the file geodatabase is designed for direct use – software can operate directly against a geodatabase instance without data loading or semantic translation. A direct use format introduces concepts like query processing, index maintenance, and integrity management. The implications of the direct use requirement (as opposed to a semantic ETL format like GML) are significant.
These factors make the geodatabase more powerful for representing geographic information than the shapefile, but also make the design of a low-level API more complex. In using a geodatabase, a developer works with the geodatabase schema (catalog of datasets, feature classes, relationships, etc.) as well as the geodatabase I/O model (i.e. getting a cursor over query results from a particular feature class). Working with a geodatabase containing a single feature class (a.k.a. a recordset) should be quite simple, about the same as a shapefile – however many real-world information models have more sophisticated schemas involving relationships, annotation, topology, etc. (and that’s before the ontologists have gotten to them).
The geodatabase schema will be exposed using Geodatabase XML to represent the geoinformation schema. (A description of the Geodatabase XML is available here: http://resources.esri.com/help/9.3/ArcGISEngine/dotnet/08017b04-166c-49c7-9334-d5f23007a8b9.htm). We will have a light-weight query API for interacting with the rows of the individual datasets.
We recognize the interest in an API to open the File geodatabase. Because the geodatabase offers a more sophisticated and complete data model over the shapefile, it’s taking us more time to develop a simple data access API which is independent of ArcObjects. But it is coming. We will be publishing more information on what we are planning with this API later this year and we welcome your input on this design as it becomes available.
about 1 year ago
Actually, DBF field name length is limited to 10 chars, not 8; the number of Shapefile features or DBF records is not limited to 65535; and Shapefile does support Unicode (UTF-8) (also OGR provides this).
The real drawback of Shapefile due to DBF heritage is the very limited SQL subset it supports and only a few datatypes (real, int, varchar, date). Other than DBF heritage, Shapefile lacks topology support, which makes it a highly error-prone data container.
Does Spatialite provide any support for topology?
What do you mean by “Records are limited to 1000 bytes or 32 fields”?
Please correct errors. People start referring to your page here and there while you provide wrong information.