4 GIS data models and file formats

4.1 Data models

GIS data typically come in one of two data models: vector or raster.

4.1.1 Vector data

The three basic vector data types are points, lines (also sometimes referred to as polylines or linestrings) and polygons. While they are treated as different data types, you can also consider them to be a nested hierarchy. For example, to make a line you need two or more points, while a polygon requires three or more lines.

Figure 4.1: The hierarchical construction of vector data types.


From this we can observe the different properties of the data types:

  • a point is a location in space defined by a set of coordinates based on a coordinate reference system (more about these later)
  • a line is two or more points with straight lines connecting them, where each line has a length
  • a polygon is a set of points connected by lines that form a closed shape, which has an area
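The properties listed above can be sketched in base R with nothing more than coordinate matrices. This is only an illustration: the coordinates are made up, and the helper names `line_length()` and `ring_area()` are hypothetical, not functions from any package.

```r
# A point: one pair of coordinates (x, y).
pt <- c(x = 0, y = 0)

# A line: an ordered set of two or more points (one row per vertex).
ln <- rbind(c(0, 0), c(3, 0), c(3, 4))

# A polygon: a closed ring, so the first vertex is repeated at the end.
pg <- rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 0))

# Length of a line: sum of the straight segment lengths between vertices.
line_length <- function(xy) {
  d <- diff(xy)              # differences between successive vertices
  sum(sqrt(rowSums(d^2)))
}

# Area of a simple (non-self-intersecting) ring via the shoelace formula.
ring_area <- function(xy) {
  n <- nrow(xy)
  x <- xy[, 1]
  y <- xy[, 2]
  abs(sum(x[-n] * y[-1] - x[-1] * y[-n])) / 2
}

line_length(ln)   # 7: segments of length 3 and 4
line_length(pg)   # 12: the perimeter of the triangle
ring_area(pg)     # 6: a right-angled triangle with sides 3 and 4
```

Note how the polygon reuses the vertices of the line and only adds the closing vertex, which mirrors the nested hierarchy described above.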

Note that these “data types” are also commonly called feature classes, geometric primitives or geometries. Later we’ll see that you get more complicated “types”, but these are generally combinations of the above (multipoint, multilinestring, multipolygon, geometry collection, etc.) and are largely just different data classes designed to help with handling data, rather than unique geometries.


Vector data models are obviously the best way to represent points and lines. Polygons are usually the best way to represent discrete (categorical) data, especially where they may have complex boundaries.

For example:

Figure 4.2: Vector (polygon) representation of discrete data; the vegetation types of the Cape Peninsula.


Vector data models are less well suited to representing continuous data (e.g. elevation, sea surface temperature, etc.). See further down.


4.1.2 Raster data

Raster data are essentially data stored in a regular grid of pixels (or cells). Digital images like jpeg or png files are essentially rasters without spatial information. The value of each pixel is a number representing a measured value (e.g. continuous data such as sea surface temperature) or a category (e.g. discrete data such as land cover class). All pixels have a value, even if the value is “No Data”.
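As a rough sketch of this idea, a raster can be mimicked in base R as a matrix of values plus a little spatial information. The elevations, origin and cell size below are all made-up assumptions, and `cell_centre()` is a hypothetical helper; real raster packages store the same kind of information in a more complete way.

```r
# Hypothetical 3 x 4 grid of elevations in metres; NA stands for "No Data".
elev <- matrix(
  c(NA, 120, 135, 150,
    90, 110, 140, 160,
    80, 100, 130, 155),
  nrow = 3, byrow = TRUE
)

# The spatial information a plain image lacks: where the grid sits
# (its top-left corner) and how big each cell is. Assumed values.
x_min <- 0    # x coordinate of the left edge
y_max <- 30   # y coordinate of the top edge
cell  <- 10   # cell size in map units

# Coordinates of the centre of the cell in row i, column j.
cell_centre <- function(i, j) {
  c(x = x_min + (j - 0.5) * cell,
    y = y_max - (i - 0.5) * cell)
}

cell_centre(1, 1)          # centre of the top-left cell
elev[1, 1]                 # NA: the cell exists but holds "No Data"
mean(elev, na.rm = TRUE)   # summaries have to skip the No Data cells
```

Without `x_min`, `y_max` and `cell`, the matrix is just an image: the values are there, but nothing says where on the ground they belong.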


Figure 4.3: Raster representation of continuous data; a digital elevation model of the Cape Peninsula.

Rasters are particularly useful for representing continuous data. If this were a vector plot of the raw data, each pixel would have to be its own polygon and the legend would need a separate entry for each unique value: more than 60 000 entries!

That said, you can quite effectively represent continuous values visually with a vector data model if you bin the continuous data (from the raster) into classes, such as one can do with a filled contour plot (see below). This is not ideal for analyses though, as the binning results in data loss.
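The data loss from binning is easy to see in a small base R sketch. The elevation values and the 100 m class breaks below are made up for illustration.

```r
# Hypothetical elevations (m) sampled from a continuous surface.
elev <- c(12, 87, 140, 199, 250, 301)

# Bin into 100 m classes, as a filled contour plot would.
breaks  <- seq(0, 400, by = 100)
classes <- cut(elev, breaks = breaks, right = FALSE)

table(classes)             # how many values fall in each class
length(unique(elev))       # number of distinct values before binning
length(unique(classes))    # number of distinct classes after binning
```

After binning, 12 m and 87 m are indistinguishable, which is fine for a map legend but not for an analysis that needs the original values.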

  • You’ll find that you often need to convert data between vector and raster models for various reasons, and that this usually means some tough decisions need to be made about what is acceptable data loss. We’ll cover that later.


Figure 4.4: Vector representation of continuous data; a filled contour plot of a digital elevation model of the Cape Peninsula using 100m contours.


Conversely, rasters are usually not that good at representing categorical data. Note that most raster file formats (and GIS software) can only store numeric data, so the plot below (Figure 4.5) misleadingly represents the vegetation types as continuous data. You can label and represent categorical data in rasters in R, but this is usually more effort than it’s worth and is almost always less effective than using a vector format. A common exception is land use and land cover (LULC) maps, where remotely sensed satellite imagery (raster data) are classified into predefined classes (e.g. agriculture, rock, grassland, etc.) based on various criteria or algorithms. Even then, these are difficult to interpret visually with static maps and are best visualized as interactive maps so you can make sense of them by zooming in and panning around.
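A minimal base R sketch of why numeric storage is awkward for categories: the grid holds integer codes, and the class names live in a separate lookup table. The class names and codes here are hypothetical.

```r
# Hypothetical lookup table: integer code -> class name.
lookup <- data.frame(
  code  = 1:3,
  class = c("Fynbos", "Forest", "Beach")
)

# A 2 x 3 "raster" that can only hold the numeric codes.
codes <- matrix(c(1, 1, 2,
                  1, 3, 3), nrow = 2, byrow = TRUE)

# Treating the codes as continuous data runs, but the result is meaningless.
mean(codes)

# Recover the labels by matching each code against the lookup table.
labelled <- matrix(lookup$class[match(codes, lookup$code)],
                   nrow = nrow(codes))
labelled
table(labelled)   # cell counts per class
```

The extra bookkeeping (codes plus lookup table) is the "effort" referred to above; in a vector layer the class name would simply be an attribute of each polygon.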


Figure 4.5: Raster representation of discrete data; the vegetation types of the Cape Peninsula.


4.2 Attribute data

Attributes are what we know about the objects represented in a layer in addition to their geometry - i.e. each spatial object usually has additional information associated with it. These data are usually stored in an associated Attribute Table.

Here are the first few entries of the attribute table for our Cape Peninsula vegetation vector layer:


|    | AREA_HCTR | PRMT_MTR   | veg type         | Subtype | Community        | geometry                      |
|----|-----------|------------|------------------|---------|------------------|-------------------------------|
| 66 | 6.774255  | 1596.83494 | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-46636.54 -380320… |
| 67 | 14.151168 | 3886.68578 | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-47220.45 -380302… |
| 68 | 8.575597  | 2154.00714 | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-48967.57 -380253… |
| 69 | 0.000001  | 23.25575   | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-49355.61 -380223… |
| 70 | 5.333203  | 3589.09436 | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-50008.26 -380132… |
| 71 | 24.448116 | 7378.70451 | Beach - FalseBay | BEACH   | Need to Find Out | POLYGON ((-52927.7 -3800156… |


Note that vector data generally have attribute tables, but they are rare for raster layers, because most raster file formats can store just one attribute per cell (e.g. elevation) and can’t have associated attribute tables.

A handy feature of most GIS systems is that they can treat attribute tables like relational database table structures. Additional information can be joined onto your spatial data by joining two tables with a common key field, as one does when joining two tables of non-spatial data. In GIS, this is called an “Attribute Join”, because you have joined the tables by attribute and haven’t used spatial information (also sometimes called a “non-spatial join”). We’ll learn about “spatial joins” later…
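An attribute join can be sketched with base R’s `merge()` on two ordinary data frames. Both tables below are hypothetical: `veg` stands in for the attribute table of a vector layer (geometry omitted) and `info` for a non-spatial table that shares the key field `Subtype`.

```r
# Hypothetical attribute table of a vector layer (geometry column omitted).
veg <- data.frame(
  id      = c(66, 67, 68),
  Subtype = c("BEACH", "BEACH", "DUNE")
)

# Hypothetical non-spatial table with extra information per subtype.
info <- data.frame(
  Subtype   = c("BEACH", "DUNE"),
  substrate = c("marine sand", "wind-blown sand")
)

# Join on the common key field; all.x = TRUE keeps every row of veg
# even if it has no match in info.
joined <- merge(veg, info, by = "Subtype", all.x = TRUE)
joined
```

No coordinates are involved at any point, which is why this kind of join is called non-spatial.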


WARNING! The values in attribute tables are typically static and are not recalculated every time you alter the feature of interest. For example, you can crop the Cape Peninsula vegetation layer, but the values in the AREA_HCTR (area) and PRMT_MTR (perimeter) columns of the attribute table will not change, even if the polygons in question are now smaller!
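The toy example below illustrates the problem with a single rectangle in base R. The feature, its coordinates and the helper `rect_area()` are hypothetical; the point is only that the stored attribute does not follow the geometry unless you recompute it.

```r
# A hypothetical rectangular feature described by its bounding coordinates.
feature <- list(xmin = 0, xmax = 4, ymin = 0, ymax = 2)

# Helper (hypothetical): area of such a rectangle.
rect_area <- function(f) (f$xmax - f$xmin) * (f$ymax - f$ymin)

# Attribute table with the area stored as a plain number.
attrs <- data.frame(id = 1, AREA = rect_area(feature))
attrs$AREA                        # 8

# "Crop" the feature so that it extends no further than x = 1.
feature$xmax <- min(feature$xmax, 1)

attrs$AREA                        # still 8: the stored value is now stale
attrs$AREA <- rect_area(feature)  # recompute explicitly after editing
attrs$AREA                        # 2
```

With real vector layers in sf, the equivalent habit is to recompute such columns after cropping, for example from `sf::st_area()`, rather than trusting the values that came with the file.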


4.3 File formats

Linked to data models and attributes are file formats. Generally, there are separate file formats for vector vs raster data. Usually, we even have separate files for the different types of vectors (points, lines, polygons, etc.), but this is changing as new “database” formats evolve.

There is a huge variety of GIS file formats, which have proliferated as different software packages have developed their own sets of “native” formats. Each of these has different properties in terms of the data it stores, whether it can include attribute data, file size and compression, and of course how it actually stores (and retrieves) the data. Many of these, like the ESRI formats, are proprietary (i.e. not open source).

If you’ve done any GIS before, you’ll be familiar with ESRI shapefiles, which usually include a group of 3 or more files with the same name, but a different file extension. Each file stores different information. The most common ones are:

  • .shp = the main feature geometry
  • .shx = an index file, used for searching etc
  • .dbf = stores the attribute information
  • .prj = stores the coordinate reference system
  • etc = there are many other optional files that may be present depending on the data stored
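Because a shapefile is really a group of files, a quick check that the expected components sit next to the .shp file can save some confusion. The sketch below uses base R only; the layer name `veg` and the empty placeholder files are hypothetical stand-ins, not valid shapefile parts.

```r
# Hypothetical layer in a fresh temporary folder.
dir <- tempfile("shp_demo_")
dir.create(dir)
shp <- file.path(dir, "veg.shp")

# Create empty placeholders for three components (no .prj, on purpose).
base <- tools::file_path_sans_ext(shp)
invisible(file.create(paste0(base, c(".shp", ".shx", ".dbf"))))

# Check which of the common components are present.
exts    <- c("shp", "shx", "dbf", "prj")
present <- file.exists(paste0(base, ".", exts))
names(present) <- exts
present
```

Here the missing .prj would mean the layer arrives without a coordinate reference system, which is a common cause of layers not lining up.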

Shapefiles are by far the most common format for vector data. For raster data, the most common formats are probably GeoTIFF (.tif) and ASCII grids (.asc).

You can list most of the file formats supported by R (or rather by the GDAL software that underlies most of R’s spatial data capabilities) by running sf::st_drivers(), which gives output like this:


| name | long_name | write | copy | is_raster | is_vector | vsi |
|------|-----------|-------|------|-----------|-----------|-----|
| ESRIC | Esri Compact Cache | FALSE | FALSE | TRUE | TRUE | TRUE |
| PCIDSK | PCIDSK Database File | TRUE | FALSE | TRUE | TRUE | TRUE |
| netCDF | Network Common Data Format | TRUE | TRUE | TRUE | TRUE | FALSE |
| PDS4 | NASA Planetary Data System 4 | TRUE | TRUE | TRUE | TRUE | TRUE |
| VICAR | MIPL VICAR file | TRUE | TRUE | TRUE | TRUE | TRUE |
| JP2OpenJPEG | JPEG-2000 driver based on OpenJPEG library | FALSE | TRUE | TRUE | TRUE | TRUE |
| PDF | Geospatial PDF | TRUE | TRUE | TRUE | TRUE | FALSE |
| MBTiles | MBTiles | TRUE | TRUE | TRUE | TRUE | TRUE |
| BAG | Bathymetry Attributed Grid | TRUE | TRUE | TRUE | TRUE | TRUE |
| EEDA | Earth Engine Data API | FALSE | FALSE | FALSE | TRUE | FALSE |
| OGCAPI | OGCAPI | FALSE | FALSE | TRUE | TRUE | TRUE |
| ESRI Shapefile | ESRI Shapefile | TRUE | FALSE | FALSE | TRUE | TRUE |
| MapInfo File | MapInfo File | TRUE | FALSE | FALSE | TRUE | TRUE |
| UK .NTF | UK .NTF | FALSE | FALSE | FALSE | TRUE | TRUE |
| LVBAG | Kadaster LV BAG Extract 2.0 | FALSE | FALSE | FALSE | TRUE | TRUE |
| OGR_SDTS | SDTS | FALSE | FALSE | FALSE | TRUE | TRUE |
| S57 | IHO S-57 (ENC) | TRUE | FALSE | FALSE | TRUE | TRUE |
| DGN | Microstation DGN | TRUE | FALSE | FALSE | TRUE | TRUE |
| OGR_VRT | VRT - Virtual Datasource | FALSE | FALSE | FALSE | TRUE | TRUE |
| Memory | Memory | TRUE | FALSE | FALSE | TRUE | FALSE |
| CSV | Comma Separated Value (.csv) | TRUE | FALSE | FALSE | TRUE | TRUE |
| GML | Geography Markup Language (GML) | TRUE | FALSE | FALSE | TRUE | TRUE |
| GPX | GPX | TRUE | FALSE | FALSE | TRUE | TRUE |
| KML | Keyhole Markup Language (KML) | TRUE | FALSE | FALSE | TRUE | TRUE |
| GeoJSON | GeoJSON | TRUE | FALSE | FALSE | TRUE | TRUE |
| GeoJSONSeq | GeoJSON Sequence | TRUE | FALSE | FALSE | TRUE | TRUE |
| ESRIJSON | ESRIJSON | FALSE | FALSE | FALSE | TRUE | TRUE |
| TopoJSON | TopoJSON | FALSE | FALSE | FALSE | TRUE | TRUE |
| OGR_GMT | GMT ASCII Vectors (.gmt) | TRUE | FALSE | FALSE | TRUE | TRUE |
| GPKG | GeoPackage | TRUE | TRUE | TRUE | TRUE | TRUE |
| SQLite | SQLite / Spatialite | TRUE | FALSE | FALSE | TRUE | TRUE |
| ODBC | | FALSE | FALSE | FALSE | TRUE | FALSE |
| WAsP | WAsP .map format | TRUE | FALSE | FALSE | TRUE | TRUE |
| PGeo | ESRI Personal GeoDatabase | FALSE | FALSE | FALSE | TRUE | FALSE |
| MSSQLSpatial | Microsoft SQL Server Spatial Database | TRUE | FALSE | FALSE | TRUE | FALSE |
| PostgreSQL | PostgreSQL/PostGIS | TRUE | FALSE | FALSE | TRUE | FALSE |
| OpenFileGDB | ESRI FileGDB | FALSE | FALSE | FALSE | TRUE | TRUE |
| DXF | AutoCAD DXF | TRUE | FALSE | FALSE | TRUE | TRUE |
| CAD | AutoCAD Driver | FALSE | FALSE | TRUE | TRUE | TRUE |
| FlatGeobuf | FlatGeobuf | TRUE | FALSE | FALSE | TRUE | TRUE |
| Geoconcept | Geoconcept | TRUE | FALSE | FALSE | TRUE | TRUE |
| GeoRSS | GeoRSS | TRUE | FALSE | FALSE | TRUE | TRUE |
| VFK | Czech Cadastral Exchange Data Format | FALSE | FALSE | FALSE | TRUE | FALSE |
| PGDUMP | PostgreSQL SQL dump | TRUE | FALSE | FALSE | TRUE | TRUE |
| OSM | OpenStreetMap XML and PBF | FALSE | FALSE | FALSE | TRUE | TRUE |
| GPSBabel | GPSBabel | TRUE | FALSE | FALSE | TRUE | FALSE |
| OGR_PDS | Planetary Data Systems TABLE | FALSE | FALSE | FALSE | TRUE | TRUE |
| WFS | OGC WFS (Web Feature Service) | FALSE | FALSE | FALSE | TRUE | TRUE |
| OAPIF | OGC API - Features | FALSE | FALSE | FALSE | TRUE | FALSE |
| EDIGEO | French EDIGEO exchange format | FALSE | FALSE | FALSE | TRUE | TRUE |
| SVG | Scalable Vector Graphics | FALSE | FALSE | FALSE | TRUE | TRUE |
| Idrisi | Idrisi Vector (.vct) | FALSE | FALSE | FALSE | TRUE | TRUE |
| XLS | MS Excel format | FALSE | FALSE | FALSE | TRUE | FALSE |
| ODS | Open Document/ LibreOffice / OpenOffice Spreadsheet | TRUE | FALSE | FALSE | TRUE | TRUE |
| XLSX | MS Office Open XML spreadsheet | TRUE | FALSE | FALSE | TRUE | TRUE |
| Elasticsearch | Elastic Search | TRUE | FALSE | FALSE | TRUE | FALSE |
| Carto | Carto | TRUE | FALSE | FALSE | TRUE | FALSE |
| AmigoCloud | AmigoCloud | TRUE | FALSE | FALSE | TRUE | FALSE |
| SXF | Storage and eXchange Format | FALSE | FALSE | FALSE | TRUE | TRUE |
| Selafin | Selafin | TRUE | FALSE | FALSE | TRUE | TRUE |
| JML | OpenJUMP JML | TRUE | FALSE | FALSE | TRUE | TRUE |
| PLSCENES | Planet Labs Scenes API | FALSE | FALSE | TRUE | TRUE | FALSE |
| CSW | OGC CSW (Catalog Service for the Web) | FALSE | FALSE | FALSE | TRUE | FALSE |
| VDV | VDV-451/VDV-452/INTREST Data Format | TRUE | FALSE | FALSE | TRUE | TRUE |
| MVT | Mapbox Vector Tiles | TRUE | FALSE | FALSE | TRUE | TRUE |
| NGW | NextGIS Web | TRUE | TRUE | TRUE | TRUE | FALSE |
| MapML | MapML | TRUE | FALSE | FALSE | TRUE | TRUE |
| TIGER | U.S. Census TIGER/Line | FALSE | FALSE | FALSE | TRUE | TRUE |
| AVCBin | Arc/Info Binary Coverage | FALSE | FALSE | FALSE | TRUE | TRUE |
| AVCE00 | Arc/Info E00 (ASCII) Coverage | FALSE | FALSE | FALSE | TRUE | TRUE |
| HTTP | HTTP Fetching Wrapper | FALSE | FALSE | TRUE | TRUE | FALSE |


Note that you can set the what argument of the function to "vector" or "raster" if you want only the drivers specific to each.
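As a small usage sketch, the driver table can be filtered like any other data frame. This assumes the sf package is installed; the column names used (`name`, `long_name`, `write`) are those shown in the output above, and the exact rows will depend on your GDAL build.

```r
# Only runs if the sf package is available.
if (requireNamespace("sf", quietly = TRUE)) {
  vec <- sf::st_drivers(what = "vector")   # vector drivers only
  ras <- sf::st_drivers(what = "raster")   # raster drivers only

  # Vector drivers that can write as well as read in this build.
  writable <- vec[vec$write, c("name", "long_name")]
  print(head(writable))

  # How many drivers of each kind this installation reports.
  print(c(vector = nrow(vec), raster = nrow(ras)))
}
```

Checking the `write` column before choosing an output format avoids discovering at export time that a driver is read-only.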


In short, there are lots! But note that some formats are not supported in R, or are only partially supported. Perhaps the most common ones you’ll encounter are the ESRI geodatabases (.gdb and .mdb), which are designed for ArcGIS and are super efficient (in ArcGIS), but ESRI haven’t released the drivers, so they don’t work (or at least not properly) in most other GIS software. In the list above, for example, the OpenFileGDB and PGeo drivers have write = FALSE, i.e. they are read-only.

Note that there has been a big push to develop a standardized set of open source, efficient and interoperable file formats. Some examples to watch:

  • GeoPackage - SQLite database containers for storing vector, raster and attribute data in a compact and transferable format.
  • GeoJSON - a geographic version of JSON (JavaScript Object Notation) for vector data, very commonly used for web apps etc.
  • Cloud-optimized GeoTIFF - as the name suggests; a GeoTIFF-based format for optimally hosting and allowing querying and downloading of raster data on the cloud…
  • Simple Features - an open, efficient and interoperable standard for vector data.
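To give a feel for how lightweight some of these open formats are, here is a sketch that writes a one-feature GeoJSON file using base R only. The coordinates and the `name` property are illustrative values, and the JSON is assembled by hand purely for demonstration; in practice you would normally let a package write the file for you.

```r
# A single point feature. GeoJSON stores coordinates as [longitude, latitude].
feature <- sprintf(
  '{"type":"Feature","geometry":{"type":"Point","coordinates":[%s,%s]},"properties":{"name":"%s"}}',
  18.42, -33.92, "Example point"
)

# Wrap the feature in a FeatureCollection, the usual top-level object.
geojson <- sprintf('{"type":"FeatureCollection","features":[%s]}', feature)

# Write it to a temporary .geojson file and show the contents.
path <- tempfile(fileext = ".geojson")
writeLines(geojson, path)
cat(readLines(path), sep = "\n")
```

Geometry and attributes (the `properties`) travel together in one human-readable text file, which is a large part of why the format is popular for web apps. A file like this should be readable with `sf::st_read()`, assuming your GDAL build includes the GeoJSON driver listed above.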