GeoParquet is a file format for storing geospatial vector data. It builds on Parquet, a columnar storage format optimized for big data processing frameworks such as Apache Spark, Hadoop, and Dask, and extends it to store and query geospatial data efficiently. Building on an established format lets the geospatial community benefit from the investment already made in Parquet outside the geospatial industry and use an existing, widely supported file format for geospatial data.
GeoParquet was initiated in 2021 as a collaboration among organizations and individuals in the geospatial and data analytics communities, including Carto and Microsoft. Its development was driven by the need for a standardized, efficient, and interoperable way to store and process geospatial data. In particular, users need to work with large datasets hosted on cloud providers, where workloads can be scaled up efficiently.
Organizations enabling GeoParquet adoption and integration within existing platforms
GeoParquet is currently used by companies and organizations such as the Overture Maps Foundation and Carto. Carto integrated GeoParquet into its platform to make importing and exporting geospatial data easier. Wherobots, a geospatial data company, uses GeoParquet together with Apache Sedona to manage and analyze large-scale geospatial datasets efficiently. More implementation examples are listed here.
Development Seed is an engineering and product company that helps other companies adopt cloud-native geospatial file formats and process data stored in them. The company sees the cloud as the future for working with large datasets, and a format such as GeoParquet makes it possible to work with such datasets there. Kyle Barron, a Cloud Engineer at Development Seed, is working on a website for accessing and downloading Overture Maps data. The site reads Parquet files directly, with no server involved.
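One general mechanism that lets a client read Parquet from plain file storage is the format's layout: a Parquet file ends with its footer metadata, followed by a 4-byte little-endian footer length and the magic bytes "PAR1". A client can fetch just the last 8 bytes (for example with an HTTP range request), work out where the footer is, and then request only the parts of the file it needs. The sketch below illustrates that first step on an in-memory byte string. It is not taken from the website mentioned above, whose implementation is not described here, and the function name `footer_byte_range` is hypothetical.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_byte_range(file_size: int, tail: bytes) -> tuple:
    """Return (start, end) offsets of the footer metadata, end exclusive.

    `tail` must be the last 8 bytes of the file: a 4-byte little-endian
    footer length followed by the 4-byte magic. Hypothetical helper.
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8          # footer stops where the length field begins
    start = end - footer_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length is larger than the file allows")
    return start, end


# Fake file for illustration only: magic, 100 data bytes, a 20-byte "footer",
# the footer length, and the closing magic. Real footers are Thrift-encoded.
fake_file = (
    PARQUET_MAGIC
    + b"x" * 100
    + b"m" * 20
    + struct.pack("<I", 20)
    + PARQUET_MAGIC
)

start, end = footer_byte_range(len(fake_file), fake_file[-8:])
footer = fake_file[start:end]
```

In a browser or script, the two slices of `fake_file` would be replaced by two range requests against the file's URL, so no server-side query engine is required.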
How GeoParquet manages spatial indexing and compression
On a recent geospatial podcast, Barron discussed his involvement in developing GeoParquet and some of the technical details of how the file format manages metadata such as spatial indexing and compression. He also explained how GeoParquet differs from FlatGeobuf, another new cloud-native file format for working with vector data.
For example, GeoParquet can scale more effectively to larger data sizes than FlatGeobuf because each index entry covers a group of geometries rather than a single geometry. This grouping also allows for better compression, which makes it faster to move data between systems over a network.
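The idea of indexing groups of geometries can be sketched in a few lines. The example below is a simplified illustration, not GeoParquet's actual reader logic: the `RowGroup` class is a hypothetical stand-in for a Parquet row group, it holds only points, and its bounding box is computed on the fly rather than read from file metadata. A query first compares its window against one bounding box per group and only scans the groups that overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group of point geometries."""
    points: List[Point]

    @property
    def bbox(self) -> BBox:
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: BBox, b: BBox) -> bool:
    """True if two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(groups: List[RowGroup], window: BBox):
    """Return (points inside window, number of groups actually scanned)."""
    hits: List[Point] = []
    scanned = 0
    for group in groups:
        # One comparison per group decides whether its points are read at all.
        if not intersects(group.bbox, window):
            continue
        scanned += 1
        hits.extend(
            p for p in group.points
            if window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]
        )
    return hits, scanned


groups = [
    RowGroup([(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]),
    RowGroup([(10.0, 10.0), (11.0, 12.0)]),
    RowGroup([(20.0, 20.0), (21.0, 19.0), (22.0, 21.0)]),
]
hits, scanned = query(groups, (9.0, 9.0, 12.0, 11.0))
# Only the second group overlaps the window, so scanned == 1
# and hits == [(10.0, 10.0)].
```

Because the index has one entry per group rather than one per geometry, it stays small as the dataset grows, which is the scaling property described above.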
Parquet is complemented by Arrow, a data format designed for high-speed in-memory data processing. Arrow has a geospatial counterpart called GeoArrow for storing vector geospatial data and associated attributes. With Arrow support added in Python, reading spatial vector data in formats such as GeoParquet and FlatGeobuf into the GeoPandas library has become much faster.
GeoParquet 1.1 revision
By the end of June 2024, the GeoParquet 1.1 revision was published on GitHub, adding support for spatial partitioning and native GeoArrow geometries. More information about this release is available in this blog post from Chris Holmes and on the GeoParquet GitHub page.
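To make the two additions more concrete, here is a rough sketch of what the file-level "geo" metadata of a 1.1 file might contain. The field names follow the published 1.1 specification as best understood here and should be checked against the specification itself. The column names `geometry` and `bbox` and the helper `uses_native_encoding` are assumptions made for illustration.

```python
import json

# Sketch of GeoParquet 1.1 "geo" metadata; verify field names against the spec.
geo_metadata = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            # A native GeoArrow encoding; "WKB" remains the other option.
            "encoding": "point",
            "geometry_types": ["Point"],
            # Points readers at a per-row bounding box column (assumed to be
            # named "bbox") that can be used to skip data during spatial filters.
            "covering": {
                "bbox": {
                    "xmin": ["bbox", "xmin"],
                    "ymin": ["bbox", "ymin"],
                    "xmax": ["bbox", "xmax"],
                    "ymax": ["bbox", "ymax"],
                }
            },
        }
    },
}


def uses_native_encoding(meta: dict, column: str) -> bool:
    """Hypothetical helper: True if the column is not stored as WKB."""
    return meta["columns"][column]["encoding"] != "WKB"


# The metadata is stored as a JSON string in the Parquet file's key-value metadata.
serialized = json.dumps(geo_metadata)
restored = json.loads(serialized)
native = uses_native_encoding(restored, restored["primary_column"])
```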
Resource: Mapscaping podcast episode “GeoParquet for beginners”, May 23 2024.