GIS and geospatial engineering is software development where the earth's shape is a first-class concern in every computation. A web application stores user coordinates as two floating-point numbers and calls it a day. A geospatial application must know which coordinate reference system those numbers belong to — WGS84 (EPSG:4326) means latitude/longitude on the WGS 84 ellipsoid, NAD83 (EPSG:4269) looks identical (its GRS 1980 ellipsoid differs only marginally) but places coordinates up to two meters away in parts of North America due to tectonic plate motion, and if someone hands you coordinates in a projected CRS like UTM Zone 17N (EPSG:32617) without telling you, your "latitude" of 4,500,000 is actually a northing in meters; treat it as degrees and the point lands nowhere near Ontario. The difference between getting CRS handling right and getting it wrong is the difference between a pipeline that produces correct spatial analysis and one that silently places every feature 100 meters from where it actually is. No unit test catches this unless you already know the expected coordinates in the target CRS, which requires the domain knowledge you were hoping the tool would provide.
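One cheap defensive habit against the unlabeled-CRS bug described above: before treating a coordinate pair as longitude/latitude, check that it is even plausible as degrees. A minimal sketch (the function name and thresholds are ours, not from any library):

```python
def plausible_lonlat(x: float, y: float) -> bool:
    """Return True only if (x, y) could be a (longitude, latitude) pair in degrees.

    Projected coordinates (UTM eastings/northings, Web Mercator meters) are
    orders of magnitude outside the degree range, so this catches the classic
    "unlabeled CRS" mix-up early, before it corrupts downstream analysis.
    """
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0

# A Toronto-area point in WGS84 degrees passes...
assert plausible_lonlat(-79.38, 43.65)
# ...but roughly the same location expressed as UTM Zone 17N meters does not:
assert not plausible_lonlat(630084.0, 4833438.0)
```

This is only a sanity check, not CRS detection: it cannot distinguish WGS84 from NAD83, and a pair like (45.0, 45.0) is valid in either axis order.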
The data scales are unlike anything in conventional application development. A single Landsat 8 scene is roughly 1 GB of multispectral raster data covering 185 km by 180 km at 30-meter resolution. The Sentinel-2 archive grows by about 1.6 TB per day. The OpenStreetMap planet file exceeds 70 GB in PBF format, containing billions of nodes, ways, and relations. Processing these datasets requires understanding Cloud Optimized GeoTIFFs (COG) that allow partial HTTP range reads instead of downloading entire files, STAC (SpatioTemporal Asset Catalog) APIs that index petabytes of earth observation data, spatial indexing structures like R-trees and H3 hexagonal grids that make billion-row spatial queries feasible, and tile-based processing pipelines that break continent-scale analysis into manageable chunks. An AI tool that suggests loading a 50 GB raster into memory with rasterio.open(path).read() instead of using windowed reads has just crashed your analysis server.
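The memory argument above is easy to quantify with back-of-envelope arithmetic. A sketch (band dimensions are the published Sentinel-2 specs; the helper function is ours):

```python
def full_read_bytes(width: int, height: int, bands: int, dtype_bytes: int) -> int:
    """Bytes needed to hold an entire raster read in memory as one array."""
    return width * height * bands * dtype_bytes

# One Sentinel-2 10 m band: 10980 x 10980 uint16 pixels -> ~241 MB
one_band = full_read_bytes(10980, 10980, 1, 2)
assert 230_000_000 < one_band < 250_000_000

# A 256x256 window of the same band is ~131 KB: the case for windowed reads
window = full_read_bytes(256, 256, 1, 2)
assert window == 131_072
```

Multiply the full-band figure by 13 bands and a few hundred tiles and the "just read it all" approach exhausts any analysis server; windowed reads keep the working set constant regardless of scene count.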
The toolchain is uniquely broad: you write SQL with PostGIS spatial extensions one moment, Python with GeoPandas and Shapely the next, JavaScript with Leaflet or MapLibre GL for visualization, and shell scripts orchestrating GDAL command-line tools for format conversion and reprojection. You need to understand OGC standards (Simple Features, WMS, WFS, WMTS), GeoJSON RFC 7946 (which mandates WGS84 and right-hand-rule polygon winding), and the DE-9IM matrix that describes all possible topological relationships between two geometries. You deal with floating-point precision issues that manifest as self-intersecting polygons, slivers from overlay operations, and topology exceptions that GEOS throws when geometric predicates encounter numerical degeneracies. This guide evaluates every major AI coding tool through the lens of what GIS engineers actually build: spatial databases, vector analysis pipelines, raster processing workflows, coordinate transformations, web mapping applications, remote sensing analyses, and spatial ETL systems.
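The right-hand-rule winding that RFC 7946 mandates, mentioned above, is mechanical to verify: the shoelace formula gives a signed ring area that is positive for counter-clockwise rings. A pure-Python sketch (helper names are ours; production code would use `shapely` or a GeoJSON linter):

```python
def signed_area(ring: list[tuple[float, float]]) -> float:
    """Shoelace formula: positive for counter-clockwise (CCW) rings,
    negative for clockwise. Ring must be closed (first point == last)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += (x1 * y2) - (x2 * y1)
    return total / 2.0

def is_ccw(ring: list[tuple[float, float]]) -> bool:
    return signed_area(ring) > 0

# A CCW unit square satisfies the RFC 7946 exterior-ring rule...
ccw_square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
assert is_ccw(ccw_square)
# ...its reverse (clockwise) would need rewinding before GeoJSON export
assert not is_ccw(list(reversed(ccw_square)))
```

The same signed area, taken as an absolute value, is the planar polygon area, which is why winding checks come almost for free in any geometry pipeline.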
- Best free ($0): Gemini CLI Free — 1M context for GDAL documentation and spatial algorithm discussions.
- Best for spatial analysis ($20/mo): Claude Code — strongest reasoning for CRS transformations, spatial SQL, and algorithm correctness.
- Best for GIS codebases ($20/mo): Cursor Pro — indexes large geospatial projects, autocompletes GDAL/PostGIS patterns.
- Best combined ($40/mo): Claude Code + Cursor.
- Budget ($0): Copilot Free + Gemini CLI Free.
Why GIS/Geospatial Engineering Is Different
- Coordinate reference systems are invisible correctness requirements: Every geospatial dataset has a CRS, and every operation that combines two datasets must ensure their CRSs are compatible. WGS84 (EPSG:4326) and NAD83 (EPSG:4269) use nearly identical ellipsoid parameters but differ by up to two meters because WGS84 is earth-centered while NAD83 is fixed to the North American plate. The 2011 realization (NAD83(2011)) shifted coordinates by up to 20 centimeters relative to the original NAD83(1986). In Australia, GDA94 and GDA2020 differ by approximately 1.8 meters due to tectonic plate motion. Projections introduce their own distortions: Mercator preserves angles but grossly distorts areas at high latitudes (Greenland appears the size of Africa), Albers Equal Area preserves area but distorts shapes, and UTM zones only work within their 6-degree longitudinal band — a buffer operation that crosses a UTM zone boundary produces incorrect geometry. EPSG codes are the universal language: EPSG:4326 is geographic WGS84 (degrees), EPSG:3857 is Web Mercator (meters, used by every web map tile provider), EPSG:32632 is UTM Zone 32N. Mixing them up produces no error message — just wrong results. An AI tool that generates a spatial join without checking `gdf1.crs == gdf2.crs` first has introduced a silent data quality bug.
- Spatial indexing determines whether queries take milliseconds or hours: A spatial join between two tables of 10 million polygons each, without spatial indexing, requires 10^14 pairwise geometry comparisons — computationally impossible. With an R-tree index, the query engine uses bounding box intersection to filter candidates, reducing comparisons by orders of magnitude. PostGIS uses GiST (Generalized Search Tree) indexes that build R-trees over geometry bounding boxes; a query without
`CREATE INDEX idx ON table USING GIST(geom)` will sequential-scan the entire table for every `ST_Intersects` call. H3 (Uber's hexagonal hierarchical spatial index) partitions the earth into hexagonal cells at multiple resolutions, enabling approximate spatial joins through cell-ID matching instead of computational geometry. S2 (Google's spherical geometry library) uses Hilbert curves on the sphere for spatial indexing. Quadtrees, k-d trees, and space-filling curves each have different performance characteristics for different query patterns (point-in-polygon, range queries, nearest neighbor). Choosing the wrong indexing strategy — or forgetting to create an index entirely — is the single most common performance catastrophe in geospatial applications.
- Raster data operates at scales that break standard tooling: A single Landsat 8 Collection 2 Level-2 scene contains 11 spectral bands at 30-meter resolution (bands 1-7, 9) and 100-meter resolution (bands 10-11), plus a panchromatic band at 15 meters. The GeoTIFF for one band is roughly 60-120 MB; the full scene exceeds 1 GB. Sentinel-2 Level-2A provides 13 spectral bands at 10, 20, and 60-meter resolution, with a single tile (110 km x 110 km) approaching 800 MB. National-scale analysis requires hundreds or thousands of these scenes. Cloud Optimized GeoTIFF (COG) format stores tiles and overviews that allow HTTP range reads — you can read a 256x256 pixel window from a 100 GB raster without downloading the whole file. STAC (SpatioTemporal Asset Catalog) provides a standardized API and JSON specification for searching and accessing these datasets. Tools like Rasterio, xarray with rioxarray, and GDAL's virtual filesystem (`/vsicurl/`, `/vsis3/`) enable cloud-native workflows. An AI tool that suggests `np.array(rasterio.open(path).read())` for a multi-gigabyte raster has just killed your process with an out-of-memory error.
- Topology and geometric validity have strict mathematical rules: The OGC Simple Features specification defines what constitutes a valid geometry. A polygon should have its exterior ring oriented counter-clockwise and interior rings (holes) clockwise — the right-hand rule mandated by GeoJSON RFC 7946 (the Shapefile format winds rings in the opposite direction). A polygon must not self-intersect. A multipolygon must not have overlapping components. The DE-9IM (Dimensionally Extended 9-Intersection Model) matrix describes all possible topological relationships between two geometries using a 3x3 matrix of Interior, Boundary, and Exterior intersections.
`ST_Intersects`, `ST_Contains`, `ST_Touches`, `ST_Crosses`, and `ST_Within` are all specific DE-9IM patterns. Floating-point arithmetic in GEOS (the C++ geometry engine underlying PostGIS, Shapely, and most geospatial tools) produces numerical robustness issues: buffer operations create self-intersecting polygons, intersection operations produce slivers with near-zero area, and union operations fail with `TopologyException` when geometric predicates disagree due to floating-point rounding. The standard fix — `geometry.buffer(0)` to repair invalid geometries — is a hack that silently modifies your data. Understanding why these issues occur and how to handle them properly (snap-rounding, precision models, validation-before-operation) is core GIS engineering knowledge that AI tools rarely get right.
- The toolchain spans databases, visualization, and domain science simultaneously: A typical GIS project might use PostGIS for spatial data storage and querying, Python with GeoPandas for analysis, GDAL/OGR for format conversion, Rasterio for raster processing, pyproj for coordinate transformations, Shapely for computational geometry, Leaflet or MapLibre GL JS for web visualization, tippecanoe for vector tile generation, and domain-specific libraries like rioxarray for remote sensing or OSMnx for street network analysis. Each tool has its own conventions: PostGIS uses EWKB (Extended Well-Known Binary) internally but accepts WKT and GeoJSON as input; Shapely operates in a Cartesian plane with no CRS awareness; GeoPandas adds CRS tracking on top of Shapely geometries; GeoJSON mandates WGS84 coordinates in longitude-latitude order (the opposite of the latitude-longitude convention used by most mapping APIs). An AI tool must understand not just each library in isolation, but how they interact — that Shapely geometries lose CRS information, that PostGIS `geography` type uses great-circle calculations while `geometry` type uses Cartesian, that GeoJSON coordinates are [longitude, latitude] while most human-readable formats use [latitude, longitude].
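The R-tree intuition from the list above can be shown without any database: count candidate pairs before and after a bounding-box prefilter. A toy sketch in pure Python (brute-force boxes stand in for a real index; all names and the random-triangle setup are ours):

```python
import random

random.seed(42)

def bbox(pts):
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_intersect(a, b):
    # The same cheap test PostGIS's && operator answers from the GiST index
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def random_triangle():
    # Small triangle (~1 unit across) at a random spot in a 100x100 area
    cx, cy = random.uniform(0, 100), random.uniform(0, 100)
    return [(cx + random.uniform(-0.5, 0.5), cy + random.uniform(-0.5, 0.5))
            for _ in range(3)]

left = [bbox(random_triangle()) for _ in range(500)]
right = [bbox(random_triangle()) for _ in range(500)]

all_pairs = len(left) * len(right)  # 250,000 pairwise comparisons
candidates = sum(1 for a in left for b in right if bboxes_intersect(a, b))

# The prefilter discards the overwhelming majority of pairs; only the
# survivors would need an exact (expensive) geometry intersection test.
assert all_pairs == 250_000
assert candidates < all_pairs // 100
```

A real R-tree goes further by avoiding the brute-force box scan entirely (it walks a tree of nested boxes), but the candidate-reduction ratio is the same idea.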
GIS Task Support Matrix
| Task | Copilot | Cursor | Windsurf | Claude Code | Amazon Q | Gemini CLI |
|---|---|---|---|---|---|---|
| Spatial SQL & PostGIS | Good | Strong | Good | Excellent | Good | Strong |
| Vector Analysis (Shapely/GEOS/JTS) | Good | Strong | Fair | Excellent | Fair | Good |
| Raster Processing (GDAL/Rasterio) | Fair | Good | Fair | Strong | Fair | Good |
| CRS & Projections | Weak | Fair | Weak | Strong | Weak | Good |
| Web Mapping (Leaflet/MapLibre/Deck.gl) | Strong | Excellent | Good | Strong | Good | Good |
| Remote Sensing & Earth Observation | Weak | Fair | Weak | Strong | Fair | Good |
| Spatial Data Pipelines | Good | Strong | Fair | Strong | Strong | Good |
1. Spatial SQL & PostGIS
PostGIS transforms PostgreSQL into the most capable open-source spatial database on the planet. It implements the OGC Simple Features specification and extends it with hundreds of spatial functions, geography types for geodetic calculations, raster support, topology management, and 3D geometry operations. The difference between writing competent PostGIS queries and naive spatial SQL is the difference between a query that returns in 50 milliseconds and one that takes 45 minutes — or returns subtly wrong results because you used the geometry type with degree-based coordinates and got Cartesian distances in meaningless “degree units” instead of meters.
The critical distinctions that AI tools must understand: geometry vs geography types. The geometry type performs Cartesian calculations on a flat plane — perfect for projected coordinates (UTM, State Plane) but wrong for unprojected WGS84 coordinates where one degree of longitude varies from 111 km at the equator to 0 km at the poles. The geography type performs geodetic calculations on the ellipsoid — correct for WGS84 but slower and supporting fewer functions. ST_DWithin(geom_a, geom_b, 1000) on geometry columns with EPSG:4326 data finds features within 1000 degrees (which is everything on Earth); the same query on geography columns correctly finds features within 1000 meters. Index usage is equally critical: ST_Intersects automatically uses GiST indexes through the && bounding box operator, but ST_Distance does not — you must pair it with ST_DWithin for index-accelerated proximity queries. ST_Buffer on geographic coordinates in PostGIS 3.x correctly handles geodetic buffering, but older versions silently produced wrong results.
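The geometry-vs-geography trap above is easy to demonstrate numerically: Cartesian math treats a degree as a fixed unit, while real ground distance per degree of longitude shrinks with latitude. A pure-Python sketch, assuming a spherical earth of radius 6371 km (PostGIS's geography type uses the ellipsoid, so its numbers differ slightly):

```python
import math

def naive_degree_distance(lon1, lat1, lon2, lat2):
    """What geometry-type math does on EPSG:4326: Euclidean distance in 'degrees'."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

def haversine_km(lon1, lat1, lon2, lat2, r_km=6371.0):
    """Great-circle distance, the kind of geodetic answer geography gives."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r_km * math.asin(math.sqrt(a))

# One degree of longitude is the same "distance" in degree units everywhere...
assert naive_degree_distance(0, 0, 1, 0) == naive_degree_distance(0, 60, 1, 60) == 1.0
# ...but on the ground it is ~111 km at the equator and ~56 km at 60N
assert 111.0 < haversine_km(0, 0, 1, 0) < 111.4
assert 55.0 < haversine_km(0, 60, 1, 60) < 56.0
```

This is exactly why `ST_DWithin` with a meter threshold only makes sense on geography (or on a projected geometry column): the degree-space answer has no stable relationship to meters.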
The best AI tools for PostGIS understand these nuances. Claude Code consistently generates queries that use geography type for WGS84 distance calculations, pair ST_Distance with ST_DWithin for index usage, and include ST_SetSRID and ST_Transform when mixing CRSs. Cursor indexes your existing PostGIS migration files and autocompletes function signatures matching your project’s patterns. Copilot handles basic spatial SQL but frequently generates ST_Distance without index-accelerating predicates and uses geometry where geography is needed.
"""
PostGIS spatial query with proper SRID handling and index-aware patterns.
Finds all parcels within 500 meters of a flood zone boundary,
using geography type for correct geodetic distance on WGS84 data.
"""
import psycopg2
from psycopg2.extras import RealDictCursor
def find_parcels_near_flood_zones(
conn_params: dict,
buffer_meters: float = 500.0,
flood_zone_code: str = "AE",
) -> list[dict]:
"""Find parcels within buffer_meters of a specific FEMA flood zone.
Uses ST_DWithin on geography columns for:
1. Correct geodetic distance (meters on the ellipsoid)
2. Automatic GiST index usage (ST_DWithin triggers && operator)
"""
query = """
WITH flood_boundaries AS (
SELECT
fld_zone,
-- Cast to geography for geodetic operations
geom::geography AS geog
FROM fema_flood_zones
WHERE fld_zone = %(flood_zone_code)s
)
SELECT
p.parcel_id,
p.owner_name,
p.address,
ST_Area(p.geom::geography) AS area_sq_meters,
-- Minimum distance to any flood zone boundary (meters)
MIN(ST_Distance(
p.geom::geography,
fb.geog
)) AS dist_to_flood_zone_m,
-- Centroid for mapping (GeoJSON expects lon/lat WGS84)
ST_AsGeoJSON(ST_Centroid(p.geom))::json AS centroid_geojson
FROM parcels p
INNER JOIN flood_boundaries fb
-- ST_DWithin on geography uses meters and triggers GiST index
ON ST_DWithin(
p.geom::geography,
fb.geog,
%(buffer_meters)s
)
WHERE p.geom IS NOT NULL
AND ST_IsValid(p.geom) -- Skip invalid geometries
GROUP BY p.parcel_id, p.owner_name, p.address, p.geom
ORDER BY dist_to_flood_zone_m ASC;
"""
with psycopg2.connect(**conn_params) as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute(query, {
"flood_zone_code": flood_zone_code,
"buffer_meters": buffer_meters,
})
return cur.fetchall()
def ensure_spatial_indexes(conn_params: dict) -> None:
"""Create spatial indexes if they don't exist.
GiST indexes on geometry columns are essential for ST_DWithin,
ST_Intersects, and all spatial join performance.
"""
indexes = [
"""CREATE INDEX IF NOT EXISTS idx_parcels_geom
ON parcels USING GIST (geom)""",
"""CREATE INDEX IF NOT EXISTS idx_parcels_geom_geog
ON parcels USING GIST ((geom::geography))""",
"""CREATE INDEX IF NOT EXISTS idx_flood_zones_geom
ON fema_flood_zones USING GIST (geom)""",
"""CREATE INDEX IF NOT EXISTS idx_flood_zones_geom_geog
ON fema_flood_zones USING GIST ((geom::geography))""",
]
with psycopg2.connect(**conn_params) as conn:
with conn.cursor() as cur:
for idx_sql in indexes:
cur.execute(idx_sql)
conn.commit()
if __name__ == "__main__":
conn_params = {
"host": "localhost",
"port": 5432,
"dbname": "gis_db",
"user": "gis_user",
"password": "changeme", # Use environment variable in production
}
ensure_spatial_indexes(conn_params)
results = find_parcels_near_flood_zones(
conn_params,
buffer_meters=500.0,
flood_zone_code="AE",
)
print(f"Found {len(results)} parcels within 500m of AE flood zones")
for parcel in results[:5]:
print(
f" {parcel['parcel_id']}: {parcel['address']} "
f"({parcel['dist_to_flood_zone_m']:.1f}m from flood zone)"
)
2. Vector Analysis with Python
Vector analysis in the Python geospatial ecosystem centers on three tightly coupled libraries: Shapely for computational geometry (points, lines, polygons, and operations like buffer, intersection, union), GeoPandas for geospatial DataFrames (Shapely geometries with attribute data and CRS awareness), and Fiona (or the newer pyogrio) for reading and writing geospatial file formats. Under the hood, Shapely 2.x wraps GEOS (the C++ Geometry Engine Open Source), which is the same engine PostGIS uses — so the geometric operations are mathematically identical between your Python analysis and your PostGIS queries.
The critical pitfall in vector analysis is CRS management. Shapely operates in a Cartesian plane with no CRS awareness whatsoever: Point(0, 0).buffer(1) creates a circle of radius 1 in whatever units the coordinates happen to be in. If your coordinates are in WGS84 degrees, that “buffer of 1” is roughly 111 km at the equator and 0 km at the poles. GeoPandas adds CRS tracking through the .crs attribute and the .to_crs() method, but it does not prevent you from performing operations on GeoDataFrames with mismatched CRSs — it just warns you. A spatial join between a GeoDataFrame in EPSG:4326 and another in EPSG:32617 will silently produce wrong results unless you reproject first. The correct workflow: always reproject to a projected CRS appropriate for your study area before performing metric operations (buffering, area calculation, distance measurement), and reproject back to WGS84 for storage and display.
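The predicate at the heart of a point-in-polygon spatial join like the one discussed here reduces to ray casting in the same Cartesian plane Shapely works in. A minimal sketch (helper is ours, holes ignored; real code should rely on Shapely/GEOS, which also handles boundary and precision cases):

```python
def point_in_polygon(x: float, y: float, ring: list[tuple[float, float]]) -> bool:
    """Ray casting: cast a ray from (x, y) toward +x and count edge crossings.
    An odd count means the point is inside. Ring must be closed
    (first point == last); interior rings (holes) are not handled."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        crosses = (y1 > y) != (y2 > y)
        if crosses and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
assert point_in_polygon(5, 5, square)
assert not point_in_polygon(15, 5, square)
```

The units caveat from the paragraph above applies here too: the test is purely planar, so both the point and the ring must already be in the same CRS before it means anything.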
AI tools vary significantly in their handling of these patterns. Claude Code generates spatial join code that checks CRS compatibility and reprojects before operations. Cursor autocompletes GeoPandas method chains based on your existing codebase patterns. Copilot generates syntactically correct GeoPandas code but frequently omits CRS checks and produces buffer operations on geographic coordinates.
"""
GeoPandas spatial join with CRS transformation and proper metric operations.
Identifies which census tracts each school is located in, and calculates
the area of each tract and the distance from each school to its tract centroid.
"""
import geopandas as gpd
import pandas as pd
def load_and_validate(path: str, expected_geom_type: str | None = None) -> gpd.GeoDataFrame:
"""Load a geospatial file and validate geometry."""
gdf = gpd.read_file(path)
# Remove null geometries
null_count = gdf.geometry.isna().sum()
if null_count > 0:
print(f" Warning: dropping {null_count} null geometries from {path}")
gdf = gdf.dropna(subset=["geometry"])
# Repair invalid geometries with buffer(0)
invalid_mask = ~gdf.geometry.is_valid
invalid_count = invalid_mask.sum()
if invalid_count > 0:
print(f" Warning: repairing {invalid_count} invalid geometries in {path}")
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask].geometry.buffer(0)
if expected_geom_type:
actual_types = gdf.geometry.geom_type.unique()
print(f" Geometry types in {path}: {actual_types}")
print(f" Loaded {len(gdf)} features, CRS: {gdf.crs}")
return gdf
def schools_by_census_tract(
schools_path: str,
tracts_path: str,
target_epsg: int = 2163, # US National Atlas Equal Area
) -> gpd.GeoDataFrame:
"""Spatial join: assign each school to its census tract.
Workflow:
1. Load and validate both datasets
2. Reproject to a projected CRS for accurate metric operations
3. Calculate tract areas in square kilometers
4. Perform point-in-polygon spatial join
5. Calculate distance from each school to its tract centroid
"""
# Load data
schools = load_and_validate(schools_path, expected_geom_type="Point")
tracts = load_and_validate(tracts_path, expected_geom_type="Polygon")
# Reproject both to the same projected CRS for metric operations
# EPSG:2163 (US National Atlas Equal Area) preserves area across CONUS
schools_proj = schools.to_crs(epsg=target_epsg)
tracts_proj = tracts.to_crs(epsg=target_epsg)
# Calculate tract areas in projected CRS (square meters -> sq km)
tracts_proj["area_sq_km"] = tracts_proj.geometry.area / 1e6
# Calculate tract centroids in projected CRS (accurate for area calc)
tracts_proj["tract_centroid"] = tracts_proj.geometry.centroid
# Spatial join: point-in-polygon
# predicate="within" assigns each school to the tract it falls inside
joined = gpd.sjoin(
schools_proj,
tracts_proj[["GEOID", "NAME", "area_sq_km", "tract_centroid", "geometry"]],
how="left",
predicate="within",
)
# Calculate distance from each school to its tract centroid (meters)
joined["dist_to_centroid_m"] = joined.apply(
lambda row: row.geometry.distance(row["tract_centroid"])
if pd.notna(row.get("tract_centroid")) else None,
axis=1,
)
# Drop helper column and reproject back to WGS84 for output
joined = joined.drop(columns=["tract_centroid", "index_right"])
joined = joined.to_crs(epsg=4326)
# Report unmatched schools (outside any tract)
unmatched = joined["GEOID"].isna().sum()
if unmatched > 0:
print(f" Warning: {unmatched} schools did not fall within any census tract")
return joined
if __name__ == "__main__":
result = schools_by_census_tract(
schools_path="data/schools.geojson",
tracts_path="data/census_tracts.shp",
)
print(f"\nResults: {len(result)} schools matched to tracts")
print(result[["SCHOOL_NAME", "GEOID", "NAME", "area_sq_km", "dist_to_centroid_m"]].head(10))
# Export to GeoJSON (mandates WGS84 per RFC 7946)
result.to_file("output/schools_by_tract.geojson", driver="GeoJSON")
3. Raster Processing
Raster processing in the geospatial world means working with gridded data: satellite imagery, digital elevation models, land cover classifications, temperature grids, and any other dataset represented as a matrix of cell values georeferenced to the earth’s surface. The primary Python libraries are Rasterio (Pythonic API over GDAL for reading/writing raster data) and GDAL itself (the Swiss Army knife of geospatial format conversion and processing). The challenge is scale: a single Sentinel-2 tile at 10-meter resolution is 10,980 x 10,980 pixels per band, and a continental NDVI time series might span thousands of such tiles across dozens of dates.
Cloud Optimized GeoTIFF (COG) has transformed raster workflows. A COG stores data in tiled (typically 256x256 or 512x512 pixel) blocks with internal overviews, enabling HTTP range reads. Instead of downloading a 2 GB file to read a 256x256 pixel window, you fetch only the relevant tiles — perhaps 100 KB of data. Rasterio supports this natively through GDAL’s virtual filesystem: rasterio.open("https://example.com/image.tif") works transparently with COGs over HTTP. The critical performance pattern is windowed reads: dataset.read(1, window=window) reads only the pixels you need. AI tools that generate dataset.read() without a window argument for large rasters will attempt to load the entire raster into memory.
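The COG economics above can be made concrete with tile arithmetic: given the internal tile size, compute how many tiles a requested window touches and roughly how much data a range-read client fetches versus a full download. A sketch (uncompressed sizes; the helper and the 2 GB comparison figure are our assumptions):

```python
def tiles_touched(col_off: int, row_off: int, width: int, height: int,
                  tile: int = 256) -> int:
    """Number of internal COG tiles a read window overlaps."""
    cols = (col_off + width - 1) // tile - col_off // tile + 1
    rows = (row_off + height - 1) // tile - row_off // tile + 1
    return cols * rows

# A 256x256 window aligned to the tile grid touches exactly one tile;
# shifted by half a tile, it straddles four.
assert tiles_touched(0, 0, 256, 256) == 1
assert tiles_touched(128, 128, 256, 256) == 4

# Four tiles of uint16 data is ~0.5 MB fetched, versus gigabytes for the file
bytes_fetched = 4 * 256 * 256 * 2
assert bytes_fetched == 524_288
```

Compression and the COG header read add overhead in practice, but the ratio stays dramatic: the bytes transferred scale with the window, not with the file.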
Nodata handling is the other major correctness issue. Satellite imagery has nodata values for pixels obscured by clouds, outside the sensor footprint, or flagged during quality assessment. NDVI calculated without masking nodata pixels will produce meaningless values (often extremely negative or positive) at cloud locations, corrupting your analysis. The correct pattern: read the nodata value from the dataset metadata, create a boolean mask, and apply it to all band math operations using NumPy masked arrays or explicit conditional logic.
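The masking pattern can be sketched on a toy array before scaling it up to full scenes. The pixel values below are invented; a real scene carries its nodata value in the dataset metadata (for Landsat Collection 2 surface reflectance the fill value is 0):

```python
import numpy as np

NODATA = 0  # fill value, as it would be read from dataset metadata

red = np.array([[1500, 3000, NODATA],
                [2000,  500,  2500]], dtype=np.uint16)
nir = np.array([[4500, 3000, 4000],
                [NODATA, 4500, 2500]], dtype=np.uint16)

# A pixel is valid only if BOTH bands have data there
valid = (red != NODATA) & (nir != NODATA)

red_f = red.astype(np.float32)
nir_f = nir.astype(np.float32)

ndvi = np.full(red.shape, -9999.0, dtype=np.float32)  # output fill value
ndvi[valid] = (nir_f[valid] - red_f[valid]) / (nir_f[valid] + red_f[valid])

# Pixels where either band is nodata keep the fill value...
assert ndvi[0, 2] == -9999.0 and ndvi[1, 0] == -9999.0
# ...valid pixels land in [-1, 1]: (4500 - 1500) / (4500 + 1500) = 0.5
assert abs(ndvi[0, 0] - 0.5) < 1e-6
assert abs(ndvi[1, 1] - 0.8) < 1e-6
```

Skip the `valid` mask and the nodata zeros flow straight into the band math, yielding NDVI values of 1.0 or -1.0 at cloud and edge pixels that look plausible enough to poison downstream statistics.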
"""
NDVI calculation from Landsat 8 with proper nodata handling,
windowed reads for memory efficiency, and COG output.
"""
import numpy as np
import rasterio
from rasterio.windows import Window
from rasterio.enums import Resampling
def calculate_ndvi_windowed(
red_path: str,
nir_path: str,
output_path: str,
block_size: int = 512,
) -> dict:
"""Calculate NDVI from Landsat 8 red (Band 4) and NIR (Band 5).
NDVI = (NIR - Red) / (NIR + Red)
Uses windowed reads to process arbitrarily large rasters
without loading them entirely into memory.
Args:
red_path: Path to Band 4 (red) GeoTIFF
nir_path: Path to Band 5 (NIR) GeoTIFF
output_path: Path for output NDVI GeoTIFF (COG format)
block_size: Processing window size in pixels
Returns:
Dictionary with statistics (min, max, mean NDVI, valid pixel count)
"""
with rasterio.open(red_path) as red_ds, rasterio.open(nir_path) as nir_ds:
# Validate that both bands have the same dimensions and CRS
assert red_ds.crs == nir_ds.crs, (
f"CRS mismatch: red={red_ds.crs}, nir={nir_ds.crs}"
)
assert red_ds.shape == nir_ds.shape, (
f"Shape mismatch: red={red_ds.shape}, nir={nir_ds.shape}"
)
assert red_ds.transform == nir_ds.transform, (
"Transform mismatch between red and NIR bands"
)
red_nodata = red_ds.nodata
nir_nodata = nir_ds.nodata
height, width = red_ds.shape
# Output profile: COG-compatible GeoTIFF
profile = red_ds.profile.copy()
profile.update(
dtype=rasterio.float32,
count=1,
nodata=-9999.0,
compress="deflate",
tiled=True,
blockxsize=256,
blockysize=256,
)
# Statistics accumulators
ndvi_sum = 0.0
ndvi_min = float("inf")
ndvi_max = float("-inf")
valid_count = 0
with rasterio.open(output_path, "w", **profile) as dst:
# Process in windows to control memory usage
for row_off in range(0, height, block_size):
for col_off in range(0, width, block_size):
# Clip window to raster extent
win_height = min(block_size, height - row_off)
win_width = min(block_size, width - col_off)
window = Window(col_off, row_off, win_width, win_height)
# Read bands as float32 for division
red = red_ds.read(1, window=window).astype(np.float32)
nir = nir_ds.read(1, window=window).astype(np.float32)
# Build nodata mask
valid_mask = np.ones(red.shape, dtype=bool)
if red_nodata is not None:
valid_mask &= (red != red_nodata)
if nir_nodata is not None:
valid_mask &= (nir != nir_nodata)
# Avoid division by zero where red + nir == 0
denominator = nir + red
valid_mask &= (denominator != 0)
# Calculate NDVI only for valid pixels
ndvi = np.full(red.shape, -9999.0, dtype=np.float32)
ndvi[valid_mask] = (
(nir[valid_mask] - red[valid_mask]) /
denominator[valid_mask]
)
# Clamp NDVI to valid range [-1, 1]
valid_ndvi = ndvi[valid_mask]
ndvi[valid_mask] = np.clip(valid_ndvi, -1.0, 1.0)
# Update statistics
if valid_ndvi.size > 0:
ndvi_sum += np.sum(ndvi[valid_mask])
ndvi_min = min(ndvi_min, float(np.min(ndvi[valid_mask])))
ndvi_max = max(ndvi_max, float(np.max(ndvi[valid_mask])))
valid_count += int(np.sum(valid_mask))
# Write window to output
dst.write(ndvi, 1, window=window)
# Build overviews for COG compatibility
with rasterio.open(output_path, "r+") as dst:
overview_levels = [2, 4, 8, 16]
dst.build_overviews(overview_levels, Resampling.nearest)
dst.update_tags(ns="rio_overview", resampling="nearest")
stats = {
"valid_pixels": valid_count,
"total_pixels": height * width,
"coverage_pct": round(100 * valid_count / (height * width), 2),
"ndvi_min": round(ndvi_min, 4) if valid_count > 0 else None,
"ndvi_max": round(ndvi_max, 4) if valid_count > 0 else None,
"ndvi_mean": round(ndvi_sum / valid_count, 4) if valid_count > 0 else None,
}
return stats
if __name__ == "__main__":
# Landsat 8 Collection 2 Level-2 surface reflectance bands
stats = calculate_ndvi_windowed(
red_path="LC08_L2SP_015033_20240601_20240605_02_T1_SR_B4.TIF",
nir_path="LC08_L2SP_015033_20240601_20240605_02_T1_SR_B5.TIF",
output_path="output/ndvi_20240601.tif",
block_size=512,
)
print("NDVI Statistics:")
for key, value in stats.items():
print(f" {key}: {value}")
4. Coordinate Reference Systems
Coordinate reference systems are the single most error-prone aspect of geospatial engineering, and the area where AI tools cause the most silent data quality issues. Every coordinate pair is meaningless without its CRS. The number (-77.0369, 38.9072) could be WGS84 longitude/latitude (placing a point in Washington, D.C.), or it could be coordinates in some projected CRS that places it somewhere entirely different. The problem is compounded by the fact that CRS information is often implicit: GeoJSON mandates WGS84 per RFC 7946, but Shapefiles carry CRS in a separate .prj file that is frequently missing or incorrect, and CSV files with lat/lon columns have no CRS metadata at all.
The pyproj library is the Python interface to PROJ, the foundational coordinate transformation engine used by GDAL, PostGIS, QGIS, and virtually every other geospatial tool. The modern API uses the Transformer class, which is thread-safe and supports datum transformations, pipeline definitions, and authority-based CRS lookups. The critical pitfalls: (1) always use Transformer.from_crs() with always_xy=True to ensure longitude-first axis order regardless of the CRS definition (WGS84’s official axis order is latitude-first, but most software expects longitude-first); (2) datum transformations between NAD27 and WGS84 can shift coordinates by hundreds of meters and require grid-based corrections (NADCON) that are not applied unless you explicitly use the correct transformation pipeline; (3) the PROJ database includes transformation accuracy metadata — always check transformer.accuracy to know the expected error; (4) when converting between geographic and projected CRS, understand what your projection preserves (area, shape, distance, direction) and what it distorts.
AI tools are particularly weak here because CRS issues are fundamentally about domain knowledge, not syntax. Every tool can generate gdf.to_crs(epsg=4326) syntactically. The question is whether the tool understands when to reproject, which CRS to choose for a given operation, and what the implications are. Claude Code is the strongest: it recommends appropriate projected CRSs for specific regions (UTM zones for local accuracy, Albers Equal Area for continental area calculations, Web Mercator only for display), warns about datum shifts, and explains why always_xy=True matters.
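The "Web Mercator only for display" advice has a one-line mathematical basis: spherical Mercator's linear scale factor is 1/cos(latitude), so areas are inflated by 1/cos²(latitude). A sketch of the forward projection on the spherical model EPSG:3857 assumes (helper names are ours):

```python
import math

R = 6378137.0  # WGS84 semi-major axis; EPSG:3857 uses it as a sphere radius

def web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Forward spherical Mercator: degrees -> meters on the EPSG:3857 grid."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def area_inflation(lat_deg: float) -> float:
    """Factor by which Mercator exaggerates area at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# The projected world is a square: x and y both top out near 20,037,508 m,
# which is why the map is clipped at latitude ~85.0511
x, _ = web_mercator(180.0, 0.0)
_, y = web_mercator(0.0, 85.05112878)
assert abs(x - 20037508.34) < 1.0
assert abs(y - 20037508.34) < 1.0

# At Greenland's latitudes (~70N), every square kilometer is drawn ~8.5x too large
assert 8.0 < area_inflation(70.0) < 9.0
```

Measure area or distance on these coordinates and the inflation factor flows straight into your numbers, which is why metric work belongs in an equal-area or local projected CRS and EPSG:3857 belongs only in the tile renderer.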
"""
Coordinate transformation with pyproj: proper datum handling,
pipeline selection, and accuracy validation.
"""
from pyproj import Transformer, CRS
def transform_coordinates(
coords: list[tuple[float, float]],
source_crs: str | int,
target_crs: str | int,
always_xy: bool = True,
) -> list[tuple[float, float]]:
"""Transform coordinates between CRSs with accuracy reporting.
Args:
coords: List of (x, y) coordinate pairs
source_crs: Source CRS as EPSG code (int) or WKT/PROJ string
target_crs: Target CRS as EPSG code (int) or WKT/PROJ string
always_xy: If True, coordinates are always (x/lon, y/lat) order
Returns:
List of transformed (x, y) coordinate pairs
"""
src = CRS.from_user_input(source_crs)
tgt = CRS.from_user_input(target_crs)
print(f"Source CRS: {src.name} (EPSG:{src.to_epsg()})")
print(f" Type: {src.type_name}")
print(f" Datum: {src.datum.name}")
print(f" Ellipsoid: {src.ellipsoid.name}")
print(f"Target CRS: {tgt.name} (EPSG:{tgt.to_epsg()})")
print(f" Type: {tgt.type_name}")
print(f" Datum: {tgt.datum.name}")
# Create transformer - pyproj automatically selects the best
# available transformation pipeline based on area of use and accuracy
transformer = Transformer.from_crs(
src,
tgt,
always_xy=always_xy, # Critical: ensures lon/x first, lat/y second
)
# Report the transformation pipeline and accuracy
print(f"\nTransformation pipeline:")
print(f" {transformer.description}")
if transformer.accuracy is not None and transformer.accuracy >= 0:
print(f" Estimated accuracy: {transformer.accuracy} meters")
else:
print(" Accuracy: unknown (no grid files or ballpark only)")
# Transform all coordinates
xs, ys = zip(*coords)
tx, ty = transformer.transform(xs, ys)
# Handle both scalar and array returns
if isinstance(tx, float):
return [(tx, ty)]
return list(zip(tx, ty))
def find_utm_zone(longitude: float, latitude: float) -> int:
"""Determine the correct UTM EPSG code for a given lon/lat.
UTM zones are 6 degrees wide, numbered 1-60 from 180W to 180E.
North zones use EPSG 326xx, South zones use EPSG 327xx.
"""
zone_number = int((longitude + 180) / 6) + 1
# Handle edge cases at zone boundaries
zone_number = max(1, min(60, zone_number))
if latitude >= 0:
epsg = 32600 + zone_number # North
else:
epsg = 32700 + zone_number # South
print(f"UTM Zone for ({longitude}, {latitude}): Zone {zone_number}"
f"{'N' if latitude >= 0 else 'S'} (EPSG:{epsg})")
return epsg
def reproject_for_analysis(
longitude: float,
latitude: float,
buffer_meters: float = 1000.0,
) -> dict:
"""Demonstrate the correct workflow for metric spatial operations.
1. Start with WGS84 coordinates
2. Determine appropriate projected CRS
3. Transform to projected CRS for metric operations
4. Perform buffer/distance/area calculations in meters
5. Transform results back to WGS84 for display/storage
"""
    from shapely.geometry import Point
    from shapely.ops import transform as shapely_transform
# Step 1: Find the appropriate UTM zone for this location
utm_epsg = find_utm_zone(longitude, latitude)
# Step 2: Create transformer to UTM
wgs84_to_utm = Transformer.from_crs(
"EPSG:4326", f"EPSG:{utm_epsg}", always_xy=True
)
utm_to_wgs84 = Transformer.from_crs(
f"EPSG:{utm_epsg}", "EPSG:4326", always_xy=True
)
# Step 3: Transform point to UTM (meters)
utm_x, utm_y = wgs84_to_utm.transform(longitude, latitude)
print(f"\nWGS84: ({longitude}, {latitude})")
print(f"UTM: ({utm_x:.2f}, {utm_y:.2f}) meters")
# Step 4: Create buffer in projected CRS (correct metric buffer)
point_utm = Point(utm_x, utm_y)
buffer_utm = point_utm.buffer(buffer_meters)
print(f"Buffer area: {buffer_utm.area:.2f} sq meters "
f"({buffer_utm.area / 1e6:.6f} sq km)")
# Step 5: Transform buffer back to WGS84 for GeoJSON export
buffer_wgs84 = shapely_transform(
lambda x, y: utm_to_wgs84.transform(x, y),
buffer_utm,
)
return {
"center_wgs84": (longitude, latitude),
"center_utm": (utm_x, utm_y),
"utm_epsg": utm_epsg,
"buffer_meters": buffer_meters,
"buffer_area_sq_m": buffer_utm.area,
"buffer_wgs84_wkt": buffer_wgs84.wkt,
}
if __name__ == "__main__":
# Example 1: Transform NAD27 to WGS84 (requires datum shift)
print("=== NAD27 to WGS84 (datum transformation) ===")
nad27_coords = [(-77.0369, 38.9072), (-122.4194, 37.7749)]
wgs84_coords = transform_coordinates(
nad27_coords,
source_crs=4267, # NAD27
target_crs=4326, # WGS84
)
for orig, transformed in zip(nad27_coords, wgs84_coords):
dx = transformed[0] - orig[0]
dy = transformed[1] - orig[1]
print(f" {orig} -> ({transformed[0]:.6f}, {transformed[1]:.6f}) "
f"shift: ({dx:.6f} deg, {dy:.6f} deg)")
# Example 2: Correct metric buffer workflow
print("\n=== Metric buffer in UTM ===")
result = reproject_for_analysis(
longitude=-77.0369,
latitude=38.9072,
buffer_meters=500.0,
)
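The degree-versus-meter distinction the metric workflow above protects against is easy to quantify. A minimal sketch (assuming pyproj is installed) showing how far 0.001 degrees of longitude actually is at different latitudes:

```python
from pyproj import Geod

# Geodesic distances on the WGS84 ellipsoid
geod = Geod(ellps="WGS84")

for lat in (0.0, 45.0, 60.0, 80.0):
    # Ground distance covered by 0.001 degrees of longitude at this latitude
    _, _, dist_m = geod.inv(0.0, lat, 0.001, lat)
    print(f"0.001 deg of longitude at {lat:>4.0f}N = {dist_m:7.2f} m")

# The spread is why Point(lon, lat).buffer(0.001) on WGS84 coordinates
# produces a "circle" whose real size depends on where it sits on the globe.
```

At the equator 0.001 degrees is roughly 111 m; by 80 degrees north it has shrunk to under 20 m, which is exactly the silent distortion a naive degree buffer bakes into an analysis.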
5. Web Mapping
Web mapping is where geospatial data meets the browser, and the engineering challenges shift from analytical correctness to rendering performance. The two dominant libraries in 2026 are Leaflet (lightweight, raster-tile focused, massive plugin ecosystem) and MapLibre GL JS (GPU-accelerated, vector-tile native, data-driven styling). Deck.gl adds WebGL-powered large-scale data visualization on top of either base map. The critical technical decisions: raster tiles vs vector tiles, client-side vs server-side rendering, GeoJSON vs vector tile formats, and how to handle the inevitable EPSG:3857 (Web Mercator) projection that all web map tile systems use.
Vector tiles have become the standard for interactive web maps. Unlike raster tiles (pre-rendered PNG images), vector tiles transmit geometry and attributes as Protocol Buffer-encoded data, allowing client-side styling, smooth zooming, and dynamic label placement. MapLibre GL JS renders vector tiles on the GPU using WebGL, enabling fluid interaction with millions of features. The standard pipeline: use tippecanoe (from Felt/Mapbox) to generate .pmtiles or .mbtiles from GeoJSON or FlatGeobuf sources, serve them from a static file host or S3, and render with MapLibre GL JS using data-driven style expressions. The alternative — loading raw GeoJSON into the browser — works for datasets under approximately 10,000 features but degrades rapidly beyond that. A 50 MB GeoJSON file of building footprints will freeze the browser during parsing; the same data as vector tiles loads progressively and renders smoothly at any zoom level.
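The tippecanoe step of that pipeline is a single command; a sketch of composing it from Python, where the file names and layer name are placeholders, tippecanoe is assumed to be on PATH, and the actual run is left commented out:

```python
import subprocess

# Standard tippecanoe flags: -zg chooses a max zoom from the data density,
# --drop-densest-as-needed thins features to keep each tile under size limits.
cmd = [
    "tippecanoe",
    "-o", "buildings.pmtiles",   # PMTiles output (use .mbtiles if preferred)
    "-zg",
    "--drop-densest-as-needed",
    "--layer", "buildings",      # becomes the source-layer name in styles
    "buildings.geojson",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once tippecanoe is installed
```

The `--layer` value matters downstream: it is the `source-layer` string that MapLibre style layers must reference.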
AI tools are generally strong on web mapping because it overlaps heavily with mainstream web development. Cursor excels here: it indexes your MapLibre style specifications and autocompletes data-driven expressions, layer configurations, and event handlers. Copilot is surprisingly good at Leaflet plugin integration and basic tile layer setup. Claude Code handles the spatial reasoning aspects — choosing appropriate zoom levels, understanding tile coordinate math, implementing correct GeoJSON winding order for rendering.
/**
* MapLibre GL JS map with vector tile source, data-driven styling,
* and interactive popups. Displays census tract income data with
* a choropleth color ramp and tooltip on hover.
*/
// Initialize the map
const map = new maplibregl.Map({
container: "map",
style: {
version: 8,
sources: {
        // OpenStreetMap raster tiles (swap in any raster tile provider)
        "osm-tiles": {
          type: "raster",
          tiles: [
            "https://tile.openstreetmap.org/{z}/{x}/{y}.png"
          ],
          tileSize: 256,
          attribution:
            '© OpenStreetMap contributors',
},
},
layers: [
{
id: "osm-base",
type: "raster",
source: "osm-tiles",
},
],
},
center: [-77.0369, 38.9072], // Washington, DC [lon, lat]
zoom: 10,
maxZoom: 18,
minZoom: 3,
});
map.on("load", () => {
  // Add vector tile source for census tracts.
  // PMTiles served from static hosting (S3, Cloudflare R2, etc.) via the
  // PMTiles protocol, which must be registered before the map loads:
  //   const protocol = new pmtiles.Protocol();
  //   maplibregl.addProtocol("pmtiles", protocol.tile);
map.addSource("census-tracts", {
type: "vector",
url: "pmtiles://https://data.example.com/census-tracts-2024.pmtiles",
promoteId: "GEOID", // Use GEOID as the feature ID for state management
});
// Choropleth fill layer with data-driven color based on median income
map.addLayer(
{
id: "tracts-fill",
type: "fill",
source: "census-tracts",
"source-layer": "census_tracts",
paint: {
"fill-color": [
"step",
["get", "median_income"],
"#d73027", // < $25,000 - red
25000,
"#fc8d59", // $25,000 - $40,000 - orange
40000,
"#fee08b", // $40,000 - $60,000 - yellow
60000,
"#d9ef8b", // $60,000 - $80,000 - light green
80000,
"#91cf60", // $80,000 - $100,000 - green
100000,
"#1a9850", // > $100,000 - dark green
],
"fill-opacity": [
"case",
["boolean", ["feature-state", "hover"], false],
0.85,
0.6,
],
},
},
"osm-base" // Insert below labels
);
// Tract boundary outlines
map.addLayer({
id: "tracts-outline",
type: "line",
source: "census-tracts",
"source-layer": "census_tracts",
paint: {
"line-color": "#333333",
"line-width": [
"interpolate",
["linear"],
["zoom"],
8, 0.2, // Thin lines at low zoom
14, 1.5, // Thicker lines at high zoom
],
"line-opacity": 0.5,
},
});
// Hover state management
let hoveredTractId = null;
map.on("mousemove", "tracts-fill", (e) => {
if (e.features.length === 0) return;
// Clear previous hover state
if (hoveredTractId !== null) {
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: false }
);
}
hoveredTractId = e.features[0].id;
// Set new hover state
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: true }
);
// Update cursor
map.getCanvas().style.cursor = "pointer";
});
map.on("mouseleave", "tracts-fill", () => {
if (hoveredTractId !== null) {
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: false }
);
}
hoveredTractId = null;
map.getCanvas().style.cursor = "";
});
// Click popup with tract details
map.on("click", "tracts-fill", (e) => {
if (e.features.length === 0) return;
const props = e.features[0].properties;
const income = props.median_income
? `$${Number(props.median_income).toLocaleString()}`
: "N/A";
const population = props.total_population
? Number(props.total_population).toLocaleString()
: "N/A";
new maplibregl.Popup({ maxWidth: "320px" })
.setLngLat(e.lngLat)
    .setHTML(
      `<h3>Census Tract ${props.GEOID}</h3>
       <p>County: ${props.county_name || "N/A"}</p>
       <p>Median Income: ${income}</p>
       <p>Population: ${population}</p>
       <p>Area: ${props.area_sq_km
         ? Number(props.area_sq_km).toFixed(2) + " km²"
         : "N/A"}</p>`
    )
.addTo(map);
});
});
// Navigation controls
map.addControl(new maplibregl.NavigationControl(), "top-right");
map.addControl(
new maplibregl.ScaleControl({ maxWidth: 200, unit: "metric" }),
"bottom-left"
);
6. Remote Sensing & Earth Observation
Remote sensing engineering has been transformed by the cloud-native geospatial movement. Instead of downloading terabytes of satellite imagery to local storage, the modern workflow queries STAC (SpatioTemporal Asset Catalog) APIs to search for imagery matching spatial, temporal, and quality criteria, then reads only the required pixels via HTTP range reads from Cloud Optimized GeoTIFFs stored in cloud object storage. The key libraries: pystac-client for STAC API access, rioxarray for loading raster data into xarray Datasets with CRS awareness, and Dask for lazy evaluation and distributed processing of datasets larger than memory.
The STAC ecosystem has standardized access to the world’s major Earth observation archives. Microsoft Planetary Computer hosts Sentinel-2, Landsat, NAIP, and dozens of other datasets behind a STAC API. Element 84’s Earth Search indexes Sentinel-2 and Landsat in AWS S3. NASA’s Common Metadata Repository (CMR) provides STAC access to the full NASA archive. The workflow is consistent: search by bounding box, date range, cloud cover percentage, and collection name; retrieve item metadata including asset URLs; load specific bands as xarray DataArrays; perform analysis. The performance-critical decision is how much data to load: a query that returns 500 Sentinel-2 tiles without filtering by cloud cover or limiting the date range will attempt to process hundreds of gigabytes.
AI tools vary dramatically in their remote sensing capability. Claude Code understands STAC query construction, band math for spectral indices (NDVI, NDWI, EVI, NBR), cloud masking using quality assessment bands, and the difference between surface reflectance and top-of-atmosphere products. Gemini CLI’s large context window is valuable for discussing sensor specifications, band wavelength ranges, and atmospheric correction methods. Copilot and Cursor handle xarray syntax well but lack the domain knowledge to construct correct STAC queries or choose appropriate spectral bands.
"""
Cloud-native remote sensing workflow: search Sentinel-2 imagery via STAC,
load specific bands with rioxarray, compute NDVI, and mask clouds.
"""
import planetary_computer
import pystac_client
import rioxarray
import xarray as xr
import numpy as np
from rasterio.enums import Resampling
def search_sentinel2_imagery(
bbox: tuple[float, float, float, float],
date_range: str,
max_cloud_cover: int = 20,
max_items: int = 10,
) -> list:
"""Search Microsoft Planetary Computer for Sentinel-2 L2A imagery.
Args:
bbox: Bounding box as (west, south, east, north) in WGS84 degrees
date_range: Date range as "YYYY-MM-DD/YYYY-MM-DD"
max_cloud_cover: Maximum cloud cover percentage (0-100)
max_items: Maximum number of items to return
Returns:
List of STAC items matching the search criteria
"""
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
modifier=planetary_computer.sign_inplace,
)
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=bbox,
datetime=date_range,
query={"eo:cloud_cover": {"lt": max_cloud_cover}},
max_items=max_items,
sortby=[{"field": "properties.eo:cloud_cover", "direction": "asc"}],
)
items = list(search.items())
print(f"Found {len(items)} Sentinel-2 scenes with <{max_cloud_cover}% cloud cover")
for item in items:
props = item.properties
print(
f" {item.id}: {props['datetime'][:10]}, "
f"cloud: {props['eo:cloud_cover']:.1f}%, "
f"platform: {props.get('platform', 'unknown')}"
)
return items
def compute_ndvi_from_stac(
item,
bbox: tuple[float, float, float, float],
resolution: int = 10,
) -> xr.DataArray:
"""Load Sentinel-2 red and NIR bands and compute NDVI.
Uses rioxarray for CRS-aware loading with automatic reprojection
and spatial subsetting to the area of interest.
Sentinel-2 L2A bands:
- B04 (Red): 665nm, 10m resolution
- B08 (NIR): 842nm, 10m resolution
- SCL (Scene Classification): 20m resolution (for cloud masking)
Args:
item: STAC item from search results
bbox: Area of interest (west, south, east, north) in WGS84
resolution: Target resolution in meters
Returns:
xr.DataArray with NDVI values, cloud-masked
"""
    # Load red band (B04) clipped to AOI. Note: no overview_level argument;
    # overview_level=0 would read the first (downsampled) overview rather
    # than the full-resolution data.
    red_href = item.assets["B04"].href
    red = rioxarray.open_rasterio(red_href).squeeze()
    red = red.rio.clip_box(*bbox, crs="EPSG:4326")
    red = red.astype(np.float32)
    # Load NIR band (B08) clipped to AOI
    nir_href = item.assets["B08"].href
    nir = rioxarray.open_rasterio(nir_href).squeeze()
    nir = nir.rio.clip_box(*bbox, crs="EPSG:4326")
    nir = nir.astype(np.float32)
    # Load Scene Classification Layer for cloud masking
    scl_href = item.assets["SCL"].href
    scl = rioxarray.open_rasterio(scl_href).squeeze()
    scl = scl.rio.clip_box(*bbox, crs="EPSG:4326")
    # Reproject SCL to match red/NIR resolution (SCL is 20m, bands are 10m);
    # nearest-neighbor preserves the categorical class values
    scl = scl.rio.reproject_match(red, resampling=Resampling.nearest)
    # Sentinel-2 L2A scale factor: DN / 10000 -> surface reflectance.
    # Caveat: scenes from processing baseline >= 04.00 (late Jan 2022
    # onward) carry an additional +1000 DN offset (BOA_ADD_OFFSET) that
    # should be subtracted before scaling for radiometrically correct values.
    red = red / 10000.0
    nir = nir / 10000.0
# Create cloud mask from SCL values
# SCL classes to mask out:
# 0: No data, 1: Saturated/defective, 2: Dark area pixels,
# 3: Cloud shadow, 8: Cloud medium probability,
# 9: Cloud high probability, 10: Thin cirrus, 11: Snow/ice
cloud_mask = scl.isin([0, 1, 2, 3, 8, 9, 10, 11])
clear_mask = ~cloud_mask
# Calculate NDVI: (NIR - Red) / (NIR + Red)
denominator = nir + red
ndvi = xr.where(
(denominator != 0) & clear_mask,
(nir - red) / denominator,
np.nan,
)
# Clip to valid NDVI range
ndvi = ndvi.clip(-1.0, 1.0)
# Preserve CRS and spatial metadata
ndvi = ndvi.rio.write_crs(red.rio.crs)
ndvi = ndvi.rio.write_transform()
ndvi = ndvi.rio.write_nodata(np.nan)
# Add metadata
ndvi.attrs["long_name"] = "Normalized Difference Vegetation Index"
ndvi.attrs["source"] = item.id
ndvi.attrs["date"] = item.properties["datetime"][:10]
ndvi.attrs["cloud_cover_pct"] = item.properties["eo:cloud_cover"]
clear_pct = float(clear_mask.sum()) / float(clear_mask.size) * 100
print(f" NDVI computed: {ndvi.shape}, clear pixels: {clear_pct:.1f}%")
print(f" NDVI range: [{float(ndvi.min()):.3f}, {float(ndvi.max()):.3f}]")
print(f" CRS: {ndvi.rio.crs}")
return ndvi
if __name__ == "__main__":
# Area of interest: agricultural region in Iowa
aoi_bbox = (-93.8, 41.5, -93.4, 41.8)
# Search for summer 2025 imagery
items = search_sentinel2_imagery(
bbox=aoi_bbox,
date_range="2025-06-01/2025-08-31",
max_cloud_cover=15,
max_items=5,
)
if items:
# Compute NDVI from the least cloudy scene
best_item = items[0]
print(f"\nProcessing: {best_item.id}")
ndvi = compute_ndvi_from_stac(best_item, bbox=aoi_bbox)
        # Save as a true Cloud Optimized GeoTIFF: GDAL's COG driver adds
        # internal tiling and overviews (plain GTiff with tiled=True does not)
        import os
        os.makedirs("output", exist_ok=True)
        output_path = f"output/ndvi_{best_item.properties['datetime'][:10]}.tif"
        ndvi.rio.to_raster(
            output_path,
            driver="COG",
            compress="deflate",
        )
print(f"\nSaved NDVI raster to {output_path}")
7. Spatial Data Pipelines
Spatial data pipelines — the ETL (Extract, Transform, Load) workflows that convert, clean, reproject, and consolidate geospatial data from disparate sources into analysis-ready formats — are where GIS engineers spend a surprising proportion of their time. The sources are heterogeneous: Shapefiles from government agencies (still the most common exchange format despite being a 1990s design with a 2 GB size limit and no support for field names longer than 10 characters), GeoJSON from web APIs, GeoPackage from QGIS projects, KML from Google Earth, CSV with lat/lon columns, GeoParquet from modern data lakehouse architectures, and proprietary formats like Esri FileGDB. The pipeline must handle CRS detection and transformation, geometry validation and repair, attribute schema normalization, and often deduplication based on spatial proximity.
The modern geospatial data engineering stack has evolved significantly. GeoParquet (the Parquet format with GeoArrow-encoded geometry columns and CRS metadata) has emerged as the standard analytical format, combining Parquet’s columnar compression and predicate pushdown with spatial metadata. DuckDB’s spatial extension enables SQL-based spatial analysis directly on Parquet and GeoParquet files without loading data into a database first. Apache Sedona (formerly GeoSpark) provides distributed spatial processing on Spark for datasets that exceed single-machine capacity. GDAL/OGR’s ogr2ogr command-line tool remains the universal format converter, capable of reading and writing over 80 vector formats with CRS transformation, SQL-based filtering, and geometry reprojection in a single command.
AI tools are useful for spatial pipeline construction because much of the work is boilerplate: reading a format, transforming coordinates, validating geometry, writing to another format. Claude Code generates correct ogr2ogr command lines with proper SRID handling and format-specific creation options. Cursor autocompletes DuckDB spatial SQL from your existing query patterns. Amazon Q is strong on AWS-native spatial pipelines (Athena with GeoParquet, Glue for spatial ETL, Redshift spatial functions).
"""
Spatial data pipeline with DuckDB: convert Shapefiles to GeoParquet,
perform spatial analysis, and export results.
"""
import duckdb
from pathlib import Path
def build_spatial_pipeline(
input_dir: str,
output_path: str,
target_srid: int = 4326,
) -> dict:
"""Spatial ETL pipeline using DuckDB spatial extension.
1. Load Shapefiles and GeoJSON files from input directory
2. Standardize CRS to WGS84
3. Validate and repair geometries
4. Perform spatial aggregation (points per polygon)
5. Export to GeoParquet
Args:
input_dir: Directory containing source geospatial files
output_path: Path for output GeoParquet file
target_srid: Target SRID for output (default: 4326 WGS84)
Returns:
Dictionary with pipeline statistics
"""
con = duckdb.connect()
# Install and load spatial extension
con.execute("INSTALL spatial; LOAD spatial;")
    # Load building permits (points) from GeoJSON. ST_Read parses the
    # geometry directly into a GEOMETRY column named "geom"; no
    # ST_GeomFromGeoJSON / ST_GeomFromWKB conversion is needed.
    con.execute(f"""
        CREATE TABLE permits AS
        SELECT * FROM ST_Read('{input_dir}/building_permits.geojson')
    """)
    # Load neighborhood boundaries (polygons) from Shapefile
    con.execute(f"""
        CREATE TABLE neighborhoods AS
        SELECT * FROM ST_Read('{input_dir}/neighborhoods.shp')
    """)
# Check and report source CRS info
permit_count = con.execute("SELECT COUNT(*) FROM permits").fetchone()[0]
neighborhood_count = con.execute("SELECT COUNT(*) FROM neighborhoods").fetchone()[0]
print(f"Loaded {permit_count} permits, {neighborhood_count} neighborhoods")
    # Validate geometries and repair invalid ones. Count them first:
    # DuckDB has no SQLite-style changes() function to query afterwards.
    invalid_repaired = con.execute("""
        SELECT COUNT(*) FROM neighborhoods WHERE NOT ST_IsValid(geom)
    """).fetchone()[0]
    con.execute("""
        UPDATE neighborhoods
        SET geom = ST_MakeValid(geom)
        WHERE NOT ST_IsValid(geom)
    """)
    print(f"Repaired {invalid_repaired} invalid neighborhood geometries")
    # Transform CRS if needed. DuckDB geometries do not carry an SRID,
    # so the source CRS must be known out of band; the sources here are
    # assumed to be WGS84, and ST_Transform takes explicit from/to CRSs.
    if target_srid != 4326:
        con.execute(f"""
            UPDATE permits
            SET geom = ST_Transform(geom, 'EPSG:4326', 'EPSG:{target_srid}')
        """)
# Spatial join: count permits per neighborhood and calculate density
con.execute("""
CREATE TABLE neighborhood_stats AS
SELECT
n.neighborhood_id,
n.name AS neighborhood_name,
n.geom,
            -- Equal-area CRS (EPSG:5070, CONUS Albers) for area, not Web
            -- Mercator, which inflates areas away from the equator
            ST_Area(
                ST_Transform(n.geom, 'EPSG:4326', 'EPSG:5070')
            ) / 1e6 AS area_sq_km,
COUNT(p.permit_id) AS permit_count,
COUNT(CASE WHEN p.permit_type = 'NEW_CONSTRUCTION' THEN 1 END)
AS new_construction_count,
COUNT(CASE WHEN p.permit_type = 'RENOVATION' THEN 1 END)
AS renovation_count,
SUM(COALESCE(p.estimated_cost, 0)) AS total_investment,
AVG(COALESCE(p.estimated_cost, 0))
FILTER (WHERE p.estimated_cost > 0) AS avg_permit_cost
FROM neighborhoods n
LEFT JOIN permits p
ON ST_Contains(n.geom, p.geom)
GROUP BY n.neighborhood_id, n.name, n.geom
""")
# Add derived metrics
con.execute("""
ALTER TABLE neighborhood_stats ADD COLUMN permits_per_sq_km DOUBLE;
UPDATE neighborhood_stats
SET permits_per_sq_km = CASE
WHEN area_sq_km > 0 THEN permit_count / area_sq_km
ELSE 0
END;
""")
# Report top neighborhoods by development activity
top_neighborhoods = con.execute("""
SELECT
neighborhood_name,
permit_count,
new_construction_count,
ROUND(permits_per_sq_km, 1) AS density,
ROUND(total_investment / 1e6, 2) AS investment_millions
FROM neighborhood_stats
WHERE permit_count > 0
ORDER BY permit_count DESC
LIMIT 10
""").fetchall()
print("\nTop neighborhoods by permit activity:")
for row in top_neighborhoods:
print(
f" {row[0]}: {row[1]} permits "
f"({row[2]} new construction), "
f"{row[3]} permits/km2, "
f"${row[4]}M investment"
)
# Export to GeoParquet
con.execute(f"""
COPY (
SELECT
neighborhood_id,
neighborhood_name,
permit_count,
new_construction_count,
renovation_count,
ROUND(area_sq_km, 4) AS area_sq_km,
ROUND(permits_per_sq_km, 2) AS permits_per_sq_km,
ROUND(total_investment, 2) AS total_investment,
ROUND(avg_permit_cost, 2) AS avg_permit_cost,
                geom AS geometry  -- keep GEOMETRY type so DuckDB writes GeoParquet metadata
FROM neighborhood_stats
ORDER BY permit_count DESC
)
TO '{output_path}'
(FORMAT PARQUET, COMPRESSION 'ZSTD')
""")
# Collect pipeline statistics
stats = {
"input_permits": permit_count,
"input_neighborhoods": neighborhood_count,
"geometries_repaired": invalid_repaired,
"output_rows": con.execute(
"SELECT COUNT(*) FROM neighborhood_stats"
).fetchone()[0],
"total_permits_matched": con.execute(
"SELECT SUM(permit_count) FROM neighborhood_stats"
).fetchone()[0],
"output_path": output_path,
}
con.close()
print(f"\nPipeline complete: {stats['output_rows']} neighborhoods exported")
print(f" {stats['total_permits_matched']}/{stats['input_permits']} "
f"permits matched to neighborhoods")
print(f" Output: {output_path}")
return stats
if __name__ == "__main__":
stats = build_spatial_pipeline(
input_dir="data/raw",
output_path="data/processed/neighborhood_development.parquet",
target_srid=4326,
)
Quick-Reference: Best Tool by Task
| Task | Best Tool | Why |
|---|---|---|
| PostGIS spatial queries | Claude Code | Understands geometry vs geography types, index usage patterns, SRID handling |
| GeoPandas analysis | Claude Code | Generates CRS-aware spatial joins, warns about geographic vs projected operations |
| Rasterio / GDAL processing | Claude Code | Correct windowed reads, nodata handling, COG output configuration |
| CRS transformations | Claude Code | Knows datum shifts, always_xy parameter, recommends appropriate projections for regions |
| MapLibre GL / Leaflet maps | Cursor Pro | Indexes style specs, autocompletes data-driven expressions and layer configs |
| Vector tile pipelines | Cursor Pro | tippecanoe flags, PMTiles configuration, layer optimization from project context |
| STAC API / remote sensing | Claude Code | Correct STAC query construction, band selection, cloud masking, spectral indices |
| AWS spatial infrastructure | Amazon Q | Athena GeoParquet queries, Location Service, Glue spatial ETL |
| Large GIS codebase navigation | Cursor Pro | Indexes QGIS plugins, GeoServer configs, multi-module geospatial projects |
| GDAL/OGR documentation | Gemini CLI | 1M context for GDAL driver docs, format specifications, and ogr2ogr flag combinations |
| DuckDB spatial analysis | Claude Code | Correct spatial SQL with GeoParquet I/O, ST_Transform pipelines, analytical queries |
What AI Tools Get Wrong About GIS
- Ignoring CRS entirely: The most common AI-generated GIS bug. Code that performs a spatial join between two GeoDataFrames without checking or aligning their CRSs. Code that calculates distances using Shapely’s Euclidean distance on WGS84 coordinates, producing results in meaningless “degree units.” Code that buffers geographic coordinates by a value in meters, producing wildly wrong geometry because one degree of latitude is approximately 111 km but one degree of longitude varies from 111 km at the equator to 0 at the poles.
- Loading entire rasters into memory: `data = rasterio.open("huge_raster.tif").read()` for a multi-gigabyte file. The correct pattern is windowed reads: `dataset.read(1, window=window)` to process blocks that fit in memory. For cloud-hosted data, this also means the difference between downloading the entire file and fetching only the needed tiles via HTTP range requests.
- Confusing latitude/longitude order: GeoJSON uses [longitude, latitude] per RFC 7946. Leaflet’s `L.latLng()` uses (latitude, longitude). PostGIS `ST_MakePoint()` takes (x, y), which is (longitude, latitude). WKT uses (longitude latitude). Google Maps API uses `{lat, lng}`. AI tools frequently mix these up, placing points on the wrong side of the planet. The `always_xy=True` parameter in pyproj exists precisely because of this confusion.
- Missing spatial indexes: Generating PostGIS queries with `ST_Intersects` or `ST_DWithin` but not creating or verifying GiST indexes on the geometry columns. On a table with 10 million rows, this turns a 50 ms query into a 45-minute sequential scan. AI tools generate the query but rarely generate the `CREATE INDEX ... USING GIST` statement or the `EXPLAIN ANALYZE` to verify index usage.
- Using ST_Distance without ST_DWithin: `WHERE ST_Distance(a.geom, b.geom) < 1000` cannot use a spatial index because it must compute the distance for every pair. `WHERE ST_DWithin(a.geom, b.geom, 1000)` uses the GiST index to filter by bounding box first. AI tools generate the former when they should always generate the latter.
- Web Mercator for analysis: Using EPSG:3857 (Web Mercator) for area or distance calculations. Web Mercator is a display projection designed for tiling web maps, not for spatial analysis. It distorts areas increasingly at higher latitudes: a polygon in Norway appears roughly 4x larger than its actual area. The correct approach: use an equal-area projection (Albers, Lambert Azimuthal) for area calculations, an equidistant projection for distance measurements, and Web Mercator only for rendering tiles.
- GeoJSON without right-hand rule: RFC 7946 requires exterior rings to be counter-clockwise and interior rings (holes) to be clockwise. Many rendering engines silently tolerate wrong winding order, but spherical renderers (notably d3-geo) will render the polygon as covering the entire earth except the intended area. AI tools generate GeoJSON without ensuring correct winding order, creating maps where continents disappear and oceans become solid fill.
- Ignoring nodata in raster analysis: Performing band math on satellite imagery without checking for nodata values. Cloud-covered pixels, sensor gaps, and quality-flagged pixels all have nodata values that must be masked before computation. NDVI calculated with nodata pixels produces extreme outlier values that corrupt statistics and classification results.
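The nodata pitfall in the last bullet is concrete enough to demonstrate with plain NumPy; the fill value of 0 below is an assumed convention for the sketch, while real code should read it from the dataset's metadata (`dataset.nodata` in rasterio):

```python
import numpy as np

nodata = 0.0  # assumed fill value for this sketch
red = np.array([[1200.0, 0.0], [1500.0, 1600.0]], dtype=np.float32)
nir = np.array([[3200.0, 0.0], [4200.0, 1700.0]], dtype=np.float32)

# Wrong: the nodata pixel silently joins the band math and skews statistics
ndvi_wrong = (nir - red) / (nir + red + 1e-9)

# Right: mask nodata first, compute only on valid pixels
valid = (red != nodata) & (nir != nodata)
safe_denom = np.where(valid, nir + red, 1.0)  # avoid divide-by-zero
ndvi = np.where(valid, (nir - red) / safe_denom, np.nan)

print("unmasked mean:", float(ndvi_wrong.mean()))
print("masked mean:  ", float(np.nanmean(ndvi)))
```

Even with a single bad pixel out of four, the unmasked mean is pulled noticeably toward zero; across a cloud-covered satellite scene the bias is far worse.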
Cost Model: What GIS Engineers Actually Pay
Scenario 1: Student / Learner — $0/month
- Copilot Free (2,000 completions/mo) for basic GeoPandas, QGIS plugin development, and PostGIS queries
- Plus Gemini CLI Free for discussing GDAL documentation, understanding projections, and learning spatial SQL patterns
- Combined with QGIS (free and open source), PostGIS, and the Python geospatial stack, this is sufficient for coursework and personal projects. Manually verify CRS handling and spatial index usage in all generated code.
Scenario 2: Solo GIS Analyst — $10/month
- Copilot Pro ($10/mo) for unlimited completions when writing GeoPandas analysis scripts, QGIS processing plugins, and ArcPy automation
- Good for daily GIS workflows: loading Shapefiles, performing spatial joins, generating maps, automating geoprocessing tasks. Copilot handles the Python boilerplate while you verify the spatial logic. Watch for CRS issues and missing index creation in PostGIS queries.
Scenario 3: GIS Developer — $20/month
- Cursor Pro ($20/mo) for working with large GIS codebases: QGIS plugins, GeoServer configurations, web mapping applications with MapLibre/Leaflet, and PostGIS migration files
- Or Claude Code ($20/mo) for spatial algorithm development, CRS transformation pipelines, raster processing workflows, and complex PostGIS query optimization
- Choose Cursor if your work centers on a large existing codebase with established patterns. Choose Claude Code if your work involves designing new spatial algorithms, optimizing complex queries, or working with unfamiliar CRS and projection issues.
Scenario 4: Full-stack Geo Developer — $40/month
- Claude Code ($20/mo) for spatial reasoning: CRS selection, PostGIS query optimization, raster pipeline design, STAC workflow construction
- Plus Cursor Pro ($20/mo) for codebase-indexed development: web mapping UI, API endpoints serving GeoJSON, vector tile pipeline configuration, multi-file refactoring
- The optimal combination for GIS engineers building full-stack geospatial applications. Claude Code for the domain-specific spatial logic, Cursor for the daily development velocity across frontend map components, backend spatial APIs, and database migration files.
Scenario 5: GIS Team Lead — $19–40/seat/month
- Copilot Business ($19/seat/mo) or Cursor Business ($40/seat/mo) for team-wide codebase indexing and consistent spatial coding patterns
- Cursor Business indexes the team’s shared PostGIS migration history, MapLibre style specifications, and GDAL processing scripts, ensuring consistent patterns across developers. Copilot Business provides organization-level policy controls and audit logging.
Scenario 6: Enterprise / Government — $39–99/seat/month
- Copilot Enterprise ($39/seat/mo) or Cursor Enterprise for organization-wide codebase indexing with compliance controls
- Plus Claude Code ($20/seat/mo) for specialized spatial analysis and architecture review
- Government GIS agencies and enterprise mapping companies need consistent spatial data quality standards across large teams. Enterprise tiers index the full proprietary codebase, enforce CRS conventions, and provide audit trails. Tabnine ($39/user/mo) offers self-hosted deployment for air-gapped environments handling classified geospatial data.
The GIS Engineer’s Verdict
The fundamental challenge for AI tools in GIS engineering is that spatial correctness is domain-dependent and often invisible. A function that returns a number is not obviously wrong — it takes domain knowledge to recognize that a buffer distance of 0.001 on WGS84 coordinates is approximately 111 meters at the equator but only about 2 meters at 89 degrees latitude, or that a PostGIS ST_Distance on geometry columns in EPSG:4326 returns results in degrees rather than meters. AI tools that lack this domain awareness generate code that compiles, runs without errors, and produces results that are silently, systematically wrong in ways that propagate through every downstream analysis.
The good news is that AI tools are genuinely useful for GIS work when paired with domain expertise. The geospatial Python ecosystem has a steep learning curve — remembering whether Rasterio returns bands-first or bands-last arrays, which GeoPandas methods trigger CRS warnings, how to construct ogr2ogr commands with the right combination of -t_srs, -s_srs, and -a_srs flags, which PostGIS functions support geography type and which only work with geometry. AI tools accelerate this boilerplate significantly. Claude Code generates correct PostGIS queries with proper SRID handling and index-aware patterns more reliably than any other tool. Cursor indexes your project’s existing spatial conventions and autocompletes consistently. The combination of Claude Code for spatial reasoning and Cursor for codebase-indexed development covers the full GIS development workflow.
The tool-specific recommendations: Claude Code is the best single tool for GIS engineers. It understands CRS implications (when to use geography vs geometry, which projection to choose for a study area, why always_xy=True matters in pyproj), generates correct PostGIS spatial queries with index usage, handles raster processing with proper nodata masking and windowed reads, and constructs STAC queries for remote sensing workflows. Cursor is the best for GIS application development — web mapping with MapLibre GL, vector tile pipelines, multi-file geospatial projects where understanding the existing codebase is more important than spatial algorithm design. Gemini CLI is valuable for GDAL documentation and format-specific questions — paste the entire GDAL driver documentation into its 1M context window for precise ogr2ogr flag recommendations. Amazon Q is worth considering if your spatial infrastructure is AWS-native (Athena with GeoParquet, Location Service, SageMaker for geospatial ML).
The bottom line: treat AI-generated geospatial code the same way you treat GIS data from an unknown source — validate the CRS, check the geometry validity, verify the spatial relationships make sense, and never assume the numbers are correct just because the code ran without errors. In GIS, silent errors are the default, not the exception. The right AI tools make you faster at generating correct spatial code; the wrong ones (or the right ones used without verification) make you faster at generating wrong spatial code that looks right.
Compare all tools and pricing on our main comparison table, or check the cheapest tools guide for budget options.
Related on CodeCosts
- AI Coding Tools for Data Engineers (2026) — ETL pipelines, data warehouses, dbt, Spark
- AI Coding Tools for Data Scientists (2026) — pandas, scikit-learn, Jupyter, statistical analysis
- AI Coding Tools for Backend Engineers (2026) — APIs, databases, server-side architecture
- AI Coding Tools for Cloud Architects (2026) — infrastructure as code, multi-cloud, cost optimization