GIS and geospatial engineering is software development where the earth's shape is a first-class concern in every computation. A web application stores user coordinates as two floating-point numbers and calls it a day. A geospatial application must know which coordinate reference system those numbers belong to — WGS84 (EPSG:4326) means latitude/longitude on the WGS 84 ellipsoid, NAD83 (EPSG:4269) looks identical (its GRS 1980 ellipsoid differs only marginally) but places coordinates up to two meters away in parts of North America due to tectonic plate motion, and if someone hands you coordinates in a projected CRS like UTM Zone 17N (EPSG:32617) without telling you, your "latitude" of 4,500,000 is actually a northing in meters; treat it as degrees and the point lands nowhere near Ontario. The difference between getting CRS handling right and getting it wrong is the difference between a pipeline that produces correct spatial analysis and one that silently places every feature 100 meters from where it actually is. No unit test catches this unless you already know the expected coordinates in the target CRS, which requires the domain knowledge you were hoping the tool would provide.
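One cheap defensive habit against the unlabeled-CRS bug described above: before treating a coordinate pair as longitude/latitude, check that it is even plausible as degrees. A minimal sketch (the function name and thresholds are ours, not from any library):

```python
def plausible_lonlat(x: float, y: float) -> bool:
    """Return True only if (x, y) could be a (longitude, latitude) pair in degrees.

    Projected coordinates (UTM eastings/northings, Web Mercator meters) are
    orders of magnitude outside the degree range, so this catches the classic
    "unlabeled CRS" mix-up early, before it corrupts downstream analysis.
    """
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0

# A Toronto-area point in WGS84 degrees passes...
assert plausible_lonlat(-79.38, 43.65)
# ...but roughly the same location expressed as UTM Zone 17N meters does not:
assert not plausible_lonlat(630084.0, 4833438.0)
```

This is only a sanity check, not CRS detection: it cannot distinguish WGS84 from NAD83, and a pair like (45.0, 45.0) is valid in either axis order.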
The data scales are unlike anything in conventional application development. A single Landsat 8 scene is roughly 1 GB of multispectral raster data covering 185 km by 180 km at 30-meter resolution. The Sentinel-2 archive grows by about 1.6 TB per day. The OpenStreetMap planet file exceeds 70 GB in PBF format, containing billions of nodes, ways, and relations. Processing these datasets requires understanding Cloud Optimized GeoTIFFs (COG) that allow partial HTTP range reads instead of downloading entire files, STAC (SpatioTemporal Asset Catalog) APIs that index petabytes of earth observation data, spatial indexing structures like R-trees and H3 hexagonal grids that make billion-row spatial queries feasible, and tile-based processing pipelines that break continent-scale analysis into manageable chunks. An AI tool that suggests loading a 50 GB raster into memory with rasterio.open(path).read() instead of using windowed reads has just crashed your analysis server.
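The memory argument above is easy to quantify with back-of-envelope arithmetic. A sketch (band dimensions are the published Sentinel-2 specs; the helper function is ours):

```python
def full_read_bytes(width: int, height: int, bands: int, dtype_bytes: int) -> int:
    """Bytes needed to hold an entire raster read in memory as one array."""
    return width * height * bands * dtype_bytes

# One Sentinel-2 10 m band: 10980 x 10980 uint16 pixels -> ~241 MB
one_band = full_read_bytes(10980, 10980, 1, 2)
assert 230_000_000 < one_band < 250_000_000

# A 256x256 window of the same band is ~131 KB: the case for windowed reads
window = full_read_bytes(256, 256, 1, 2)
assert window == 131_072
```

Multiply the full-band figure by 13 bands and a few hundred tiles and the "just read it all" approach exhausts any analysis server; windowed reads keep the working set constant regardless of scene count.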
The toolchain is uniquely broad: you write SQL with PostGIS spatial extensions one moment, Python with GeoPandas and Shapely the next, JavaScript with Leaflet or MapLibre GL for visualization, and shell scripts orchestrating GDAL command-line tools for format conversion and reprojection. You need to understand OGC standards (Simple Features, WMS, WFS, WMTS), GeoJSON RFC 7946 (which mandates WGS84 and right-hand-rule polygon winding), and the DE-9IM matrix that describes all possible topological relationships between two geometries. You deal with floating-point precision issues that manifest as self-intersecting polygons, slivers from overlay operations, and topology exceptions that GEOS throws when geometric predicates encounter numerical degeneracies. This guide evaluates every major AI coding tool through the lens of what GIS engineers actually build: spatial databases, vector analysis pipelines, raster processing workflows, coordinate transformations, web mapping applications, remote sensing analyses, and spatial ETL systems.
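The right-hand-rule winding that RFC 7946 mandates, mentioned above, is mechanical to verify: the shoelace formula gives a signed ring area that is positive for counter-clockwise rings. A pure-Python sketch (helper names are ours; production code would use `shapely` or a GeoJSON linter):

```python
def signed_area(ring: list[tuple[float, float]]) -> float:
    """Shoelace formula: positive for counter-clockwise (CCW) rings,
    negative for clockwise. Ring must be closed (first point == last)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += (x1 * y2) - (x2 * y1)
    return total / 2.0

def is_ccw(ring: list[tuple[float, float]]) -> bool:
    return signed_area(ring) > 0

# A CCW unit square satisfies the RFC 7946 exterior-ring rule...
ccw_square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
assert is_ccw(ccw_square)
# ...its reverse (clockwise) would need rewinding before GeoJSON export
assert not is_ccw(list(reversed(ccw_square)))
```

The same signed area, taken as an absolute value, is the planar polygon area, which is why winding checks come almost for free in any geometry pipeline.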
- Best free ($0): Gemini CLI Free — 1M context for GDAL documentation and spatial algorithm discussions.
- Best for spatial analysis ($20/mo): Claude Code — strongest reasoning for CRS transformations, spatial SQL, and algorithm correctness.
- Best for GIS codebases ($20/mo): Cursor Pro — indexes large geospatial projects, autocompletes GDAL/PostGIS patterns.
- Best combined ($40/mo): Claude Code + Cursor.
- Budget ($0): Copilot Free + Gemini CLI Free.
Why GIS/Geospatial Engineering Is Different
- Coordinate reference systems are invisible correctness requirements: Every geospatial dataset has a CRS, and every operation that combines two datasets must ensure their CRSs are compatible. WGS84 (EPSG:4326) and NAD83 (EPSG:4269) use nearly identical ellipsoid parameters but differ by up to two meters because WGS84 is earth-centered while NAD83 is fixed to the North American plate. The 2011 realization (NAD83(2011)) shifted coordinates by up to 20 centimeters relative to the original NAD83(1986). In Australia, GDA94 and GDA2020 differ by approximately 1.8 meters due to tectonic plate motion. Projections introduce their own distortions: Mercator preserves angles but grossly distorts areas at high latitudes (Greenland appears the size of Africa), Albers Equal Area preserves area but distorts shapes, and UTM zones only work within their 6-degree longitudinal band — a buffer operation that crosses a UTM zone boundary produces incorrect geometry. EPSG codes are the universal language: EPSG:4326 is geographic WGS84 (degrees), EPSG:3857 is Web Mercator (meters, used by every web map tile provider), EPSG:32632 is UTM Zone 32N. Mixing them up produces no error message — just wrong results. An AI tool that generates a spatial join without checking `gdf1.crs == gdf2.crs` first has introduced a silent data quality bug.
- Spatial indexing determines whether queries take milliseconds or hours: A spatial join between two tables of 10 million polygons each, without spatial indexing, requires 10^14 pairwise geometry comparisons — computationally impossible. With an R-tree index, the query engine uses bounding box intersection to filter candidates, reducing comparisons by orders of magnitude. PostGIS uses GiST (Generalized Search Tree) indexes that build R-trees over geometry bounding boxes; a query without
`CREATE INDEX idx ON table USING GIST(geom)` will sequential-scan the entire table for every `ST_Intersects` call. H3 (Uber's hexagonal hierarchical spatial index) partitions the earth into hexagonal cells at multiple resolutions, enabling approximate spatial joins through cell-ID matching instead of computational geometry. S2 (Google's spherical geometry library) uses Hilbert curves on the sphere for spatial indexing. Quadtrees, k-d trees, and space-filling curves each have different performance characteristics for different query patterns (point-in-polygon, range queries, nearest neighbor). Choosing the wrong indexing strategy — or forgetting to create an index entirely — is the single most common performance catastrophe in geospatial applications.
- Raster data operates at scales that break standard tooling: A single Landsat 8 Collection 2 Level-2 scene contains 11 spectral bands at 30-meter resolution (bands 1-7, 9) and 100-meter resolution (bands 10-11), plus a panchromatic band at 15 meters. The GeoTIFF for one band is roughly 60-120 MB; the full scene exceeds 1 GB. Sentinel-2 Level-2A provides 13 spectral bands at 10, 20, and 60-meter resolution, with a single tile (110 km x 110 km) approaching 800 MB. National-scale analysis requires hundreds or thousands of these scenes. Cloud Optimized GeoTIFF (COG) format stores tiles and overviews that allow HTTP range reads — you can read a 256x256 pixel window from a 100 GB raster without downloading the whole file. STAC (SpatioTemporal Asset Catalog) provides a standardized API and JSON specification for searching and accessing these datasets. Tools like Rasterio, xarray with rioxarray, and GDAL's virtual filesystem (`/vsicurl/`, `/vsis3/`) enable cloud-native workflows. An AI tool that suggests `np.array(rasterio.open(path).read())` for a multi-gigabyte raster has just killed your process with an out-of-memory error.
- Topology and geometric validity have strict mathematical rules: The OGC Simple Features specification defines what constitutes a valid geometry. A polygon should have its exterior ring oriented counter-clockwise and interior rings (holes) clockwise — the right-hand rule mandated by GeoJSON RFC 7946 (the Shapefile format winds rings in the opposite direction). A polygon must not self-intersect. A multipolygon must not have overlapping components. The DE-9IM (Dimensionally Extended 9-Intersection Model) matrix describes all possible topological relationships between two geometries using a 3x3 matrix of Interior, Boundary, and Exterior intersections.
`ST_Intersects`, `ST_Contains`, `ST_Touches`, `ST_Crosses`, and `ST_Within` are all specific DE-9IM patterns. Floating-point arithmetic in GEOS (the C++ geometry engine underlying PostGIS, Shapely, and most geospatial tools) produces numerical robustness issues: buffer operations create self-intersecting polygons, intersection operations produce slivers with near-zero area, and union operations fail with `TopologyException` when geometric predicates disagree due to floating-point rounding. The standard fix — `geometry.buffer(0)` to repair invalid geometries — is a hack that silently modifies your data. Understanding why these issues occur and how to handle them properly (snap-rounding, precision models, validation-before-operation) is core GIS engineering knowledge that AI tools rarely get right.
- The toolchain spans databases, visualization, and domain science simultaneously: A typical GIS project might use PostGIS for spatial data storage and querying, Python with GeoPandas for analysis, GDAL/OGR for format conversion, Rasterio for raster processing, pyproj for coordinate transformations, Shapely for computational geometry, Leaflet or MapLibre GL JS for web visualization, tippecanoe for vector tile generation, and domain-specific libraries like rioxarray for remote sensing or OSMnx for street network analysis. Each tool has its own conventions: PostGIS uses EWKB (Extended Well-Known Binary) internally but accepts WKT and GeoJSON as input; Shapely operates in a Cartesian plane with no CRS awareness; GeoPandas adds CRS tracking on top of Shapely geometries; GeoJSON mandates WGS84 coordinates in longitude-latitude order (the opposite of the latitude-longitude convention used by most mapping APIs). An AI tool must understand not just each library in isolation, but how they interact — that Shapely geometries lose CRS information, that PostGIS `geography` type uses great-circle calculations while `geometry` type uses Cartesian, that GeoJSON coordinates are [longitude, latitude] while most human-readable formats use [latitude, longitude].
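The R-tree intuition from the list above can be shown without any database: count candidate pairs before and after a bounding-box prefilter. A toy sketch in pure Python (brute-force boxes stand in for a real index; all names and the random-triangle setup are ours):

```python
import random

random.seed(42)

def bbox(pts):
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_intersect(a, b):
    # The same cheap test PostGIS's && operator answers from the GiST index
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def random_triangle():
    # Small triangle (~1 unit across) at a random spot in a 100x100 area
    cx, cy = random.uniform(0, 100), random.uniform(0, 100)
    return [(cx + random.uniform(-0.5, 0.5), cy + random.uniform(-0.5, 0.5))
            for _ in range(3)]

left = [bbox(random_triangle()) for _ in range(500)]
right = [bbox(random_triangle()) for _ in range(500)]

all_pairs = len(left) * len(right)  # 250,000 pairwise comparisons
candidates = sum(1 for a in left for b in right if bboxes_intersect(a, b))

# The prefilter discards the overwhelming majority of pairs; only the
# survivors would need an exact (expensive) geometry intersection test.
assert all_pairs == 250_000
assert candidates < all_pairs // 100
```

A real R-tree goes further by avoiding the brute-force box scan entirely (it walks a tree of nested boxes), but the candidate-reduction ratio is the same idea.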
GIS Task Support Matrix
| Task | Copilot | Cursor | Windsurf | Claude Code | Amazon Q | Gemini CLI |
|---|---|---|---|---|---|---|
| Spatial SQL & PostGIS | Good | Strong | Good | Excellent | Good | Strong |
| Vector Analysis (Shapely/GEOS/JTS) | Good | Strong | Fair | Excellent | Fair | Good |
| Raster Processing (GDAL/Rasterio) | Fair | Good | Fair | Strong | Fair | Good |
| CRS & Projections | Weak | Fair | Weak | Strong | Weak | Good |
| Web Mapping (Leaflet/MapLibre/Deck.gl) | Strong | Excellent | Good | Strong | Good | Good |
| Remote Sensing & Earth Observation | Weak | Fair | Weak | Strong | Fair | Good |
| Spatial Data Pipelines | Good | Strong | Fair | Strong | Strong | Good |
1. Spatial SQL & PostGIS
PostGIS transforms PostgreSQL into the most capable open-source spatial database on the planet. It implements the OGC Simple Features specification and extends it with hundreds of spatial functions, geography types for geodetic calculations, raster support, topology management, and 3D geometry operations. The difference between writing competent PostGIS queries and naive spatial SQL is the difference between a query that returns in 50 milliseconds and one that takes 45 minutes — or returns subtly wrong results because you used the geometry type with degree-based coordinates and got Cartesian distances in meaningless “degree units” instead of meters.
The critical distinctions that AI tools must understand: geometry vs geography types. The geometry type performs Cartesian calculations on a flat plane — perfect for projected coordinates (UTM, State Plane) but wrong for unprojected WGS84 coordinates where one degree of longitude varies from 111 km at the equator to 0 km at the poles. The geography type performs geodetic calculations on the ellipsoid — correct for WGS84 but slower and supporting fewer functions. ST_DWithin(geom_a, geom_b, 1000) on geometry columns with EPSG:4326 data finds features within 1000 degrees (which is everything on Earth); the same query on geography columns correctly finds features within 1000 meters. Index usage is equally critical: ST_Intersects automatically uses GiST indexes through the && bounding box operator, but ST_Distance does not — you must pair it with ST_DWithin for index-accelerated proximity queries. ST_Buffer on geographic coordinates in PostGIS 3.x correctly handles geodetic buffering, but older versions silently produced wrong results.
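The geometry-vs-geography trap above is easy to demonstrate numerically: Cartesian math treats a degree as a fixed unit, while real ground distance per degree of longitude shrinks with latitude. A pure-Python sketch, assuming a spherical earth of radius 6371 km (PostGIS's geography type uses the ellipsoid, so its numbers differ slightly):

```python
import math

def naive_degree_distance(lon1, lat1, lon2, lat2):
    """What geometry-type math does on EPSG:4326: Euclidean distance in 'degrees'."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

def haversine_km(lon1, lat1, lon2, lat2, r_km=6371.0):
    """Great-circle distance, the kind of geodetic answer geography gives."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r_km * math.asin(math.sqrt(a))

# One degree of longitude is the same "distance" in degree units everywhere...
assert naive_degree_distance(0, 0, 1, 0) == naive_degree_distance(0, 60, 1, 60) == 1.0
# ...but on the ground it is ~111 km at the equator and ~56 km at 60N
assert 111.0 < haversine_km(0, 0, 1, 0) < 111.4
assert 55.0 < haversine_km(0, 60, 1, 60) < 56.0
```

This is exactly why `ST_DWithin` with a meter threshold only makes sense on geography (or on a projected geometry column): the degree-space answer has no stable relationship to meters.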
The best AI tools for PostGIS understand these nuances. Claude Code consistently generates queries that use geography type for WGS84 distance calculations, pair ST_Distance with ST_DWithin for index usage, and include ST_SetSRID and ST_Transform when mixing CRSs. Cursor indexes your existing PostGIS migration files and autocompletes function signatures matching your project’s patterns. Copilot handles basic spatial SQL but frequently generates ST_Distance without index-accelerating predicates and uses geometry where geography is needed.
"""
PostGIS spatial query with proper SRID handling and index-aware patterns.
Finds all parcels within 500 meters of a flood zone boundary,
using geography type for correct geodetic distance on WGS84 data.
"""
import psycopg2
from psycopg2.extras import RealDictCursor
def find_parcels_near_flood_zones(
conn_params: dict,
buffer_meters: float = 500.0,
flood_zone_code: str = "AE",
) -> list[dict]:
"""Find parcels within buffer_meters of a specific FEMA flood zone.
Uses ST_DWithin on geography columns for:
1. Correct geodetic distance (meters on the ellipsoid)
2. Automatic GiST index usage (ST_DWithin triggers && operator)
"""
query = """
WITH flood_boundaries AS (
SELECT
fld_zone,
-- Cast to geography for geodetic operations
geom::geography AS geog
FROM fema_flood_zones
WHERE fld_zone = %(flood_zone_code)s
)
SELECT
p.parcel_id,
p.owner_name,
p.address,
ST_Area(p.geom::geography) AS area_sq_meters,
-- Minimum distance to any flood zone boundary (meters)
MIN(ST_Distance(
p.geom::geography,
fb.geog
)) AS dist_to_flood_zone_m,
-- Centroid for mapping (GeoJSON expects lon/lat WGS84)
ST_AsGeoJSON(ST_Centroid(p.geom))::json AS centroid_geojson
FROM parcels p
INNER JOIN flood_boundaries fb
-- ST_DWithin on geography uses meters and triggers GiST index
ON ST_DWithin(
p.geom::geography,
fb.geog,
%(buffer_meters)s
)
WHERE p.geom IS NOT NULL
AND ST_IsValid(p.geom) -- Skip invalid geometries
GROUP BY p.parcel_id, p.owner_name, p.address, p.geom
ORDER BY dist_to_flood_zone_m ASC;
"""
with psycopg2.connect(**conn_params) as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute(query, {
"flood_zone_code": flood_zone_code,
"buffer_meters": buffer_meters,
})
return cur.fetchall()
def ensure_spatial_indexes(conn_params: dict) -> None:
"""Create spatial indexes if they don't exist.
GiST indexes on geometry columns are essential for ST_DWithin,
ST_Intersects, and all spatial join performance.
"""
indexes = [
"""CREATE INDEX IF NOT EXISTS idx_parcels_geom
ON parcels USING GIST (geom)""",
"""CREATE INDEX IF NOT EXISTS idx_parcels_geom_geog
ON parcels USING GIST ((geom::geography))""",
"""CREATE INDEX IF NOT EXISTS idx_flood_zones_geom
ON fema_flood_zones USING GIST (geom)""",
"""CREATE INDEX IF NOT EXISTS idx_flood_zones_geom_geog
ON fema_flood_zones USING GIST ((geom::geography))""",
]
with psycopg2.connect(**conn_params) as conn:
with conn.cursor() as cur:
for idx_sql in indexes:
cur.execute(idx_sql)
conn.commit()
if __name__ == "__main__":
conn_params = {
"host": "localhost",
"port": 5432,
"dbname": "gis_db",
"user": "gis_user",
"password": "changeme", # Use environment variable in production
}
ensure_spatial_indexes(conn_params)
results = find_parcels_near_flood_zones(
conn_params,
buffer_meters=500.0,
flood_zone_code="AE",
)
print(f"Found {len(results)} parcels within 500m of AE flood zones")
for parcel in results[:5]:
print(
f" {parcel['parcel_id']}: {parcel['address']} "
f"({parcel['dist_to_flood_zone_m']:.1f}m from flood zone)"
)
2. Vector Analysis with Python
Vector analysis in the Python geospatial ecosystem centers on three tightly coupled libraries: Shapely for computational geometry (points, lines, polygons, and operations like buffer, intersection, union), GeoPandas for geospatial DataFrames (Shapely geometries with attribute data and CRS awareness), and Fiona (or the newer pyogrio) for reading and writing geospatial file formats. Under the hood, Shapely 2.x wraps GEOS (the C++ Geometry Engine Open Source), which is the same engine PostGIS uses — so the geometric operations are mathematically identical between your Python analysis and your PostGIS queries.
The critical pitfall in vector analysis is CRS management. Shapely operates in a Cartesian plane with no CRS awareness whatsoever: Point(0, 0).buffer(1) creates a circle of radius 1 in whatever units the coordinates happen to be in. If your coordinates are in WGS84 degrees, that “buffer of 1” is roughly 111 km at the equator and 0 km at the poles. GeoPandas adds CRS tracking through the .crs attribute and the .to_crs() method, but it does not prevent you from performing operations on GeoDataFrames with mismatched CRSs — it just warns you. A spatial join between a GeoDataFrame in EPSG:4326 and another in EPSG:32617 will silently produce wrong results unless you reproject first. The correct workflow: always reproject to a projected CRS appropriate for your study area before performing metric operations (buffering, area calculation, distance measurement), and reproject back to WGS84 for storage and display.
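The predicate at the heart of a point-in-polygon spatial join like the one discussed here reduces to ray casting in the same Cartesian plane Shapely works in. A minimal sketch (helper is ours, holes ignored; real code should rely on Shapely/GEOS, which also handles boundary and precision cases):

```python
def point_in_polygon(x: float, y: float, ring: list[tuple[float, float]]) -> bool:
    """Ray casting: cast a ray from (x, y) toward +x and count edge crossings.
    An odd count means the point is inside. Ring must be closed
    (first point == last); interior rings (holes) are not handled."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        crosses = (y1 > y) != (y2 > y)
        if crosses and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
assert point_in_polygon(5, 5, square)
assert not point_in_polygon(15, 5, square)
```

The units caveat from the paragraph above applies here too: the test is purely planar, so both the point and the ring must already be in the same CRS before it means anything.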
AI tools vary significantly in their handling of these patterns. Claude Code generates spatial join code that checks CRS compatibility and reprojects before operations. Cursor autocompletes GeoPandas method chains based on your existing codebase patterns. Copilot generates syntactically correct GeoPandas code but frequently omits CRS checks and produces buffer operations on geographic coordinates.
"""
GeoPandas spatial join with CRS transformation and proper metric operations.
Identifies which census tracts each school is located in, and calculates
the area of each tract and the distance from each school to its tract centroid.
"""
import geopandas as gpd
import pandas as pd
def load_and_validate(path: str, expected_geom_type: str | None = None) -> gpd.GeoDataFrame:
"""Load a geospatial file and validate geometry."""
gdf = gpd.read_file(path)
# Remove null geometries
null_count = gdf.geometry.isna().sum()
if null_count > 0:
print(f" Warning: dropping {null_count} null geometries from {path}")
gdf = gdf.dropna(subset=["geometry"])
# Repair invalid geometries with buffer(0)
invalid_mask = ~gdf.geometry.is_valid
invalid_count = invalid_mask.sum()
if invalid_count > 0:
print(f" Warning: repairing {invalid_count} invalid geometries in {path}")
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask].geometry.buffer(0)
if expected_geom_type:
actual_types = gdf.geometry.geom_type.unique()
print(f" Geometry types in {path}: {actual_types}")
print(f" Loaded {len(gdf)} features, CRS: {gdf.crs}")
return gdf
def schools_by_census_tract(
schools_path: str,
tracts_path: str,
target_epsg: int = 2163, # US National Atlas Equal Area
) -> gpd.GeoDataFrame:
"""Spatial join: assign each school to its census tract.
Workflow:
1. Load and validate both datasets
2. Reproject to a projected CRS for accurate metric operations
3. Calculate tract areas in square kilometers
4. Perform point-in-polygon spatial join
5. Calculate distance from each school to its tract centroid
"""
# Load data
schools = load_and_validate(schools_path, expected_geom_type="Point")
tracts = load_and_validate(tracts_path, expected_geom_type="Polygon")
# Reproject both to the same projected CRS for metric operations
# EPSG:2163 (US National Atlas Equal Area) preserves area across CONUS
schools_proj = schools.to_crs(epsg=target_epsg)
tracts_proj = tracts.to_crs(epsg=target_epsg)
# Calculate tract areas in projected CRS (square meters -> sq km)
tracts_proj["area_sq_km"] = tracts_proj.geometry.area / 1e6
# Calculate tract centroids in projected CRS (accurate for area calc)
tracts_proj["tract_centroid"] = tracts_proj.geometry.centroid
# Spatial join: point-in-polygon
# predicate="within" assigns each school to the tract it falls inside
joined = gpd.sjoin(
schools_proj,
tracts_proj[["GEOID", "NAME", "area_sq_km", "tract_centroid", "geometry"]],
how="left",
predicate="within",
)
# Calculate distance from each school to its tract centroid (meters)
joined["dist_to_centroid_m"] = joined.apply(
lambda row: row.geometry.distance(row["tract_centroid"])
if pd.notna(row.get("tract_centroid")) else None,
axis=1,
)
# Drop helper column and reproject back to WGS84 for output
joined = joined.drop(columns=["tract_centroid", "index_right"])
joined = joined.to_crs(epsg=4326)
# Report unmatched schools (outside any tract)
unmatched = joined["GEOID"].isna().sum()
if unmatched > 0:
print(f" Warning: {unmatched} schools did not fall within any census tract")
return joined
if __name__ == "__main__":
result = schools_by_census_tract(
schools_path="data/schools.geojson",
tracts_path="data/census_tracts.shp",
)
print(f"\nResults: {len(result)} schools matched to tracts")
print(result[["SCHOOL_NAME", "GEOID", "NAME", "area_sq_km", "dist_to_centroid_m"]].head(10))
# Export to GeoJSON (mandates WGS84 per RFC 7946)
result.to_file("output/schools_by_tract.geojson", driver="GeoJSON")
3. Raster Processing
Raster processing in the geospatial world means working with gridded data: satellite imagery, digital elevation models, land cover classifications, temperature grids, and any other dataset represented as a matrix of cell values georeferenced to the earth’s surface. The primary Python libraries are Rasterio (Pythonic API over GDAL for reading/writing raster data) and GDAL itself (the Swiss Army knife of geospatial format conversion and processing). The challenge is scale: a single Sentinel-2 tile at 10-meter resolution is 10,980 x 10,980 pixels per band, and a continental NDVI time series might span thousands of such tiles across dozens of dates.
Cloud Optimized GeoTIFF (COG) has transformed raster workflows. A COG stores data in tiled (typically 256x256 or 512x512 pixel) blocks with internal overviews, enabling HTTP range reads. Instead of downloading a 2 GB file to read a 256x256 pixel window, you fetch only the relevant tiles — perhaps 100 KB of data. Rasterio supports this natively through GDAL’s virtual filesystem: rasterio.open("https://example.com/image.tif") works transparently with COGs over HTTP. The critical performance pattern is windowed reads: dataset.read(1, window=window) reads only the pixels you need. AI tools that generate dataset.read() without a window argument for large rasters will attempt to load the entire raster into memory.
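The COG economics above can be made concrete with tile arithmetic: given the internal tile size, compute how many tiles a requested window touches and roughly how much data a range-read client fetches versus a full download. A sketch (uncompressed sizes; the helper and the 2 GB comparison figure are our assumptions):

```python
def tiles_touched(col_off: int, row_off: int, width: int, height: int,
                  tile: int = 256) -> int:
    """Number of internal COG tiles a read window overlaps."""
    cols = (col_off + width - 1) // tile - col_off // tile + 1
    rows = (row_off + height - 1) // tile - row_off // tile + 1
    return cols * rows

# A 256x256 window aligned to the tile grid touches exactly one tile;
# shifted by half a tile, it straddles four.
assert tiles_touched(0, 0, 256, 256) == 1
assert tiles_touched(128, 128, 256, 256) == 4

# Four tiles of uint16 data is ~0.5 MB fetched, versus gigabytes for the file
bytes_fetched = 4 * 256 * 256 * 2
assert bytes_fetched == 524_288
```

Compression and the COG header read add overhead in practice, but the ratio stays dramatic: the bytes transferred scale with the window, not with the file.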
Nodata handling is the other major correctness issue. Satellite imagery has nodata values for pixels obscured by clouds, outside the sensor footprint, or flagged during quality assessment. NDVI calculated without masking nodata pixels will produce meaningless values (often extremely negative or positive) at cloud locations, corrupting your analysis. The correct pattern: read the nodata value from the dataset metadata, create a boolean mask, and apply it to all band math operations using NumPy masked arrays or explicit conditional logic.
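The masking pattern can be sketched on a toy array before scaling it up to full scenes. The pixel values below are invented; a real scene carries its nodata value in the dataset metadata (for Landsat Collection 2 surface reflectance the fill value is 0):

```python
import numpy as np

NODATA = 0  # fill value, as it would be read from dataset metadata

red = np.array([[1500, 3000, NODATA],
                [2000,  500,  2500]], dtype=np.uint16)
nir = np.array([[4500, 3000, 4000],
                [NODATA, 4500, 2500]], dtype=np.uint16)

# A pixel is valid only if BOTH bands have data there
valid = (red != NODATA) & (nir != NODATA)

red_f = red.astype(np.float32)
nir_f = nir.astype(np.float32)

ndvi = np.full(red.shape, -9999.0, dtype=np.float32)  # output fill value
ndvi[valid] = (nir_f[valid] - red_f[valid]) / (nir_f[valid] + red_f[valid])

# Pixels where either band is nodata keep the fill value...
assert ndvi[0, 2] == -9999.0 and ndvi[1, 0] == -9999.0
# ...valid pixels land in [-1, 1]: (4500 - 1500) / (4500 + 1500) = 0.5
assert abs(ndvi[0, 0] - 0.5) < 1e-6
assert abs(ndvi[1, 1] - 0.8) < 1e-6
```

Skip the `valid` mask and the nodata zeros flow straight into the band math, yielding NDVI values of 1.0 or -1.0 at cloud and edge pixels that look plausible enough to poison downstream statistics.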
"""
NDVI calculation from Landsat 8 with proper nodata handling,
windowed reads for memory efficiency, and COG output.
"""
import numpy as np
import rasterio
from rasterio.windows import Window
from rasterio.enums import Resampling
def calculate_ndvi_windowed(
red_path: str,
nir_path: str,
output_path: str,
block_size: int = 512,
) -> dict:
"""Calculate NDVI from Landsat 8 red (Band 4) and NIR (Band 5).
NDVI = (NIR - Red) / (NIR + Red)
Uses windowed reads to process arbitrarily large rasters
without loading them entirely into memory.
Args:
red_path: Path to Band 4 (red) GeoTIFF
nir_path: Path to Band 5 (NIR) GeoTIFF
output_path: Path for output NDVI GeoTIFF (COG format)
block_size: Processing window size in pixels
Returns:
Dictionary with statistics (min, max, mean NDVI, valid pixel count)
"""
with rasterio.open(red_path) as red_ds, rasterio.open(nir_path) as nir_ds:
# Validate that both bands have the same dimensions and CRS
assert red_ds.crs == nir_ds.crs, (
f"CRS mismatch: red={red_ds.crs}, nir={nir_ds.crs}"
)
assert red_ds.shape == nir_ds.shape, (
f"Shape mismatch: red={red_ds.shape}, nir={nir_ds.shape}"
)
assert red_ds.transform == nir_ds.transform, (
"Transform mismatch between red and NIR bands"
)
red_nodata = red_ds.nodata
nir_nodata = nir_ds.nodata
height, width = red_ds.shape
# Output profile: COG-compatible GeoTIFF
profile = red_ds.profile.copy()
profile.update(
dtype=rasterio.float32,
count=1,
nodata=-9999.0,
compress="deflate",
tiled=True,
blockxsize=256,
blockysize=256,
)
# Statistics accumulators
ndvi_sum = 0.0
ndvi_min = float("inf")
ndvi_max = float("-inf")
valid_count = 0
with rasterio.open(output_path, "w", **profile) as dst:
# Process in windows to control memory usage
for row_off in range(0, height, block_size):
for col_off in range(0, width, block_size):
# Clip window to raster extent
win_height = min(block_size, height - row_off)
win_width = min(block_size, width - col_off)
window = Window(col_off, row_off, win_width, win_height)
# Read bands as float32 for division
red = red_ds.read(1, window=window).astype(np.float32)
nir = nir_ds.read(1, window=window).astype(np.float32)
# Build nodata mask
valid_mask = np.ones(red.shape, dtype=bool)
if red_nodata is not None:
valid_mask &= (red != red_nodata)
if nir_nodata is not None:
valid_mask &= (nir != nir_nodata)
# Avoid division by zero where red + nir == 0
denominator = nir + red
valid_mask &= (denominator != 0)
# Calculate NDVI only for valid pixels
ndvi = np.full(red.shape, -9999.0, dtype=np.float32)
ndvi[valid_mask] = (
(nir[valid_mask] - red[valid_mask]) /
denominator[valid_mask]
)
# Clamp NDVI to valid range [-1, 1]
valid_ndvi = ndvi[valid_mask]
ndvi[valid_mask] = np.clip(valid_ndvi, -1.0, 1.0)
# Update statistics
if valid_ndvi.size > 0:
ndvi_sum += np.sum(ndvi[valid_mask])
ndvi_min = min(ndvi_min, float(np.min(ndvi[valid_mask])))
ndvi_max = max(ndvi_max, float(np.max(ndvi[valid_mask])))
valid_count += int(np.sum(valid_mask))
# Write window to output
dst.write(ndvi, 1, window=window)
# Build overviews for COG compatibility
with rasterio.open(output_path, "r+") as dst:
overview_levels = [2, 4, 8, 16]
dst.build_overviews(overview_levels, Resampling.nearest)
dst.update_tags(ns="rio_overview", resampling="nearest")
stats = {
"valid_pixels": valid_count,
"total_pixels": height * width,
"coverage_pct": round(100 * valid_count / (height * width), 2),
"ndvi_min": round(ndvi_min, 4) if valid_count > 0 else None,
"ndvi_max": round(ndvi_max, 4) if valid_count > 0 else None,
"ndvi_mean": round(ndvi_sum / valid_count, 4) if valid_count > 0 else None,
}
return stats
if __name__ == "__main__":
# Landsat 8 Collection 2 Level-2 surface reflectance bands
stats = calculate_ndvi_windowed(
red_path="LC08_L2SP_015033_20240601_20240605_02_T1_SR_B4.TIF",
nir_path="LC08_L2SP_015033_20240601_20240605_02_T1_SR_B5.TIF",
output_path="output/ndvi_20240601.tif",
block_size=512,
)
print("NDVI Statistics:")
for key, value in stats.items():
print(f" {key}: {value}")
4. Coordinate Reference Systems
Coordinate reference systems are the single most error-prone aspect of geospatial engineering, and the area where AI tools cause the most silent data quality issues. Every coordinate pair is meaningless without its CRS. The number (-77.0369, 38.9072) could be WGS84 longitude/latitude (placing a point in Washington, D.C.), or it could be coordinates in some projected CRS that places it somewhere entirely different. The problem is compounded by the fact that CRS information is often implicit: GeoJSON mandates WGS84 per RFC 7946, but Shapefiles carry CRS in a separate .prj file that is frequently missing or incorrect, and CSV files with lat/lon columns have no CRS metadata at all.
The pyproj library is the Python interface to PROJ, the foundational coordinate transformation engine used by GDAL, PostGIS, QGIS, and virtually every other geospatial tool. The modern API uses the Transformer class, which is thread-safe and supports datum transformations, pipeline definitions, and authority-based CRS lookups. The critical pitfalls: (1) always use Transformer.from_crs() with always_xy=True to ensure longitude-first axis order regardless of the CRS definition (WGS84’s official axis order is latitude-first, but most software expects longitude-first); (2) datum transformations between NAD27 and WGS84 can shift coordinates by hundreds of meters and require grid-based corrections (NADCON) that are not applied unless you explicitly use the correct transformation pipeline; (3) the PROJ database includes transformation accuracy metadata — always check transformer.accuracy to know the expected error; (4) when converting between geographic and projected CRS, understand what your projection preserves (area, shape, distance, direction) and what it distorts.
AI tools are particularly weak here because CRS issues are fundamentally about domain knowledge, not syntax. Every tool can generate gdf.to_crs(epsg=4326) syntactically. The question is whether the tool understands when to reproject, which CRS to choose for a given operation, and what the implications are. Claude Code is the strongest: it recommends appropriate projected CRSs for specific regions (UTM zones for local accuracy, Albers Equal Area for continental area calculations, Web Mercator only for display), warns about datum shifts, and explains why always_xy=True matters.
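The "Web Mercator only for display" advice has a one-line mathematical basis: spherical Mercator's linear scale factor is 1/cos(latitude), so areas are inflated by 1/cos²(latitude). A sketch of the forward projection on the spherical model EPSG:3857 assumes (helper names are ours):

```python
import math

R = 6378137.0  # WGS84 semi-major axis; EPSG:3857 uses it as a sphere radius

def web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Forward spherical Mercator: degrees -> meters on the EPSG:3857 grid."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def area_inflation(lat_deg: float) -> float:
    """Factor by which Mercator exaggerates area at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# The projected world is a square: x and y both top out near 20,037,508 m,
# which is why the map is clipped at latitude ~85.0511
x, _ = web_mercator(180.0, 0.0)
_, y = web_mercator(0.0, 85.05112878)
assert abs(x - 20037508.34) < 1.0
assert abs(y - 20037508.34) < 1.0

# At Greenland's latitudes (~70N), every square kilometer is drawn ~8.5x too large
assert 8.0 < area_inflation(70.0) < 9.0
```

Measure area or distance on these coordinates and the inflation factor flows straight into your numbers, which is why metric work belongs in an equal-area or local projected CRS and EPSG:3857 belongs only in the tile renderer.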
"""
Coordinate transformation with pyproj: proper datum handling,
pipeline selection, and accuracy validation.
"""
from pyproj import Transformer, CRS
def transform_coordinates(
coords: list[tuple[float, float]],
source_crs: str | int,
target_crs: str | int,
always_xy: bool = True,
) -> list[tuple[float, float]]:
"""Transform coordinates between CRSs with accuracy reporting.
Args:
coords: List of (x, y) coordinate pairs
source_crs: Source CRS as EPSG code (int) or WKT/PROJ string
target_crs: Target CRS as EPSG code (int) or WKT/PROJ string
always_xy: If True, coordinates are always (x/lon, y/lat) order
Returns:
List of transformed (x, y) coordinate pairs
"""
src = CRS.from_user_input(source_crs)
tgt = CRS.from_user_input(target_crs)
print(f"Source CRS: {src.name} (EPSG:{src.to_epsg()})")
print(f" Type: {src.type_name}")
print(f" Datum: {src.datum.name}")
print(f" Ellipsoid: {src.ellipsoid.name}")
print(f"Target CRS: {tgt.name} (EPSG:{tgt.to_epsg()})")
print(f" Type: {tgt.type_name}")
print(f" Datum: {tgt.datum.name}")
# Create transformer - pyproj automatically selects the best
# available transformation pipeline based on area of use and accuracy
transformer = Transformer.from_crs(
src,
tgt,
always_xy=always_xy, # Critical: ensures lon/x first, lat/y second
)
# Report the transformation pipeline and accuracy
print(f"\nTransformation pipeline:")
print(f" {transformer.description}")
if transformer.accuracy is not None and transformer.accuracy >= 0:
print(f" Estimated accuracy: {transformer.accuracy} meters")
else:
print(" Accuracy: unknown (no grid files or ballpark only)")
# Transform all coordinates
xs, ys = zip(*coords)
tx, ty = transformer.transform(xs, ys)
# Handle both scalar and array returns
if isinstance(tx, float):
return [(tx, ty)]
return list(zip(tx, ty))
def find_utm_zone(longitude: float, latitude: float) -> int:
"""Determine the correct UTM EPSG code for a given lon/lat.
UTM zones are 6 degrees wide, numbered 1-60 from 180W to 180E.
North zones use EPSG 326xx, South zones use EPSG 327xx.
"""
zone_number = int((longitude + 180) / 6) + 1
# Handle edge cases at zone boundaries
zone_number = max(1, min(60, zone_number))
if latitude >= 0:
epsg = 32600 + zone_number # North
else:
epsg = 32700 + zone_number # South
print(f"UTM Zone for ({longitude}, {latitude}): Zone {zone_number}"
f"{'N' if latitude >= 0 else 'S'} (EPSG:{epsg})")
return epsg
def reproject_for_analysis(
longitude: float,
latitude: float,
buffer_meters: float = 1000.0,
) -> dict:
"""Demonstrate the correct workflow for metric spatial operations.
1. Start with WGS84 coordinates
2. Determine appropriate projected CRS
3. Transform to projected CRS for metric operations
4. Perform buffer/distance/area calculations in meters
5. Transform results back to WGS84 for display/storage
"""
    from shapely.geometry import Point
    from shapely.ops import transform as shapely_transform
# Step 1: Find the appropriate UTM zone for this location
utm_epsg = find_utm_zone(longitude, latitude)
# Step 2: Create transformer to UTM
wgs84_to_utm = Transformer.from_crs(
"EPSG:4326", f"EPSG:{utm_epsg}", always_xy=True
)
utm_to_wgs84 = Transformer.from_crs(
f"EPSG:{utm_epsg}", "EPSG:4326", always_xy=True
)
# Step 3: Transform point to UTM (meters)
utm_x, utm_y = wgs84_to_utm.transform(longitude, latitude)
print(f"\nWGS84: ({longitude}, {latitude})")
print(f"UTM: ({utm_x:.2f}, {utm_y:.2f}) meters")
# Step 4: Create buffer in projected CRS (correct metric buffer)
point_utm = Point(utm_x, utm_y)
buffer_utm = point_utm.buffer(buffer_meters)
print(f"Buffer area: {buffer_utm.area:.2f} sq meters "
f"({buffer_utm.area / 1e6:.6f} sq km)")
# Step 5: Transform buffer back to WGS84 for GeoJSON export
buffer_wgs84 = shapely_transform(
lambda x, y: utm_to_wgs84.transform(x, y),
buffer_utm,
)
return {
"center_wgs84": (longitude, latitude),
"center_utm": (utm_x, utm_y),
"utm_epsg": utm_epsg,
"buffer_meters": buffer_meters,
"buffer_area_sq_m": buffer_utm.area,
"buffer_wgs84_wkt": buffer_wgs84.wkt,
}
if __name__ == "__main__":
# Example 1: Transform NAD27 to WGS84 (requires datum shift)
print("=== NAD27 to WGS84 (datum transformation) ===")
nad27_coords = [(-77.0369, 38.9072), (-122.4194, 37.7749)]
wgs84_coords = transform_coordinates(
nad27_coords,
source_crs=4267, # NAD27
target_crs=4326, # WGS84
)
for orig, transformed in zip(nad27_coords, wgs84_coords):
dx = transformed[0] - orig[0]
dy = transformed[1] - orig[1]
print(f" {orig} -> ({transformed[0]:.6f}, {transformed[1]:.6f}) "
f"shift: ({dx:.6f} deg, {dy:.6f} deg)")
# Example 2: Correct metric buffer workflow
print("\n=== Metric buffer in UTM ===")
result = reproject_for_analysis(
longitude=-77.0369,
latitude=38.9072,
buffer_meters=500.0,
)
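The degree-versus-meter distinction the metric workflow above protects against is easy to quantify. A minimal sketch (assuming pyproj is installed) showing how far 0.001 degrees of longitude actually is at different latitudes:

```python
from pyproj import Geod

# Geodesic distances on the WGS84 ellipsoid
geod = Geod(ellps="WGS84")

for lat in (0.0, 45.0, 60.0, 80.0):
    # Ground distance covered by 0.001 degrees of longitude at this latitude
    _, _, dist_m = geod.inv(0.0, lat, 0.001, lat)
    print(f"0.001 deg of longitude at {lat:>4.0f}N = {dist_m:7.2f} m")

# The spread is why Point(lon, lat).buffer(0.001) on WGS84 coordinates
# produces a "circle" whose real size depends on where it sits on the globe.
```

At the equator 0.001 degrees is roughly 111 m; by 80 degrees north it has shrunk to under 20 m, which is exactly the silent distortion a naive degree buffer bakes into an analysis.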
5. Web Mapping
Web mapping is where geospatial data meets the browser, and the engineering challenges shift from analytical correctness to rendering performance. The two dominant libraries in 2026 are Leaflet (lightweight, raster-tile focused, massive plugin ecosystem) and MapLibre GL JS (GPU-accelerated, vector-tile native, data-driven styling). Deck.gl adds WebGL-powered large-scale data visualization on top of either base map. The critical technical decisions: raster tiles vs vector tiles, client-side vs server-side rendering, GeoJSON vs vector tile formats, and how to handle the inevitable EPSG:3857 (Web Mercator) projection that all web map tile systems use.
Vector tiles have become the standard for interactive web maps. Unlike raster tiles (pre-rendered PNG images), vector tiles transmit geometry and attributes as Protocol Buffer-encoded data, allowing client-side styling, smooth zooming, and dynamic label placement. MapLibre GL JS renders vector tiles on the GPU using WebGL, enabling fluid interaction with millions of features. The standard pipeline: use tippecanoe (from Felt/Mapbox) to generate .pmtiles or .mbtiles from GeoJSON or FlatGeobuf sources, serve them from a static file host or S3, and render with MapLibre GL JS using data-driven style expressions. The alternative — loading raw GeoJSON into the browser — works for datasets under approximately 10,000 features but degrades rapidly beyond that. A 50 MB GeoJSON file of building footprints will freeze the browser during parsing; the same data as vector tiles loads progressively and renders smoothly at any zoom level.
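The tippecanoe step of that pipeline is a single command; a sketch of composing it from Python, where the file names and layer name are placeholders, tippecanoe is assumed to be on PATH, and the actual run is left commented out:

```python
import subprocess

# Standard tippecanoe flags: -zg chooses a max zoom from the data density,
# --drop-densest-as-needed thins features to keep each tile under size limits.
cmd = [
    "tippecanoe",
    "-o", "buildings.pmtiles",   # PMTiles output (use .mbtiles if preferred)
    "-zg",
    "--drop-densest-as-needed",
    "--layer", "buildings",      # becomes the source-layer name in styles
    "buildings.geojson",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once tippecanoe is installed
```

The `--layer` value matters downstream: it is the `source-layer` string that MapLibre style layers must reference.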
AI tools are generally strong on web mapping because it overlaps heavily with mainstream web development. Cursor excels here: it indexes your MapLibre style specifications and autocompletes data-driven expressions, layer configurations, and event handlers. Copilot is surprisingly good at Leaflet plugin integration and basic tile layer setup. Claude Code handles the spatial reasoning aspects — choosing appropriate zoom levels, understanding tile coordinate math, implementing correct GeoJSON winding order for rendering.
/**
* MapLibre GL JS map with vector tile source, data-driven styling,
* and interactive popups. Displays census tract income data with
* a choropleth color ramp and tooltip on hover.
*/
// Initialize the map
const map = new maplibregl.Map({
container: "map",
style: {
version: 8,
sources: {
        // OpenStreetMap raster tiles (swap in any raster tile provider)
        "osm-tiles": {
          type: "raster",
          tiles: [
            "https://tile.openstreetmap.org/{z}/{x}/{y}.png"
          ],
          tileSize: 256,
          attribution:
            '© OpenStreetMap contributors',
},
},
layers: [
{
id: "osm-base",
type: "raster",
source: "osm-tiles",
},
],
},
center: [-77.0369, 38.9072], // Washington, DC [lon, lat]
zoom: 10,
maxZoom: 18,
minZoom: 3,
});
map.on("load", () => {
  // Add vector tile source for census tracts.
  // PMTiles served from static hosting (S3, Cloudflare R2, etc.) via the
  // PMTiles protocol, which must be registered before the map loads:
  //   const protocol = new pmtiles.Protocol();
  //   maplibregl.addProtocol("pmtiles", protocol.tile);
map.addSource("census-tracts", {
type: "vector",
url: "pmtiles://https://data.example.com/census-tracts-2024.pmtiles",
promoteId: "GEOID", // Use GEOID as the feature ID for state management
});
// Choropleth fill layer with data-driven color based on median income
map.addLayer(
{
id: "tracts-fill",
type: "fill",
source: "census-tracts",
"source-layer": "census_tracts",
paint: {
"fill-color": [
"step",
["get", "median_income"],
"#d73027", // < $25,000 - red
25000,
"#fc8d59", // $25,000 - $40,000 - orange
40000,
"#fee08b", // $40,000 - $60,000 - yellow
60000,
"#d9ef8b", // $60,000 - $80,000 - light green
80000,
"#91cf60", // $80,000 - $100,000 - green
100000,
"#1a9850", // > $100,000 - dark green
],
"fill-opacity": [
"case",
["boolean", ["feature-state", "hover"], false],
0.85,
0.6,
],
},
},
"osm-base" // Insert below labels
);
// Tract boundary outlines
map.addLayer({
id: "tracts-outline",
type: "line",
source: "census-tracts",
"source-layer": "census_tracts",
paint: {
"line-color": "#333333",
"line-width": [
"interpolate",
["linear"],
["zoom"],
8, 0.2, // Thin lines at low zoom
14, 1.5, // Thicker lines at high zoom
],
"line-opacity": 0.5,
},
});
// Hover state management
let hoveredTractId = null;
map.on("mousemove", "tracts-fill", (e) => {
if (e.features.length === 0) return;
// Clear previous hover state
if (hoveredTractId !== null) {
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: false }
);
}
hoveredTractId = e.features[0].id;
// Set new hover state
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: true }
);
// Update cursor
map.getCanvas().style.cursor = "pointer";
});
map.on("mouseleave", "tracts-fill", () => {
if (hoveredTractId !== null) {
map.setFeatureState(
{ source: "census-tracts", sourceLayer: "census_tracts", id: hoveredTractId },
{ hover: false }
);
}
hoveredTractId = null;
map.getCanvas().style.cursor = "";
});
// Click popup with tract details
map.on("click", "tracts-fill", (e) => {
if (e.features.length === 0) return;
const props = e.features[0].properties;
const income = props.median_income
? `$${Number(props.median_income).toLocaleString()}`
: "N/A";
const population = props.total_population
? Number(props.total_population).toLocaleString()
: "N/A";
new maplibregl.Popup({ maxWidth: "320px" })
.setLngLat(e.lngLat)
    .setHTML(
      `<h3>Census Tract ${props.GEOID}</h3>
       <p>County: ${props.county_name || "N/A"}</p>
       <p>Median Income: ${income}</p>
       <p>Population: ${population}</p>
       <p>Area: ${props.area_sq_km
         ? Number(props.area_sq_km).toFixed(2) + " km²"
         : "N/A"}</p>`
    )
.addTo(map);
});
});
// Navigation controls
map.addControl(new maplibregl.NavigationControl(), "top-right");
map.addControl(
new maplibregl.ScaleControl({ maxWidth: 200, unit: "metric" }),
"bottom-left"
);
6. Remote Sensing & Earth Observation
Remote sensing engineering has been transformed by the cloud-native geospatial movement. Instead of downloading terabytes of satellite imagery to local storage, the modern workflow queries STAC (SpatioTemporal Asset Catalog) APIs to search for imagery matching spatial, temporal, and quality criteria, then reads only the required pixels via HTTP range reads from Cloud Optimized GeoTIFFs stored in cloud object storage. The key libraries: pystac-client for STAC API access, rioxarray for loading raster data into xarray Datasets with CRS awareness, and Dask for lazy evaluation and distributed processing of datasets larger than memory.
The STAC ecosystem has standardized access to the world’s major Earth observation archives. Microsoft Planetary Computer hosts Sentinel-2, Landsat, NAIP, and dozens of other datasets behind a STAC API. Element 84’s Earth Search indexes Sentinel-2 and Landsat in AWS S3. NASA’s Common Metadata Repository (CMR) provides STAC access to the full NASA archive. The workflow is consistent: search by bounding box, date range, cloud cover percentage, and collection name; retrieve item metadata including asset URLs; load specific bands as xarray DataArrays; perform analysis. The performance-critical decision is how much data to load: a query that returns 500 Sentinel-2 tiles without filtering by cloud cover or limiting the date range will attempt to process hundreds of gigabytes.
AI tools vary dramatically in their remote sensing capability. Claude Code understands STAC query construction, band math for spectral indices (NDVI, NDWI, EVI, NBR), cloud masking using quality assessment bands, and the difference between surface reflectance and top-of-atmosphere products. Gemini CLI’s large context window is valuable for discussing sensor specifications, band wavelength ranges, and atmospheric correction methods. Copilot and Cursor handle xarray syntax well but lack the domain knowledge to construct correct STAC queries or choose appropriate spectral bands.
"""
Cloud-native remote sensing workflow: search Sentinel-2 imagery via STAC,
load specific bands with rioxarray, compute NDVI, and mask clouds.
"""
import planetary_computer
import pystac_client
import rioxarray
import xarray as xr
import numpy as np
from rasterio.enums import Resampling
def search_sentinel2_imagery(
bbox: tuple[float, float, float, float],
date_range: str,
max_cloud_cover: int = 20,
max_items: int = 10,
) -> list:
"""Search Microsoft Planetary Computer for Sentinel-2 L2A imagery.
Args:
bbox: Bounding box as (west, south, east, north) in WGS84 degrees
date_range: Date range as "YYYY-MM-DD/YYYY-MM-DD"
max_cloud_cover: Maximum cloud cover percentage (0-100)
max_items: Maximum number of items to return
Returns:
List of STAC items matching the search criteria
"""
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
modifier=planetary_computer.sign_inplace,
)
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=bbox,
datetime=date_range,
query={"eo:cloud_cover": {"lt": max_cloud_cover}},
max_items=max_items,
sortby=[{"field": "properties.eo:cloud_cover", "direction": "asc"}],
)
items = list(search.items())
print(f"Found {len(items)} Sentinel-2 scenes with <{max_cloud_cover}% cloud cover")
for item in items:
props = item.properties
print(
f" {item.id}: {props['datetime'][:10]}, "
f"cloud: {props['eo:cloud_cover']:.1f}%, "
f"platform: {props.get('platform', 'unknown')}"
)
return items
def compute_ndvi_from_stac(
item,
bbox: tuple[float, float, float, float],
resolution: int = 10,
) -> xr.DataArray:
"""Load Sentinel-2 red and NIR bands and compute NDVI.
Uses rioxarray for CRS-aware loading with automatic reprojection
and spatial subsetting to the area of interest.
Sentinel-2 L2A bands:
- B04 (Red): 665nm, 10m resolution
- B08 (NIR): 842nm, 10m resolution
- SCL (Scene Classification): 20m resolution (for cloud masking)
Args:
item: STAC item from search results
bbox: Area of interest (west, south, east, north) in WGS84
resolution: Target resolution in meters
Returns:
xr.DataArray with NDVI values, cloud-masked
"""
    # Load red band (B04) clipped to AOI. Note: no overview_level argument;
    # overview_level=0 would read the first (downsampled) overview rather
    # than the full-resolution data.
    red_href = item.assets["B04"].href
    red = rioxarray.open_rasterio(red_href).squeeze()
    red = red.rio.clip_box(*bbox, crs="EPSG:4326")
    red = red.astype(np.float32)
    # Load NIR band (B08) clipped to AOI
    nir_href = item.assets["B08"].href
    nir = rioxarray.open_rasterio(nir_href).squeeze()
    nir = nir.rio.clip_box(*bbox, crs="EPSG:4326")
    nir = nir.astype(np.float32)
    # Load Scene Classification Layer for cloud masking
    scl_href = item.assets["SCL"].href
    scl = rioxarray.open_rasterio(scl_href).squeeze()
    scl = scl.rio.clip_box(*bbox, crs="EPSG:4326")
    # Reproject SCL to match red/NIR resolution (SCL is 20m, bands are 10m);
    # nearest-neighbor preserves the categorical class values
    scl = scl.rio.reproject_match(red, resampling=Resampling.nearest)
    # Sentinel-2 L2A scale factor: DN / 10000 -> surface reflectance.
    # Caveat: scenes from processing baseline >= 04.00 (late Jan 2022
    # onward) carry an additional +1000 DN offset (BOA_ADD_OFFSET) that
    # should be subtracted before scaling for radiometrically correct values.
    red = red / 10000.0
    nir = nir / 10000.0
# Create cloud mask from SCL values
# SCL classes to mask out:
# 0: No data, 1: Saturated/defective, 2: Dark area pixels,
# 3: Cloud shadow, 8: Cloud medium probability,
# 9: Cloud high probability, 10: Thin cirrus, 11: Snow/ice
cloud_mask = scl.isin([0, 1, 2, 3, 8, 9, 10, 11])
clear_mask = ~cloud_mask
# Calculate NDVI: (NIR - Red) / (NIR + Red)
denominator = nir + red
ndvi = xr.where(
(denominator != 0) & clear_mask,
(nir - red) / denominator,
np.nan,
)
# Clip to valid NDVI range
ndvi = ndvi.clip(-1.0, 1.0)
# Preserve CRS and spatial metadata
ndvi = ndvi.rio.write_crs(red.rio.crs)
ndvi = ndvi.rio.write_transform()
ndvi = ndvi.rio.write_nodata(np.nan)
# Add metadata
ndvi.attrs["long_name"] = "Normalized Difference Vegetation Index"
ndvi.attrs["source"] = item.id
ndvi.attrs["date"] = item.properties["datetime"][:10]
ndvi.attrs["cloud_cover_pct"] = item.properties["eo:cloud_cover"]
clear_pct = float(clear_mask.sum()) / float(clear_mask.size) * 100
print(f" NDVI computed: {ndvi.shape}, clear pixels: {clear_pct:.1f}%")
print(f" NDVI range: [{float(ndvi.min()):.3f}, {float(ndvi.max()):.3f}]")
print(f" CRS: {ndvi.rio.crs}")
return ndvi
if __name__ == "__main__":
# Area of interest: agricultural region in Iowa
aoi_bbox = (-93.8, 41.5, -93.4, 41.8)
# Search for summer 2025 imagery
items = search_sentinel2_imagery(
bbox=aoi_bbox,
date_range="2025-06-01/2025-08-31",
max_cloud_cover=15,
max_items=5,
)
if items:
# Compute NDVI from the least cloudy scene
best_item = items[0]
print(f"\nProcessing: {best_item.id}")
ndvi = compute_ndvi_from_stac(best_item, bbox=aoi_bbox)
        # Save as a true Cloud Optimized GeoTIFF: GDAL's COG driver adds
        # internal tiling and overviews (plain GTiff with tiled=True does not)
        import os
        os.makedirs("output", exist_ok=True)
        output_path = f"output/ndvi_{best_item.properties['datetime'][:10]}.tif"
        ndvi.rio.to_raster(
            output_path,
            driver="COG",
            compress="deflate",
        )
print(f"\nSaved NDVI raster to {output_path}")
7. Spatial Data Pipelines
Spatial data pipelines — the ETL (Extract, Transform, Load) workflows that convert, clean, reproject, and consolidate geospatial data from disparate sources into analysis-ready formats — are where GIS engineers spend a surprising proportion of their time. The sources are heterogeneous: Shapefiles from government agencies (still the most common exchange format despite being a 1990s design with a 2 GB size limit and no support for field names longer than 10 characters), GeoJSON from web APIs, GeoPackage from QGIS projects, KML from Google Earth, CSV with lat/lon columns, GeoParquet from modern data lakehouse architectures, and proprietary formats like Esri FileGDB. The pipeline must handle CRS detection and transformation, geometry validation and repair, attribute schema normalization, and often deduplication based on spatial proximity.
The modern geospatial data engineering stack has evolved significantly. GeoParquet (the Parquet format with GeoArrow-encoded geometry columns and CRS metadata) has emerged as the standard analytical format, combining Parquet’s columnar compression and predicate pushdown with spatial metadata. DuckDB’s spatial extension enables SQL-based spatial analysis directly on Parquet and GeoParquet files without loading data into a database first. Apache Sedona (formerly GeoSpark) provides distributed spatial processing on Spark for datasets that exceed single-machine capacity. GDAL/OGR’s ogr2ogr command-line tool remains the universal format converter, capable of reading and writing over 80 vector formats with CRS transformation, SQL-based filtering, and geometry reprojection in a single command.
AI tools are useful for spatial pipeline construction because much of the work is boilerplate: reading a format, transforming coordinates, validating geometry, writing to another format. Claude Code generates correct ogr2ogr command lines with proper SRID handling and format-specific creation options. Cursor autocompletes DuckDB spatial SQL from your existing query patterns. Amazon Q is strong on AWS-native spatial pipelines (Athena with GeoParquet, Glue for spatial ETL, Redshift spatial functions).
"""
Spatial data pipeline with DuckDB: convert Shapefiles to GeoParquet,
perform spatial analysis, and export results.
"""
import duckdb
from pathlib import Path
def build_spatial_pipeline(
input_dir: str,
output_path: str,
target_srid: int = 4326,
) -> dict:
"""Spatial ETL pipeline using DuckDB spatial extension.
1. Load Shapefiles and GeoJSON files from input directory
2. Standardize CRS to WGS84
3. Validate and repair geometries
4. Perform spatial aggregation (points per polygon)
5. Export to GeoParquet
Args:
input_dir: Directory containing source geospatial files
output_path: Path for output GeoParquet file
target_srid: Target SRID for output (default: 4326 WGS84)
Returns:
Dictionary with pipeline statistics
"""
con = duckdb.connect()
# Install and load spatial extension
con.execute("INSTALL spatial; LOAD spatial;")
    # Load building permits (points) from GeoJSON. ST_Read parses the
    # geometry directly into a GEOMETRY column named "geom"; no
    # ST_GeomFromGeoJSON / ST_GeomFromWKB conversion is needed.
    con.execute(f"""
        CREATE TABLE permits AS
        SELECT * FROM ST_Read('{input_dir}/building_permits.geojson')
    """)
    # Load neighborhood boundaries (polygons) from Shapefile
    con.execute(f"""
        CREATE TABLE neighborhoods AS
        SELECT * FROM ST_Read('{input_dir}/neighborhoods.shp')
    """)
# Check and report source CRS info
permit_count = con.execute("SELECT COUNT(*) FROM permits").fetchone()[0]
neighborhood_count = con.execute("SELECT COUNT(*) FROM neighborhoods").fetchone()[0]
print(f"Loaded {permit_count} permits, {neighborhood_count} neighborhoods")
    # Validate geometries and repair invalid ones. Count them first:
    # DuckDB has no SQLite-style changes() function to query afterwards.
    invalid_repaired = con.execute("""
        SELECT COUNT(*) FROM neighborhoods WHERE NOT ST_IsValid(geom)
    """).fetchone()[0]
    con.execute("""
        UPDATE neighborhoods
        SET geom = ST_MakeValid(geom)
        WHERE NOT ST_IsValid(geom)
    """)
    print(f"Repaired {invalid_repaired} invalid neighborhood geometries")
    # Transform CRS if needed. DuckDB geometries do not carry an SRID,
    # so the source CRS must be known out of band; the sources here are
    # assumed to be WGS84, and ST_Transform takes explicit from/to CRSs.
    if target_srid != 4326:
        con.execute(f"""
            UPDATE permits
            SET geom = ST_Transform(geom, 'EPSG:4326', 'EPSG:{target_srid}')
        """)
# Spatial join: count permits per neighborhood and calculate density
con.execute("""
CREATE TABLE neighborhood_stats AS
SELECT
n.neighborhood_id,
n.name AS neighborhood_name,
n.geom,
            -- Equal-area CRS (EPSG:5070, CONUS Albers) for area, not Web
            -- Mercator, which inflates areas away from the equator
            ST_Area(
                ST_Transform(n.geom, 'EPSG:4326', 'EPSG:5070')
            ) / 1e6 AS area_sq_km,
COUNT(p.permit_id) AS permit_count,
COUNT(CASE WHEN p.permit_type = 'NEW_CONSTRUCTION' THEN 1 END)
AS new_construction_count,
COUNT(CASE WHEN p.permit_type = 'RENOVATION' THEN 1 END)
AS renovation_count,
SUM(COALESCE(p.estimated_cost, 0)) AS total_investment,
AVG(COALESCE(p.estimated_cost, 0))
FILTER (WHERE p.estimated_cost > 0) AS avg_permit_cost
FROM neighborhoods n
LEFT JOIN permits p
ON ST_Contains(n.geom, p.geom)
GROUP BY n.neighborhood_id, n.name, n.geom
""")
# Add derived metrics
con.execute("""
ALTER TABLE neighborhood_stats ADD COLUMN permits_per_sq_km DOUBLE;
UPDATE neighborhood_stats
SET permits_per_sq_km = CASE
WHEN area_sq_km > 0 THEN permit_count / area_sq_km
ELSE 0
END;
""")
# Report top neighborhoods by development activity
top_neighborhoods = con.execute("""
SELECT
neighborhood_name,
permit_count,
new_construction_count,
ROUND(permits_per_sq_km, 1) AS density,
ROUND(total_investment / 1e6, 2) AS investment_millions
FROM neighborhood_stats
WHERE permit_count > 0
ORDER BY permit_count DESC
LIMIT 10
""").fetchall()
print("\nTop neighborhoods by permit activity:")
for row in top_neighborhoods:
print(
f" {row[0]}: {row[1]} permits "
f"({row[2]} new construction), "
f"{row[3]} permits/km2, "
f"${row[4]}M investment"
)
# Export to GeoParquet
con.execute(f"""
COPY (
SELECT
neighborhood_id,
neighborhood_name,
permit_count,
new_construction_count,
renovation_count,
ROUND(area_sq_km, 4) AS area_sq_km,
ROUND(permits_per_sq_km, 2) AS permits_per_sq_km,
ROUND(total_investment, 2) AS total_investment,
ROUND(avg_permit_cost, 2) AS avg_permit_cost,
                geom AS geometry  -- keep GEOMETRY type so DuckDB writes GeoParquet metadata
FROM neighborhood_stats
ORDER BY permit_count DESC
)
TO '{output_path}'
(FORMAT PARQUET, COMPRESSION 'ZSTD')
""")
# Collect pipeline statistics
stats = {
"input_permits": permit_count,
"input_neighborhoods": neighborhood_count,
"geometries_repaired": invalid_repaired,
"output_rows": con.execute(
"SELECT COUNT(*) FROM neighborhood_stats"
).fetchone()[0],
"total_permits_matched": con.execute(
"SELECT SUM(permit_count) FROM neighborhood_stats"
).fetchone()[0],
"output_path": output_path,
}
con.close()
print(f"\nPipeline complete: {stats['output_rows']} neighborhoods exported")
print(f" {stats['total_permits_matched']}/{stats['input_permits']} "
f"permits matched to neighborhoods")
print(f" Output: {output_path}")
return stats
if __name__ == "__main__":
stats = build_spatial_pipeline(
input_dir="data/raw",
output_path="data/processed/neighborhood_development.parquet",
target_srid=4326,
)
Quick-Reference: Best Tool by Task
| Task | Best Tool | Why |
|---|---|---|
| PostGIS spatial queries | Claude Code | Understands geometry vs geography types, index usage patterns, SRID handling |
| GeoPandas analysis | Claude Code | Generates CRS-aware spatial joins, warns about geographic vs projected operations |
| Rasterio / GDAL processing | Claude Code | Correct windowed reads, nodata handling, COG output configuration |
| CRS transformations | Claude Code | Knows datum shifts, always_xy parameter, recommends appropriate projections for regions |
| MapLibre GL / Leaflet maps | Cursor Pro | Indexes style specs, autocompletes data-driven expressions and layer configs |
| Vector tile pipelines | Cursor Pro | tippecanoe flags, PMTiles configuration, layer optimization from project context |
| STAC API / remote sensing | Claude Code | Correct STAC query construction, band selection, cloud masking, spectral indices |
| AWS spatial infrastructure | Amazon Q | Athena GeoParquet queries, Location Service, Glue spatial ETL |
| Large GIS codebase navigation | Cursor Pro | Indexes QGIS plugins, GeoServer configs, multi-module geospatial projects |
| GDAL/OGR documentation | Gemini CLI | 1M context for GDAL driver docs, format specifications, and ogr2ogr flag combinations |
| DuckDB spatial analysis | Claude Code | Correct spatial SQL with GeoParquet I/O, ST_Transform pipelines, analytical queries |
What AI Tools Get Wrong About GIS
- Ignoring CRS entirely: The most common AI-generated GIS bug. Code that performs a spatial join between two GeoDataFrames without checking or aligning their CRSs. Code that calculates distances using Shapely’s Euclidean distance on WGS84 coordinates, producing results in meaningless “degree units.” Code that buffers geographic coordinates by a value in meters, producing wildly wrong geometry because one degree of latitude is approximately 111 km but one degree of longitude varies from 111 km at the equator to 0 at the poles.
- Loading entire rasters into memory: `data = rasterio.open("huge_raster.tif").read()` for a multi-gigabyte file. The correct pattern is windowed reads: `dataset.read(1, window=window)` to process blocks that fit in memory. For cloud-hosted data, this also means the difference between downloading the entire file and fetching only the needed tiles via HTTP range requests.
- Confusing latitude/longitude order: GeoJSON uses [longitude, latitude] per RFC 7946. Leaflet’s `L.latLng()` uses (latitude, longitude). PostGIS `ST_MakePoint()` takes (x, y), which is (longitude, latitude). WKT uses (longitude latitude). Google Maps API uses `{lat, lng}`. AI tools frequently mix these up, placing points on the wrong side of the planet. The `always_xy=True` parameter in pyproj exists precisely because of this confusion.
- Missing spatial indexes: Generating PostGIS queries with `ST_Intersects` or `ST_DWithin` but not creating or verifying GiST indexes on the geometry columns. On a table with 10 million rows, this turns a 50 ms query into a 45-minute sequential scan. AI tools generate the query but rarely generate the `CREATE INDEX ... USING GIST` statement or the `EXPLAIN ANALYZE` to verify index usage.
- Using ST_Distance without ST_DWithin: `WHERE ST_Distance(a.geom, b.geom) < 1000` cannot use a spatial index because it must compute the distance for every pair. `WHERE ST_DWithin(a.geom, b.geom, 1000)` uses the GiST index to filter by bounding box first. AI tools generate the former when they should always generate the latter.
- Web Mercator for analysis: Using EPSG:3857 (Web Mercator) for area or distance calculations. Web Mercator is a display projection designed for tiling web maps, not for spatial analysis. It distorts areas increasingly at higher latitudes: a polygon in Norway appears roughly 4x larger than its actual area. The correct approach: use an equal-area projection (Albers, Lambert Azimuthal) for area calculations, an equidistant projection for distance measurements, and Web Mercator only for rendering tiles.
- GeoJSON without right-hand rule: RFC 7946 requires exterior rings to be counter-clockwise and interior rings (holes) to be clockwise. Many rendering engines silently tolerate wrong winding order, but spherical renderers (notably d3-geo) will render the polygon as covering the entire earth except the intended area. AI tools generate GeoJSON without ensuring correct winding order, creating maps where continents disappear and oceans become solid fill.
- Ignoring nodata in raster analysis: Performing band math on satellite imagery without checking for nodata values. Cloud-covered pixels, sensor gaps, and quality-flagged pixels all have nodata values that must be masked before computation. NDVI calculated with nodata pixels produces extreme outlier values that corrupt statistics and classification results.
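The nodata pitfall in the last bullet is concrete enough to demonstrate with plain NumPy; the fill value of 0 below is an assumed convention for the sketch, while real code should read it from the dataset's metadata (`dataset.nodata` in rasterio):

```python
import numpy as np

nodata = 0.0  # assumed fill value for this sketch
red = np.array([[1200.0, 0.0], [1500.0, 1600.0]], dtype=np.float32)
nir = np.array([[3200.0, 0.0], [4200.0, 1700.0]], dtype=np.float32)

# Wrong: the nodata pixel silently joins the band math and skews statistics
ndvi_wrong = (nir - red) / (nir + red + 1e-9)

# Right: mask nodata first, compute only on valid pixels
valid = (red != nodata) & (nir != nodata)
safe_denom = np.where(valid, nir + red, 1.0)  # avoid divide-by-zero
ndvi = np.where(valid, (nir - red) / safe_denom, np.nan)

print("unmasked mean:", float(ndvi_wrong.mean()))
print("masked mean:  ", float(np.nanmean(ndvi)))
```

Even with a single bad pixel out of four, the unmasked mean is pulled noticeably toward zero; across a cloud-covered satellite scene the bias is far worse.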
Cost Model: What GIS Engineers Actually Pay
Scenario 1: Student / Learner — $0/month
- Copilot Free (2,000 completions/mo) for basic GeoPandas, QGIS plugin development, and PostGIS queries
- Plus Gemini CLI Free for discussing GDAL documentation, understanding projections, and learning spatial SQL patterns
- Combined with QGIS (free and open source), PostGIS, and the Python geospatial stack, this is sufficient for coursework and personal projects. Manually verify CRS handling and spatial index usage in all generated code.
Scenario 2: Solo GIS Analyst — $10/month
- Copilot Pro ($10/mo) for unlimited completions when writing GeoPandas analysis scripts, QGIS processing plugins, and ArcPy automation
- Good for daily GIS workflows: loading Shapefiles, performing spatial joins, generating maps, automating geoprocessing tasks. Copilot handles the Python boilerplate while you verify the spatial logic. Watch for CRS issues and missing index creation in PostGIS queries.
Scenario 3: GIS Developer — $20/month
- Cursor Pro ($20/mo) for working with large GIS codebases: QGIS plugins, GeoServer configurations, web mapping applications with MapLibre/Leaflet, and PostGIS migration files
- Or Claude Code ($20/mo) for spatial algorithm development, CRS transformation pipelines, raster processing workflows, and complex PostGIS query optimization
- Choose Cursor if your work centers on a large existing codebase with established patterns. Choose Claude Code if your work involves designing new spatial algorithms, optimizing complex queries, or working with unfamiliar CRS and projection issues.
Scenario 4: Full-stack Geo Developer — $40/month
- Claude Code ($20/mo) for spatial reasoning: CRS selection, PostGIS query optimization, raster pipeline design, STAC workflow construction
- Plus Cursor Pro ($20/mo) for codebase-indexed development: web mapping UI, API endpoints serving GeoJSON, vector tile pipeline configuration, multi-file refactoring
- The optimal combination for GIS engineers building full-stack geospatial applications. Claude Code for the domain-specific spatial logic, Cursor for the daily development velocity across frontend map components, backend spatial APIs, and database migration files.
Scenario 5: GIS Team Lead — $19–40/seat/month
- Copilot Business ($19/seat/mo) or Cursor Business ($40/seat/mo) for team-wide codebase indexing and consistent spatial coding patterns
- Cursor Business indexes the team’s shared PostGIS migration history, MapLibre style specifications, and GDAL processing scripts, ensuring consistent patterns across developers. Copilot Business provides organization-level policy controls and audit logging.
Scenario 6: Enterprise / Government — $39–99/seat/month
- Copilot Enterprise ($39/seat/mo) or Cursor Enterprise for organization-wide codebase indexing with compliance controls
- Plus Claude Code ($20/seat/mo) for specialized spatial analysis and architecture review
- Government GIS agencies and enterprise mapping companies need consistent spatial data quality standards across large teams. Enterprise tiers index the full proprietary codebase, enforce CRS conventions, and provide audit trails. Tabnine ($39/user/mo) offers self-hosted deployment for air-gapped environments handling classified geospatial data.
The GIS Engineer’s Verdict
The fundamental challenge for AI tools in GIS engineering is that spatial correctness is domain-dependent and often invisible. A function that returns a number is not obviously wrong — it takes domain knowledge to recognize that a buffer distance of 0.001 on WGS84 coordinates is approximately 111 meters at the equator but only about 2 meters at 89 degrees latitude, or that a PostGIS ST_Distance on geometry columns in EPSG:4326 returns results in degrees rather than meters. AI tools that lack this domain awareness generate code that compiles, runs without errors, and produces results that are silently, systematically wrong in ways that propagate through every downstream analysis.
The good news is that AI tools are genuinely useful for GIS work when paired with domain expertise. The geospatial Python ecosystem has a steep learning curve — remembering whether Rasterio returns bands-first or bands-last arrays, which GeoPandas methods trigger CRS warnings, how to construct ogr2ogr commands with the right combination of -t_srs, -s_srs, and -a_srs flags, which PostGIS functions support geography type and which only work with geometry. AI tools accelerate this boilerplate significantly. Claude Code generates correct PostGIS queries with proper SRID handling and index-aware patterns more reliably than any other tool. Cursor indexes your project’s existing spatial conventions and autocompletes consistently. The combination of Claude Code for spatial reasoning and Cursor for codebase-indexed development covers the full GIS development workflow.
The tool-specific recommendations: Claude Code is the best single tool for GIS engineers. It understands CRS implications (when to use geography vs geometry, which projection to choose for a study area, why always_xy=True matters in pyproj), generates correct PostGIS spatial queries with index usage, handles raster processing with proper nodata masking and windowed reads, and constructs STAC queries for remote sensing workflows. Cursor is the best for GIS application development — web mapping with MapLibre GL, vector tile pipelines, multi-file geospatial projects where understanding the existing codebase is more important than spatial algorithm design. Gemini CLI is valuable for GDAL documentation and format-specific questions — paste the entire GDAL driver documentation into its 1M context window for precise ogr2ogr flag recommendations. Amazon Q is worth considering if your spatial infrastructure is AWS-native (Athena with GeoParquet, Location Service, SageMaker for geospatial ML).
The bottom line: treat AI-generated geospatial code the same way you treat GIS data from an unknown source — validate the CRS, check the geometry validity, verify the spatial relationships make sense, and never assume the numbers are correct just because the code ran without errors. In GIS, silent errors are the default, not the exception. The right AI tools make you faster at generating correct spatial code; the wrong ones (or the right ones used without verification) make you faster at generating wrong spatial code that looks right.
Compare all tools and pricing on our main comparison table, or check the cheapest tools guide for budget options.
Related on CodeCosts
- AI Coding Tools for Data Engineers (2026) — ETL pipelines, data warehouses, dbt, Spark
- AI Coding Tools for Data Scientists (2026) — pandas, scikit-learn, Jupyter, statistical analysis
- AI Coding Tools for Backend Engineers (2026) — APIs, databases, server-side architecture
- AI Coding Tools for Cloud Architects (2026) — infrastructure as code, multi-cloud, cost optimization