Skip to content

Waterpark — HEALPix Data Hub

MODIS SST on healpix 10

Sea surface temperature seen by MODIS AQUA, remapped to HEALPix level 10.

Waterpark is a task force effort at DKRZ to convert climate and Earth observation datasets onto a common HEALPix grid and serve them as multi-resolution Zarr pyramids from S3 object storage. The unified grid enables efficient cross-dataset analysis, ML training pipelines, and interactive visualisation without per-dataset regridding at access time.

Currently available Datasets

Dataset Data Access via
CMIP6 s3://cimp6
DYAMOND s3://dyamond
EERIE s3://eerie
ERA5 s3://era5
ICDC s3://icdc
ICON-DREAM s3://icon-dream
nextGEMS s3://nextgems
ORCESTRA s3://orchsestra

Preliminary test data

The data shown here is part of a test/prototype setup and must be considered preliminary.

It may be changed or deleted at any time. There is no guarantee of availability, persistence, completeness, or long-term access.

The underlying S3 buckets are not intended or approved for public access. Bucket URLs must therefore not be published, shared, or linked from public documentation.


Remapping methodology

All datasets should be remapped using pre-computed, reusable ESMF weight files applied via sparse matrix multiplication. Weight generation is a one-time cost per source grid; application runs at near-memory-bandwidth speed using batched cuSPARSE on GPU or batched SciPy on CPU.

Variable type Method Rationale
Continuous fields (SST, temperature, wind, radiation, …) Conservative Area-weighted averaging preserves integrals and handles sub-pixel variability correctly.
Discrete / categorical fields (land-sea masks, land cover, soil type, …) Nearest neighbour Preserves class labels without creating non-physical intermediate values.

The target HEALPix level for each dataset is chosen as the next level whose characteristic pixel spacing is coarser than the source resolution, avoiding oversampling. Lower pyramid levels are derived by nested coarsening (reshape + nanmean), not by repeated remapping.

Technical details
  • Weight generation uses ESMF (ESMF_RegridWeightGen under MPI for large grids, ESMPy in-memory for moderate grids).
  • All HEALPix geometry is computed on a perfect sphere, consistent with ESMF's internal overlap calculation and with the geocentric coordinates used by climate models.
  • Weight application supports three backends selected automatically: cuSPARSE / GPU (via CuPy), Numba (fused CSR kernel), and batched SciPy (BLAS-accelerated sparse matmul).
  • The tooling is provided by the grid-doctor Python package.

Benchmark for conservative remapping of HP Level 10 on NVIDIA GH200

The example below outlines the full pipeline of conservative remapping on a 4x NVIDIA Grace Hopper 200 Superchip node (288 CPUs and 856 GB RAM, 4 GPUs 382 GB) using CuPy and ESMF with 64 openmpi ranks. The dataset that was regridded and uploaded on s3 was one year of MODIS-AQUA at roughly 2km resolution (HEALPix level 10).

from dask.diagnostics.progress import ProgressBar
from getpass import getuser
from pathlib import Path

import grid_doctor as gd

# --- 1. Open source data ---
dset = gd.cached_open_dataset(
    Path("/pool/data/ICDC/ocean/modis_aqua_sst/DATA/daily/2025").rglob("*.nc"),
    chunks={"lat": -1},
)

# --- 2. Generate reusable weights (one-time) ---
weights_dir = Path(
    "/scratch/{u[0]}/{u}/healpix-weights".format(u=getuser())
)
resolved_level = gd.resolution_to_healpix_level(
    gd.get_latlon_resolution(dset)
)
%time weights_file = gd.cached_weights(
    dset,
    level=resolved_level,
    cache_path=weights_dir,
    nproc=64,
    prefer_offline=True,
)
# Wall time: ~2 min 30 s (Grace CPUs, 64 MPI ranks)

# --- 3. Build multi-resolution pyramid ---
pyramid = gd.create_healpix_pyramid(
    dset,
    max_level=resolved_level,
    weights_path=weights_file,
    backend="cupy",
)
# --- 4. Verify ---
with ProgressBar():
    hp = pyramid[resolved_level].isel(time=slice(0, 10)).load()
    # Wall time: ~13 s (Hopper GPU)

# --- 5. Write to S3 ---
s3_options = gd.get_s3_options(
    "https://s3-example.org",
    Path("~/.s3-credentials.json").expanduser(),
)
%time gd.save_pyramid(
    pyramid,
    "/icon-dream/healpix/icdc/modis/aqua",
    s3_options,
    mode="w",
)
# Wall time: ~1 h 25 min (Hopper GPU)

The result is displayed below:

MODIS SST on HEALPix
HEALPix: remapped to HEALPix level 10.

MODIS SST on original grid
Original: regular lat-lon grid at about 2.6 km resolution.

Datasets

Observations & reanalysis

ICDC — Integrated Climate Data Center

Status: In Progress

Satellite and in-situ derived climate data records curated by the ICDC at Universität Hamburg.

Property Value
Source resolution varies
HEALPix level varies
Temporal coverage 2002 – present (daily)
Variables TBD

S3 store: s3://icdc

Example: icdc/healpix/atmosphere/IMERG/PT30M/level_9
    <xarray.Dataset> Size: 587GB
    Dimensions:                              (time: 11664, cell: 3145728)
    Coordinates:
      * time                                 (time) datetime64[ns] 93kB 2025-01-0...
      * cell                                 (cell) int64 25MB 0 1 ... 3145727
        crs                                  float64 8B ...
        latitude                             (cell) float64 25MB dask.array<chunksize=(98304,), meta=np.ndarray>
        longitude                            (cell) float64 25MB dask.array<chunksize=(98304,), meta=np.ndarray>
    Data variables:
        calibrated_precipitation             (time, cell) float32 147GB dask.array<chunksize=(1, 3145728), meta=np.ndarray>
        precipitation_qualityindex           (time, cell) float32 147GB dask.array<chunksize=(1, 3145728), meta=np.ndarray>
        precipitation_randomerror            (time, cell) float32 147GB dask.array<chunksize=(1, 3145728), meta=np.ndarray>
        probability_of_liquid_precipitation  (time, cell) float32 147GB dask.array<chunksize=(1, 3145728), meta=np.ndarray>
    Attributes: (12/33)
        Conventions:                CF-1.6
        title:                      GPM_3IMERGHH: NASA Global Precipitation Measu...
        summary:                    See https://disc.gsfc.nasa.gov/datasets/GPM_3...
        institution:                Producer: NASA Global Precipitation Measureme...
        creator_url:                https://gpm.nasa.gov/missions/GPM ; https://w...
        creator_name:               NASA Global Precipitation Measurement Mission...
        ...                         ...
        comment:                    This is a reduced data set. Not included (but...
        references:                 1) Tan, J., G. J. Huffman, D. T. Belvin, E. J...
        citation:                   Huffman, G.J., E.F. Stocker, D.T. Bolvin, E.J...
        healpix_nside:              512
        healpix_level:              9
        healpix_order:              nested
ERA5 — ECMWF Reanalysis v5

Status: Available

The ERA5 global atmospheric reanalysis produced by ECMWF, covering 1940 to present on a 0.25° regular latitude-longitude grid (~31 km). Provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables.

Property Value
Source resolution 0.25° (~31 km)
HEALPix level 7 (nside = 128, ~27 arcmin)
Temporal coverage 1940 – present (hourly)
Variables T2m, Precip,

S3 store: s3://era5 Notes: time freq: cmor, level.zarr

Example: era5/PT1H/level_7
    <xarray.Dataset> Size: 1TB
    Dimensions:    (time: 754752, cell: 196608)
    Coordinates:
      * time       (time) datetime64[ns] 6MB 1940-01-01 ... 2026-02-05T23:00:00
      * cell       (cell) int64 2MB 0 1 2 3 4 ... 196603 196604 196605 196606 196607
        crs        float64 8B ...
        latitude   (cell) float64 2MB dask.array<chunksize=(49152,), meta=np.ndarray>
        longitude  (cell) float64 2MB dask.array<chunksize=(49152,), meta=np.ndarray>
    Data variables:
        pr         (time, cell) float32 594GB dask.array<chunksize=(48, 196608), meta=np.ndarray>
        tas        (time, cell) float32 594GB dask.array<chunksize=(48, 196608), meta=np.ndarray>
    Attributes:
        CDI:            Climate Data Interface version 1.9.6 (http://mpimet.mpg.d...
        history:        Thu Mar 23 23:49:59 2023: cdo -s -z zip_9 mergetime /scra...
        institution:    European Centre for Medium-Range Weather Forecasts
        Conventions:    CF-1.6
        license:        Contains modified Copernicus Atmosphere Monitoring Servic...
        tracking_id:    d5b13485-16f3-5f65-8dfd-cf03615bcc01
        creation_date:  2023-03-23T23:27:08Z
        CDO:            Climate Data Operators version 1.9.6 (http://mpimet.mpg.d...
        healpix_nside:  128
        healpix_level:  7
        healpix_order:  nested

Model intercomparisons & campaigns

EERIE — Eddy-Rich Earth System Models

Status: Available

EERIE is an EU Horizon Europe project running coupled climate simulations at ocean-eddy-resolving resolution (< 10 km ocean, < 25 km atmosphere). The project aims to understand the role of ocean mesoscale eddies in the climate system and provide high-fidelity projections.

Property Value
Source resolution ~5 km (model-dependent)
HEALPix level 9
Models TBD
Variables T2m, Precip,

S3 store: s3://eerie Notes: Pseudo directory for experiments. More metadata

Example eerie/eerie-hist-1950-v20240618_P1M_mean_9.zarr
<xarray.Dataset> Size: 20GB
Dimensions:  (time: 780, cell: 3145728, crs: 1)
Coordinates:
  * time     (time) datetime64[ns] 6kB 1950-01-31T23:59:59 ... 2014-12-31T23:...
  * crs      (crs) float32 4B 0.0
Dimensions without coordinates: cell
Data variables:
    pr       (time, cell) float32 10GB dask.array<chunksize=(12, 262144), meta=np.ndarray>
    ts       (time, cell) float32 10GB dask.array<chunksize=(12, 262144), meta=np.ndarray>
DYAMOND — DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains

Status: Available

DYAMOND is a model intercomparison of global storm-resolving simulations at 2.5–5 km resolution. Includes both summer (2016) and winter (2020) experiment phases with output from ICON, IFS, NICAM, MPAS, GEOS, SAM, SCREAM, SHiELD, and UM.

Property Value
Source resolution 2.5–5 km
HEALPix level 10–11
Temporal coverage 40-day windows (summer 2016, winter 2020)
Variables T2m, Precip,

S3 store: s3://dyamond

Example dyamond/icon-sap-5km/PT15M/level_10.zarr
    <xarray.Dataset> Size: 399GB
    Dimensions:    (time: 3961, cell: 12582912)
    Coordinates:
      * time       (time) datetime64[ns] 32kB 2020-01-20 ... 2020-03-01T06:00:00
      * cell       (cell) int64 101MB 0 1 2 3 ... 12582909 12582910 12582911
        crs        float64 8B ...
        latitude   (cell) float64 101MB dask.array<chunksize=(4194304,), meta=np.ndarray>
        longitude  (cell) float64 101MB dask.array<chunksize=(4194304,), meta=np.ndarray>
    Data variables:
        pr         (time, cell) float32 199GB dask.array<chunksize=(1, 4194304), meta=np.ndarray>
        tas        (time, cell) float32 199GB dask.array<chunksize=(1, 4194304), meta=np.ndarray>
    Attributes: (12/14)
        CDI:                       Climate Data Interface version 2.0.0rc2 (https...
        number_of_grid_used:       15
        grid_file_uri:             http://icon-downloads.mpimet.mpg.de/grids/publ...
        uuidOfHGrid:               0f1e7d66-637e-11e8-913b-51232bb4d8f9
        source:                    git@gitlab.dkrz.de:icon/icon-aes.git@6b5726d38...
        institution:               Max Planck Institute for Meteorology
        ...                        ...
        references:                see MPIM/DWD publications
        CDO:                       Climate Data Operators version 2.0.0rc2 (https...
        cdo_openmp_thread_number:  4
        healpix_nside:             1024
        healpix_level:             10
        healpix_order:             nested
nextGEMS — next Generation Earth Modelling Systems

Status: Available

nextGEMS is an EU Horizon 2020 project developing km-scale coupled Earth system models. Multi-decadal simulations with ICON and IFS-FESOM at ocean-eddy-permitting resolution, providing a testbed for next-generation climate projections.

Property Value
Source resolution ~5–10 km
HEALPix level 10
Models ICON
Variables T2m, Precip,

S3 store: s3://nextgems/ Notes:* Subfolder for time frequencies, more metadata

Example nextgems/ngc4008_PT15M_10.zarr
    <xarray.Dataset> Size: 106TB
    Dimensions:  (time: 1051968, cell: 12582912)
    Dimensions without coordinates: time, cell
    Data variables:
        pr       (time, cell) float32 53TB dask.array<chunksize=(16, 262144), meta=np.ndarray>
        tas      (time, cell) float32 53TB dask.array<chunksize=(16, 262144), meta=np.ndarray>
ICON-DREAM — DWD Reanalysis

Status: Available

ICON-DREAM is a regional and global reanalysis effort by the German Weather Service (DWD) using the ICON modelling framework. It produces high-resolution atmospheric reanalysis fields on the native ICON triangular mesh.

Property Value
Source grid ICON unstructured (triangular)
HEALPix level 8
Variables T2m, Precip,

S3 store: s3://icon-dream Notes:* Consistent places

Example /icon-dream/healpix/icon-dream-global/hourly/level_8.zarr
    <xarray.Dataset> Size: 2TB
    Dimensions:    (time: 137169, cell: 786432)
    Coordinates:
      * time       (time) datetime64[ns] 1MB 2009-12-31T22:00:00 ... 2025-08-30T2...
      * cell       (cell) int64 6MB 0 1 2 3 4 ... 786427 786428 786429 786430 786431
        latitude   (cell) float64 6MB dask.array<chunksize=(786432,), meta=np.ndarray>
        longitude  (cell) float64 6MB dask.array<chunksize=(786432,), meta=np.ndarray>
    Data variables:
        t2m        (time, cell) float64 863GB dask.array<chunksize=(10, 786432), meta=np.ndarray>
        tp         (time, cell) float64 863GB dask.array<chunksize=(10, 786432), meta=np.ndarray>
    Attributes:
        Conventions:             CF-1.7
        GRIB_centre:             edzw
        GRIB_centreDescription:  Offenbach
        GRIB_edition:            2
        GRIB_subCentre:          255
        healpix_level:           8
        healpix_nside:           256
        healpix_order:           nested
        history:                 2026-03-27T08:35 GRIB to CDM+CF via cfgrib-0.9.1...
        institution:             Offenbach
ORCESTRA — Organised Convection and EarthCARE Studies over the Tropical Atlantic

Status: Available

ORCESTRA is a coordinated field campaign (2024) combining aircraft, ship, and satellite observations over the tropical Atlantic to study organised deep convection and validate EarthCARE retrievals. Includes HALO, ATR-42, and METEOR observations alongside high-resolution ICON simulations.

Property Value
Source resolution Campaign-dependent
HEALPix level TBD

S3 store: s3://orchestra Notes: More sub folders

Example /orchestra/Basic_Halo_Measurement_and_Sensor_System_BAHAMAS_data.zarr
    <xarray.Dataset> Size: 5MB
    Dimensions:  (time: 55628)
    Coordinates:
      * time     (time) datetime64[ns] 445kB 2024-08-16T07:19:00 ... 2024-09-23T2...
        height   float64 8B ...
    Data variables:
        Dauer    (time) int64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        DD       (time) int64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        FF       (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        Lat      (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        Long     (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        RH       (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        RR_SRM   (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        Tro1     (time) int64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        Trs      (time) int64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        TT       (time) float64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
        VVV      (time) int64 445kB dask.array<chunksize=(55628,), meta=np.ndarray>
    Attributes:
        creator_email:  Martin.Stelzner@dwd.de, daniel.klocke@mpimet.mpg.de
        creator_name:   Martin Stelzner, Daniel Klocke
        featureType:    trajectory
        history:        Converted to Zarr by Lukas Kluft (lukas.kluft@mpimet.mpg.de)
        keywords:       precipitation amount, precipitation gauge, precipitation ...
        license:        CC-BY-4.0
        platform:       RV METEOR
        project:        ORCESTRA, BOW-TIE
        summary:        The rain gauge has an upper and a lateral collecting surf...
        title:          Rain gauge measurements during METEOR cruise M203
CMIP6 — Coupled Model Intercomparison Project Phase 6

Status: Available

CMIP6 is the latest phase of the international model intercomparison providing the scientific basis for IPCC assessment reports. Dozens of climate models at resolutions from ~25 km to ~250 km, covering historical simulations, future projections, and targeted experiments.

Property Value
Source resolution ~25–250 km (model-dependent)
HEALPix level 5 - 6
Variables T2m, Precip,
S3 store: s3://cmip6
Example cmip6/healpix/cmip6/historical-r1i1p1f1/noresm2-mm/PT6H/level_5.zarr
    <xarray.Dataset> Size: 19GB
    Dimensions:    (time: 94900, lat: 192, bnds: 2, lon: 288, cell: 12288)
    Coordinates:
      * time       (time) object 759kB 1950-01-01 03:00:00 ... 2014-12-31 21:00:00
      * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
      * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
      * cell       (cell) int64 98kB 0 1 2 3 4 5 ... 12283 12284 12285 12286 12287
        crs        float64 8B ...
        height     float64 8B ...
        latitude   (cell) float64 98kB dask.array<chunksize=(12288,), meta=np.ndarray>
        longitude  (cell) float64 98kB dask.array<chunksize=(12288,), meta=np.ndarray>
    Dimensions without coordinates: bnds
    Data variables:
        lat_bnds   (time, lat, bnds) float64 292MB dask.array<chunksize=(14600, 192, 2), meta=np.ndarray>
        lon_bnds   (time, lon, bnds) float64 437MB dask.array<chunksize=(14600, 288, 2), meta=np.ndarray>
        pr         (time, cell) float64 9GB dask.array<chunksize=(14600, 12288), meta=np.ndarray>
        tas        (time, cell) float64 9GB dask.array<chunksize=(14600, 12288), meta=np.ndarray>
        time_bnds  (time, bnds) object 2MB dask.array<chunksize=(1, 2), meta=np.ndarray>
    Attributes: (12/54)
        Conventions:               CF-1.7 CMIP-6.2
        activity_id:               CMIP
        branch_method:             Hybrid-restart from year 1200-01-01 of piControl
        branch_time:               0.0
        branch_time_in_child:      0.0
        branch_time_in_parent:     438000.0
        ...                        ...
        table_id:                  6hrPlev
        table_info:                Creation Date:(24 July 2019) MD5:0bb394a356ef9...
        title:                     NorESM2-MM output prepared for CMIP6
        tracking_id:               hdl:21.14100/19ec4c56-4a33-4abb-a992-efe6324676bd
        variable_id:               pr
        variant_label:             r1i1p1f

Open decisions

Can we define an appropriate naming convention?

Suggestion: <bucket>/healpix/<experiment-compaign>/<model>/<freq>/level_X.zarr"

Naming conventions can be quite different for different datasets, but we can still aim at having a number four directory levels. If those four levels don't guarantee unique paths the directory names themselves can be adjusted to from uniq name patterns such as:

<bucket>/healpix/<product>/<instrument-level>/<freq>/level_X.zarr
<bucket>/healpix/<product-experiment>/<model-ensemble>/<freq>/level_X.zarr

The output time frequency freq should follow ISO 8601 standard.

Where to store cached weight files?

Weight files are reusable across runs for the same source grid and target level. The weight files with their grid signature should be stored at the following location:

ls /work/ks1387/healpix-weights

weights_0ba7e6dca9ba1ae9.nc  weights_3051722bc32a01e5.nc  weights_46fdfc6feb8ea520.nc  weights_d9c2730b22295f4a.nc
weights_2aff1785f62b0254.nc  weights_3a28f272e1fb6024.nc  weights_901fbfc4a3ce2458.nc

To make sure that the weight files are getting reproducible and reusable stored use cache_path=/work/ks1387/healpix-weights weights generation.

Zarr format: v2 or v3?

Some clients (gdal, zarrita) still require Zarr v2. Zarr v3 is the future standard but ecosystem support is still catching up.

Current decision: Write Zarr v2 with consolidated metadata until gridlook supports v3.


Roadmap

Completed

  • Conservative remapping pipeline (grid-doctor) with GPU acceleration
  • 1st round of Datasets uploaded

Next steps

  • Operationalise the remapping pipeline (reproducible batch jobs)
  • Set up shared weight-file cache on Lustre
  • Automatic tape archival and retrieval of Zarr stores
  • STAC catalogue integration for data discoverability
  • Freva databrowser registration
  • Documentation of the full process and methodology