Re-create the sample dataset

The following example presents a simple but comprehensive recipe for creating an Anemoi dataset from the ERA5 atmospheric reanalysis. Once built, the dataset can be used to train an ML atmospheric model, such as AIFS, at low resolution. A prebuilt version of the dataset generated by this recipe can be downloaded from this link: era5-o48-2020-2021-6h-v1.zip. Do not unzip the file; you can pass it directly to open_dataset.
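As a minimal sketch of that last step, the snippet below opens the downloaded archive and collects a few basic properties. It assumes that anemoi-datasets is installed and that the zip file sits in the current directory; the helper name describe_dataset is hypothetical and only used for illustration.

```python
from pathlib import Path


def describe_dataset(path):
    """Open an Anemoi dataset and return a few basic properties."""
    # Imported inside the function so the sketch can be loaded
    # even when anemoi-datasets is not installed.
    from anemoi.datasets import open_dataset

    ds = open_dataset(str(path))
    return {
        "shape": ds.shape,          # (dates, variables, members, grid points)
        "variables": ds.variables,  # list of variable names
        "first_date": ds.dates[0],
        "last_date": ds.dates[-1],
    }


if __name__ == "__main__":
    # Assumption: the archive was downloaded to the current directory.
    archive = Path("era5-o48-2020-2021-6h-v1.zip")
    if archive.exists():
        print(describe_dataset(archive))
    else:
        print(f"{archive} not found; download it first")
```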

Warning

The file era5-o48-2020-2021-6h-v1.zip is approximately 2.5 GB.

name: era5-o48-2020-2021-6h-v1

description: Low-resolution reduced dataset for documentation purposes

attribution: ECMWF/C3S

licence: CC-BY-4.0

dates:
  start: '2020-01-01T00:00:00'
  end: '2021-12-31T23:00:00'
  frequency: 6h

input:
  join:
  - mars:
      use_cdsapi_dataset: "reanalysis-era5-complete"
      class: ea
      expver: '0001'
      grid: o48
      levtype: sfc
      param:
      - 10u
      - 10v
      - 2d
      - 2t
      - lsm
      - msl
      - sdor
      - skt
      - slor
      - sp
      - tcw
      - z
  - mars:
      use_cdsapi_dataset: "reanalysis-era5-complete"
      class: ea
      expver: '0001'
      grid: o48
      level:
      - 250
      - 500
      - 850
      - 1000
      levtype: pl
      param:
      - u
      - v
      - q
      - t
      - z
  - accumulate:
      period: 6h
      availability: auto
      source:
        mars:
          use_cdsapi_dataset: "reanalysis-era5-complete"
          class: ea
          expver: '0001'
          grid: o48
          param:
          - cp
          - tp
  - constants:
      param:
      - cos_latitude
      - cos_longitude
      - sin_latitude
      - sin_longitude
      - cos_julian_day
      - cos_local_time
      - sin_julian_day
      - sin_local_time
      - insolation
      template: ${input.join.0.mars}

Let’s break down the recipe to understand its main components! Some concepts presented below will be explained in more detail in the USER GUIDE section.

Dataset Naming and Description

name: era5-o48-2020-2021-6h-v1

description: Low-resolution reduced dataset for documentation purposes

attribution: ECMWF/C3S

licence: CC-BY-4.0

In the first lines of the recipe, we define the dataset name, a brief description, the attribution to the data source, and the licence under which the dataset is released. This information will be stored in the dataset metadata.

Dates

dates:
  start: '2020-01-01T00:00:00'
  end: '2021-12-31T23:00:00'
  frequency: 6h

Here, we define the time dimension of the Anemoi dataset: the start and end dates and the frequency of the time steps. ERA5 itself has a 1-hour frequency, but we choose to downsample it to 6 hours, the typical timestep of the AIFS model. This does not force the final model to use a 6-hour timestep; it could use any multiple of 6 hours.
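To see what these three settings imply, the following sketch enumerates the dates with the standard library only. It assumes the range is expanded as a plain arithmetic sequence starting at start and never exceeding end; this illustrates the arithmetic and is not the code used by anemoi-datasets.

```python
from datetime import datetime, timedelta

start = datetime.fromisoformat("2020-01-01T00:00:00")
end = datetime.fromisoformat("2021-12-31T23:00:00")
step = timedelta(hours=6)

# Enumerate every 6-hourly date from start, stopping once end is exceeded.
dates = []
current = start
while current <= end:
    dates.append(current)
    current += step

print(len(dates))  # 2924
print(dates[-1])   # 2021-12-31 18:00:00
```

The last date is 18:00 rather than 23:00 because 23:00 does not fall on a 6-hour step. Both numbers agree with the Start, End and Shape lines printed by the inspect command further down this page.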

Getting the Data from Different Streams

In the remainder of the recipe (input), we join variables coming from different streams to form the dataset. In particular, the following streams are used:

  • Surface fields from the Copernicus Climate Data Store (CDS) through MARS requests.

  • Pressure level fields from the CDS through MARS requests.

  • Accumulated fields from the CDS through MARS requests.

  • Computed constants generated by anemoi-datasets.
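As a rough cross-check of how these four sources combine, the sketch below rebuilds the list of variable names from the recipe. Two assumptions are inferred from the inspect output shown on this page rather than from the library internals: pressure-level fields are named param_level, and the final variable list is sorted alphabetically.

```python
# Parameters copied from the recipe above.
surface = ["10u", "10v", "2d", "2t", "lsm", "msl",
           "sdor", "skt", "slor", "sp", "tcw", "z"]
pl_params = ["u", "v", "q", "t", "z"]
levels = [250, 500, 850, 1000]
accumulated = ["cp", "tp"]
constants = ["cos_latitude", "cos_longitude", "sin_latitude", "sin_longitude",
             "cos_julian_day", "cos_local_time", "sin_julian_day",
             "sin_local_time", "insolation"]

# Assumption: one variable per (param, level) pair, named "<param>_<level>".
pressure_level = [f"{p}_{lev}" for p in pl_params for lev in levels]

# Assumption: variables end up in alphabetical order.
variables = sorted(surface + pressure_level + accumulated + constants)

print(len(variables))                # 43
print(variables[10], variables[38])  # lsm z
```

The total of 43 (12 surface, 20 pressure-level, 2 accumulated and 9 computed variables) matches the second dimension of the dataset shape reported by the inspect command.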

The resolution of the dataset is also specified here, through the grid key of each MARS request: an o48 octahedral Gaussian grid in this example.
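The number of grid points follows directly from the grid name. The sketch below assumes the usual definition of an octahedral reduced Gaussian grid, in which the i-th latitude row counted from a pole holds 4i + 16 points and each hemisphere has N rows; the function name octahedral_points is hypothetical.

```python
def octahedral_points(n: int) -> int:
    """Total number of points of an octahedral reduced Gaussian grid O<n>."""
    # Row i (1-based, counted from the pole) holds 4*i + 16 points,
    # and there are n rows in each hemisphere.
    per_hemisphere = sum(4 * i + 16 for i in range(1, n + 1))
    return 2 * per_hemisphere


print(octahedral_points(48))  # 10944
```

This is equivalent to the closed form 4N(N + 9) and matches the field shape of 10944 reported for this dataset.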

How Do We Build the Dataset?

The dataset is built via the anemoi-datasets command-line tool as follows:

$ anemoi-datasets create era5-o48-2020-2021-6h-v1.yaml era5-o48-2020-2021-6h-v1.zarr

The YAML file era5-o48-2020-2021-6h-v1.yaml contains the recipe, and the output is stored in era5-o48-2020-2021-6h-v1.zarr.

Warning

Running the recipe requires access to the CDS to be set up correctly. Please refer to the CDS page for instructions.

Once the build is complete, you can inspect the dataset using the following command:

$ anemoi-datasets inspect era5-o48-2020-2021-6h-v1.zarr

This produces the following output:

📦 Path          : ./era5-o48-2020-2021-6h-v1.zarr
🔢 Format version: 0.30.0

📅 Start      : 2020-01-01 00:00
📅 End        : 2021-12-31 18:00
⏰ Frequency  : 6h
🚫 Missing    : 0
🌎 Resolution : O48
🌎 Field shape: [10944]

📐 Shape      : 2,924 × 43 × 1 × 10,944 (5.1 GiB)
💽 Size       : 2.8 GiB (2.8 GiB)
📁 Files      : 3,040

Index  Variable                 Min        Max          Mean        Stdev
──────┼────────────────┼──────────────┼───────────┼──────────────┼────────────
    0  10u                 -35.0512    30.6795     -0.550947      5.47129
    1  10v                 -34.5802     33.542      0.212642      4.51989
    2  2d                   191.632    304.995       282.699      15.8811
    3  2t                   195.362    325.785        287.83      16.1394
    4  cos_julian_day     -0.999998          1    -0.0562081     0.721861
    5  cos_latitude       0.0249178   0.999867      0.785126     0.237451
    6  cos_local_time            -1          1   9.31395e-13     0.707107
    7  cos_longitude             -1          1   3.48565e-10     0.707107
    8  cp                         0  0.0859051   0.000384542   0.00126894
    9  insolation                 0          1      0.250838     0.323889
   10  lsm                        0          1      0.287684     0.441725
   11  msl                  92639.9     106800        101148       1132.7
   12  q_1000                 1e-08  0.0293589    0.00963055   0.00588138
   13  q_250           -4.33519e-05   0.001311   8.47611e-05  9.17112e-05
   14  q_500           -5.44315e-05  0.0100051    0.00120283   0.00132836
   15  q_850                  1e-08  0.0228447    0.00629476   0.00432839
   16  sdor                       0    679.961       20.9144      61.4512
   17  sin_julian_day     -0.999999   0.999999      0.180368     0.665751
   18  sin_latitude        -0.99969    0.99969             0     0.572008
   19  sin_local_time            -1          1   1.49023e-13     0.707107
   20  sin_longitude             -1          1  -2.17853e-09     0.707107
   21  skt                  194.347    342.479       288.565      16.9845
   22  slor                  0.0001   0.115619    0.00346549    0.0100738
   23  sp                   50213.9     106807       98447.5      7059.61
   24  t_1000               219.141    324.407       288.891      13.7525
   25  t_250                195.464    247.214       226.305      7.57534
   26  t_500                218.042    285.169       259.119      11.1935
   27  t_850                219.383    313.381       281.671      12.6545
   28  tcw                0.0523909    127.525       25.5015       17.518
   29  tp                         0   0.204357   0.000744018   0.00257298
   30  u_1000              -31.3894    29.2405      -0.60533      6.04254
   31  u_250               -65.3296    112.971        13.497      18.5855
   32  u_500               -49.8631    82.0021       6.03161      11.8884
   33  u_850                -56.938    52.5499      0.718384      8.11392
   34  v_1000              -34.9257     30.746      0.220698      4.95779
   35  v_250               -87.8586    100.591    -0.0613504      12.6573
   36  v_500                -63.541    76.2497    -0.0431783      8.24538
   37  v_850               -47.4795    52.8012      0.108581      5.63597
   38  z                   -374.223      54350       2377.04      6366.77
   39  z_1000              -6045.95    4721.73       943.065      915.956
   40  z_250                87717.1     109625        103869      4752.65
   41  z_500                43732.3      58934       55645.5      2822.68
   42  z_850                6671.23    16961.5       14286.1      1239.99
──────┴────────────────┴──────────────┴───────────┴──────────────┴────────────
🪫 Dataset not initialised
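
The Shape line above can be checked with a short calculation. The sketch below assumes the four axes are dates, variables, ensemble members and grid points, and that values are stored as 4-byte floats; the second assumption is inferred from the reported 5.1 GiB and is not stated in the output itself.

```python
import math

# Shape as reported by the inspect command:
# dates x variables x ensemble members x grid points
shape = (2924, 43, 1, 10944)

# Assumption: each value is stored as a 32-bit float (4 bytes).
bytes_per_value = 4

n_values = math.prod(shape)
size_gib = n_values * bytes_per_value / 2**30

print(n_values)               # 1376011008
print(f"{size_gib:.1f} GiB")  # 5.1 GiB
```

The 2.8 GiB reported on the Size line is the space actually used on disk, which is smaller than this uncompressed figure.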