Re-create the sample dataset
The following example presents a simple but comprehensive recipe for
creating an Anemoi dataset from the ERA5 atmospheric reanalysis. Once
built, the dataset can be used to train a low-resolution ML atmospheric
model, such as AIFS. A prebuilt version of the dataset generated by this
recipe is available for download as era5-o48-2020-2021-6h-v1.zip.
Do not unzip the file; you can pass it directly to open_dataset.
Warning
The file era5-o48-2020-2021-6h-v1.zip is approximately 2.5 GB.
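As a minimal sketch of passing the archive directly to open_dataset, the snippet below opens the zip file and collects a few basic properties. It assumes that anemoi-datasets is installed and that the archive sits in the current directory; the path and the helper name summarise are illustrative only.

```python
from pathlib import Path


def summarise(path: str) -> dict:
    """Open an Anemoi dataset and return a few of its basic properties."""
    # Imported lazily so the module can be loaded without anemoi-datasets.
    from anemoi.datasets import open_dataset

    ds = open_dataset(path)  # the .zip archive is passed as-is, not extracted
    return {
        "n_dates": len(ds),
        "variables": list(ds.variables),
        "shape": ds.shape,
    }


# Hypothetical local path: adjust to wherever the archive was downloaded.
archive = Path("era5-o48-2020-2021-6h-v1.zip")
if archive.exists():
    print(summarise(str(archive)))
```

The existence check only keeps the sketch from failing when the archive has not been downloaded yet.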
name: era5-o48-2020-2021-6h-v1
description: Low resolution reduced dataset for documentation purposes
attribution: ECMWF/C3S
licence: CC-BY-4.0

dates:
  start: '2020-01-01T00:00:00'
  end: '2021-12-31T23:00:00'
  frequency: 6h

input:
  join:
  - mars:
      use_cdsapi_dataset: "reanalysis-era5-complete"
      class: ea
      expver: '0001'
      grid: o48
      levtype: sfc
      param:
      - 10u
      - 10v
      - 2d
      - 2t
      - lsm
      - msl
      - sdor
      - skt
      - slor
      - sp
      - tcw
      - z
  - mars:
      use_cdsapi_dataset: "reanalysis-era5-complete"
      class: ea
      expver: '0001'
      grid: o48
      level:
      - 250
      - 500
      - 850
      - 1000
      levtype: pl
      param:
      - u
      - v
      - q
      - t
      - z
  - accumulate:
      period: 6h
      availability: auto
      source:
        mars:
          use_cdsapi_dataset: "reanalysis-era5-complete"
          class: ea
          expver: '0001'
          grid: o48
          param:
          - cp
          - tp
  - constants:
      param:
      - cos_latitude
      - cos_longitude
      - sin_latitude
      - sin_longitude
      - cos_julian_day
      - cos_local_time
      - sin_julian_day
      - sin_local_time
      - insolation
      template: ${input.join.0.mars}

Let’s break down the recipe to understand its main components! Some concepts presented below will be explained in more detail in the USER GUIDE section.
Dataset Naming and Description
name: era5-o48-2020-2021-6h-v1
description: Low-resolution reduced dataset for documentation purposes
attribution: ECMWF/C3S
licence: CC-BY-4.0
In the first lines of the recipe, we define the dataset name, a brief description, the attribution to the data source, and the licence under which the dataset is released. This information will be stored in the dataset metadata.
Dates
dates:
  start: '2020-01-01T00:00:00'
  end: '2021-12-31T23:00:00'
  frequency: 6h
Here, we define the time dimension of the Anemoi dataset: the start and end dates and the frequency of the time steps. ERA5 has a 1-hour frequency, but we subsample it to 6 hours, the typical timestep of the AIFS model. This does not mean that the final model must use a 6-hour timestep; it can use any multiple of it, such as 12 hours.
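To make the effect of these three keys concrete, the following sketch enumerates the dates they describe using only the Python standard library. It assumes that dates are generated from start in steps of frequency, up to and including end; this is an illustration of the arithmetic, not the code used by anemoi-datasets.

```python
from datetime import datetime, timedelta

start = datetime.fromisoformat("2020-01-01T00:00:00")
end = datetime.fromisoformat("2021-12-31T23:00:00")
step = timedelta(hours=6)  # frequency: 6h

# Walk from start to end (inclusive) in 6-hour steps.
dates = []
current = start
while current <= end:
    dates.append(current)
    current += step

print(len(dates))   # 2924
print(dates[-1])    # 2021-12-31 18:00:00
```

Under this assumption the last date falls at 18:00 even though end is set to 23:00, because no 6-hour step lands between 18:00 and 23:00.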
Getting the Data from Different Streams
In the remainder of the recipe (input), we join variables coming from different streams to form the dataset. In particular, the following streams are used:
- Surface fields from the Copernicus Climate Data Store (CDS) through MARS requests.
- Pressure level fields from the CDS through MARS requests.
- Accumulated fields from the CDS through MARS requests.
- Computed constants generated by anemoi-datasets.
At this point, we also specify the resolution of the dataset, an o48 octahedral Gaussian grid in our specific case.
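The number of variables produced by joining these four sources can be worked out by hand. The sketch below does so in plain Python, assuming that each pressure-level parameter yields one variable per level, named param_level (for example q_850). Sorting the names alphabetically is a choice made for this illustration, not a documented guarantee of the tool.

```python
surface = ["10u", "10v", "2d", "2t", "lsm", "msl",
           "sdor", "skt", "slor", "sp", "tcw", "z"]
pressure_params = ["u", "v", "q", "t", "z"]
levels = [250, 500, 850, 1000]
accumulated = ["cp", "tp"]
constants = ["cos_latitude", "cos_longitude", "sin_latitude",
             "sin_longitude", "cos_julian_day", "cos_local_time",
             "sin_julian_day", "sin_local_time", "insolation"]

# One variable per (parameter, level) pair, named e.g. "q_850".
pressure = [f"{param}_{level}" for param in pressure_params for level in levels]

variables = sorted(surface + pressure + accumulated + constants)

print(len(variables))   # 43 = 12 surface + 20 pressure-level + 2 accumulated + 9 constants
print(variables[:4])    # ['10u', '10v', '2d', '2t']
```

Note that the surface geopotential z and the pressure-level variables z_250, z_500, z_850 and z_1000 are distinct entries.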
How Do We Build the Dataset?
The dataset is built via the anemoi-datasets command-line tool as follows:
$ anemoi-datasets create era5-o48-2020-2021-6h-v1.yaml era5-o48-2020-2021-6h-v1.zarr
The YAML file era5-o48-2020-2021-6h-v1.yaml contains the recipe, and
the output is stored in era5-o48-2020-2021-6h-v1.zarr.
Warning
Running the recipe requires access to the CDS to be set up correctly. For details, please refer to the CDS page.
Once the build is complete, you can inspect the dataset using the following command:
$ anemoi-datasets inspect era5-o48-2020-2021-6h-v1.zarr
This produces the following output:
📦 Path : ./era5-o48-2020-2021-6h-v1.zarr
🔢 Format version: 0.30.0
📅 Start : 2020-01-01 00:00
📅 End : 2021-12-31 18:00
⏰ Frequency : 6h
🚫 Missing : 0
🌎 Resolution : O48
🌎 Field shape: [10944]
📐 Shape : 2,924 × 43 × 1 × 10,944 (5.1 GiB)
💽 Size : 2.8 GiB (2.8 GiB)
📁 Files : 3,040
Index │ Variable │ Min │ Max │ Mean │ Stdev
──────┼────────────────┼──────────────┼───────────┼──────────────┼────────────
0 │ 10u │ -35.0512 │ 30.6795 │ -0.550947 │ 5.47129
1 │ 10v │ -34.5802 │ 33.542 │ 0.212642 │ 4.51989
2 │ 2d │ 191.632 │ 304.995 │ 282.699 │ 15.8811
3 │ 2t │ 195.362 │ 325.785 │ 287.83 │ 16.1394
4 │ cos_julian_day │ -0.999998 │ 1 │ -0.0562081 │ 0.721861
5 │ cos_latitude │ 0.0249178 │ 0.999867 │ 0.785126 │ 0.237451
6 │ cos_local_time │ -1 │ 1 │ 9.31395e-13 │ 0.707107
7 │ cos_longitude │ -1 │ 1 │ 3.48565e-10 │ 0.707107
8 │ cp │ 0 │ 0.0859051 │ 0.000384542 │ 0.00126894
9 │ insolation │ 0 │ 1 │ 0.250838 │ 0.323889
10 │ lsm │ 0 │ 1 │ 0.287684 │ 0.441725
11 │ msl │ 92639.9 │ 106800 │ 101148 │ 1132.7
12 │ q_1000 │ 1e-08 │ 0.0293589 │ 0.00963055 │ 0.00588138
13 │ q_250 │ -4.33519e-05 │ 0.001311 │ 8.47611e-05 │ 9.17112e-05
14 │ q_500 │ -5.44315e-05 │ 0.0100051 │ 0.00120283 │ 0.00132836
15 │ q_850 │ 1e-08 │ 0.0228447 │ 0.00629476 │ 0.00432839
16 │ sdor │ 0 │ 679.961 │ 20.9144 │ 61.4512
17 │ sin_julian_day │ -0.999999 │ 0.999999 │ 0.180368 │ 0.665751
18 │ sin_latitude │ -0.99969 │ 0.99969 │ 0 │ 0.572008
19 │ sin_local_time │ -1 │ 1 │ 1.49023e-13 │ 0.707107
20 │ sin_longitude │ -1 │ 1 │ -2.17853e-09 │ 0.707107
21 │ skt │ 194.347 │ 342.479 │ 288.565 │ 16.9845
22 │ slor │ 0.0001 │ 0.115619 │ 0.00346549 │ 0.0100738
23 │ sp │ 50213.9 │ 106807 │ 98447.5 │ 7059.61
24 │ t_1000 │ 219.141 │ 324.407 │ 288.891 │ 13.7525
25 │ t_250 │ 195.464 │ 247.214 │ 226.305 │ 7.57534
26 │ t_500 │ 218.042 │ 285.169 │ 259.119 │ 11.1935
27 │ t_850 │ 219.383 │ 313.381 │ 281.671 │ 12.6545
28 │ tcw │ 0.0523909 │ 127.525 │ 25.5015 │ 17.518
29 │ tp │ 0 │ 0.204357 │ 0.000744018 │ 0.00257298
30 │ u_1000 │ -31.3894 │ 29.2405 │ -0.60533 │ 6.04254
31 │ u_250 │ -65.3296 │ 112.971 │ 13.497 │ 18.5855
32 │ u_500 │ -49.8631 │ 82.0021 │ 6.03161 │ 11.8884
33 │ u_850 │ -56.938 │ 52.5499 │ 0.718384 │ 8.11392
34 │ v_1000 │ -34.9257 │ 30.746 │ 0.220698 │ 4.95779
35 │ v_250 │ -87.8586 │ 100.591 │ -0.0613504 │ 12.6573
36 │ v_500 │ -63.541 │ 76.2497 │ -0.0431783 │ 8.24538
37 │ v_850 │ -47.4795 │ 52.8012 │ 0.108581 │ 5.63597
38 │ z │ -374.223 │ 54350 │ 2377.04 │ 6366.77
39 │ z_1000 │ -6045.95 │ 4721.73 │ 943.065 │ 915.956
40 │ z_250 │ 87717.1 │ 109625 │ 103869 │ 4752.65
41 │ z_500 │ 43732.3 │ 58934 │ 55645.5 │ 2822.68
42 │ z_850 │ 6671.23 │ 16961.5 │ 14286.1 │ 1239.99
──────┴────────────────┴──────────────┴───────────┴──────────────┴────────────
🪫 Dataset not initialised
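The Shape line of this output can be cross-checked with a little arithmetic. The sketch below recomputes the number of grid points of an O48 octahedral reduced Gaussian grid (20 points on the latitude circle nearest each pole, 4 more on each circle towards the equator) and the uncompressed size of the array. The 4-byte element size is an assumption (32-bit floats) that is consistent with the reported 5.1 GiB, not something stated in the output.

```python
N = 48  # O48: 48 latitude circles between each pole and the equator

# Latitude circle i (counted from the pole) holds 4 * i + 16 points.
points_per_hemisphere = sum(4 * i + 16 for i in range(1, N + 1))
grid_points = 2 * points_per_hemisphere

n_dates = 2924      # 731 days x 4 dates per day
n_variables = 43
n_ensemble = 1
bytes_per_value = 4  # assumption: values stored as 32-bit floats

n_values = n_dates * n_variables * n_ensemble * grid_points
size_gib = n_values * bytes_per_value / 2**30

print(grid_points)          # 10944
print(f"{size_gib:.1f} GiB")  # 5.1 GiB
```

The Size line (2.8 GiB) is smaller than this figure because it reports the space used on disk rather than the uncompressed array.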