Tabular
Note
This page describes what is specific to the tabular layout. For more general information on creating and using datasets, see Building your own datasets and Using an existing dataset respectively.
Creating
To create a tabular dataset, the layout entry in the recipe’s
output section must be set to tabular:
output:
  layout: tabular
Using
To open a tabular dataset, you use the open_dataset function with the start, end, window and frequency parameters.
ds = open_dataset(
    dataset,
    start=1979,
    end=2020,
    window="(-6h,0]",
    frequency="6h",
)
The default values for start and end are the first and last dates
of the dataset, respectively. Because these values may not fall on round
hours, it is recommended to set them explicitly.
Unlike for gridded datasets, the start, end and frequency
parameters can have arbitrary values, and are used to define how windows
are built and how many samples are in the dataset. Note that start
and end can be outside the range of actual dates in the dataset.
When requesting windows outside the range of actual dates, empty records
will be returned to the user.
The default value for window is (-3h,0] and the default value
for frequency is 3h. Windows are relative time intervals
that can be open or closed at each end. A round bracket indicates an
open end, while a square bracket indicates a closed end. The default
units are hours.
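To make the notation concrete, here is a minimal sketch of how such a window string could be parsed. The helper names parse_window and _parse_bound are hypothetical and are not part of the library. The sketch assumes that only the h and d suffixes are used and that a bare number means hours.

```python
import re
from datetime import timedelta

# Hypothetical helpers, not part of the library: they only mirror the
# window notation described above.
_UNITS = {"h": "hours", "d": "days"}  # assumed suffixes; a bare number means hours
_BOUND = re.compile(r"^([+-]?\d+)([a-z]*)$")


def _parse_bound(text):
    """Convert a bound such as '-6h', '+3' or '-1d' to a timedelta."""
    match = _BOUND.match(text.strip())
    if match is None:
        raise ValueError(f"Invalid window bound: {text!r}")
    value, unit = match.groups()
    unit = unit or "h"
    if unit not in _UNITS:
        raise ValueError(f"Unsupported unit in window bound: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def parse_window(window):
    """Return (start, end, include_start, include_end) for a window string."""
    window = window.strip()
    if len(window) < 2 or window[0] not in "([" or window[-1] not in ")]":
        raise ValueError(f"Invalid window: {window!r}")
    start_text, end_text = window[1:-1].split(",")
    return (
        _parse_bound(start_text),
        _parse_bound(end_text),
        window[0] == "[",   # square bracket: the start is included
        window[-1] == "]",  # square bracket: the end is included
    )


start, end, include_start, include_end = parse_window("(-6h,0]")
```

With the window "(-6h,0]", the start bound is minus six hours and is excluded, while the end bound is the reference date itself and is included.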
Windows can be open or closed at each end:
"[-3,+3]" # Both ends are included
"(-1d,0]" # Start is open, end is closed
Data samples
The dataset is made of samples, which are built by applying the
window to a list of reference dates defined by the start,
end and frequency parameters.
Reference dates
The reference dates of the dataset are defined as all dates between
start and end (inclusive) with a step of frequency.
result = []
date = start
while date <= end:
    result.append(date)
    date += frequency
Note
The reference dates are not necessarily the same as the actual
dates in the dataset. They are used, together with the window
parameter, to define the samples returned when iterating the dataset.
See below for more information. Nevertheless, in order to ensure
compatibility with gridded datasets, the reference dates are
available as the dates attribute of the dataset.
It is not currently possible to combine tabular and gridded datasets within a single call to open_dataset,
but once this is implemented, ds.dates, ds.frequency,
len(ds), etc. will all be compatible and comparable between the
two layouts.
ds.dates # Returns the list of reference dates defined by start, end and frequency
The number of samples in the dataset is then given by the formula:
number_of_samples = (end - start) // frequency + 1
To get the list of dates, you can access the dates attribute of the
dataset:
ds.dates # Returns the list of reference dates defined by start, end and frequency
The length of the dataset is equal to the number of samples:
assert len(ds) == number_of_samples
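As a worked example of the loop and the formula above, the following sketch uses only the standard datetime module. The function name reference_dates and the chosen dates are illustrative and do not come from the library.

```python
from datetime import datetime, timedelta


def reference_dates(start, end, frequency):
    """List every date from start to end (inclusive) in steps of frequency."""
    dates = []
    date = start
    while date <= end:
        dates.append(date)
        date += frequency
    return dates


# Illustrative values: one day of reference dates every 6 hours.
start = datetime(2020, 1, 1, 0)
end = datetime(2020, 1, 2, 0)
frequency = timedelta(hours=6)

dates = reference_dates(start, end, frequency)

# Dividing two timedeltas with // yields an integer number of steps.
number_of_samples = (end - start) // frequency + 1

print(len(dates), number_of_samples)  # 5 5
```

The loop and the formula agree: 24 hours divided by 6 hours gives 4 steps, plus one for the starting date.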
Single sample
A sample is a 2D numpy array that is returned when indexing the dataset
with an integer. The first dimension of the array is the number of
observations in the window, and the second dimension is the number of
variables. Each sample is constructed by applying the window to the
corresponding date. For example, if the date is 2020-01-01 00:00:00 and
the window is (-3h,0], then the sample will contain all observations
between 2019-12-31 21:00:00 and 2020-01-01 00:00:00, including the
latter but not the former.
sample = ds[42]
# A 2D array is returned, the first dimension is the number of observations
# in the 43rd window (samples are 0-indexed).
assert len(sample.shape) == 2
# The second dimension is the variables
assert sample.shape[1] == len(ds.variables)
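The effect of the open and closed ends in the example above can be reproduced with plain datetime objects. This is only a sketch of the selection rule, not the library's implementation, and the observation dates are made up for illustration.

```python
from datetime import datetime, timedelta

reference_date = datetime(2020, 1, 1, 0, 0, 0)

# Window (-3h,0] applied to the reference date.
window_start = reference_date + timedelta(hours=-3)
window_end = reference_date + timedelta(hours=0)

# Made-up observation dates around the window boundaries.
observation_dates = [
    datetime(2019, 12, 31, 20, 59, 59),  # before the window
    datetime(2019, 12, 31, 21, 0, 0),    # on the open start bound: excluded
    datetime(2019, 12, 31, 21, 0, 1),
    datetime(2019, 12, 31, 23, 30, 0),
    datetime(2020, 1, 1, 0, 0, 0),       # on the closed end bound: included
    datetime(2020, 1, 1, 0, 0, 1),       # after the window
]

# Strictly after the start, up to and including the end.
in_window = [window_start < date <= window_end for date in observation_dates]
selected = [date for date, keep in zip(observation_dates, in_window) if keep]

print(len(selected))  # 3
```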
The whole dataset can also be iterated over using a for loop:
for sample in ds:
    assert len(sample.shape) == 2
    assert sample.shape[1] == len(ds.variables)
is equivalent to:
for i in range(len(ds)):
    sample = ds[i]
    assert len(sample.shape) == 2
    assert sample.shape[1] == len(ds.variables)
Auxiliary information
Because tabular data is unstructured, information such as the latitudes, longitudes and dates of the actual data cannot be provided at the dataset level. Instead, it is provided at the sample level: when you access a sample, you can also access the corresponding latitudes, longitudes, dates, etc. as attributes of the sample:
sample = ds[42]
number_of_observations_in_window = sample.shape[0]
# Returns the corresponding latitudes
sample.latitudes
assert len(sample.latitudes) == number_of_observations_in_window
# Returns the corresponding longitudes
sample.longitudes
assert len(sample.longitudes) == number_of_observations_in_window
# Returns the corresponding row dates
sample.dates
assert len(sample.dates) == number_of_observations_in_window
# Returns the reference date of the window
sample.reference_date
assert sample.reference_date == ds.start_date + 42 * ds.frequency
# Returns the timedeltas in seconds relative to the reference_date
sample.timedeltas
assert len(sample.timedeltas) == number_of_observations_in_window
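To illustrate how the timedeltas relate to the row dates and the reference date, here is a small sketch with made-up values. It assumes that each timedelta is the observation date minus the reference date, so that observations before the reference date get negative values. The sign convention and the integer type are assumptions and are not confirmed by this page.

```python
from datetime import datetime

reference_date = datetime(2020, 1, 1, 0, 0, 0)

# Made-up observation dates falling inside the window (-3h,0].
dates = [
    datetime(2019, 12, 31, 21, 30, 0),
    datetime(2019, 12, 31, 23, 0, 0),
    datetime(2020, 1, 1, 0, 0, 0),
]

# Assumed convention: observation date minus reference date, in seconds.
timedeltas = [int((date - reference_date).total_seconds()) for date in dates]

print(timedeltas)  # [-9000, -3600, 0]
```

Under this convention, every value for the window (-3h,0] lies between -10800 (excluded) and 0 (included).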
Slices
When slicing the dataset, the same rules apply as for indexing with an
integer, but you can recover the individual samples using the boundaries
attribute of the resulting array. The boundaries attribute is a list
of slice objects that can be used to access the samples in the
result. You can also retrieve the reference dates with the
reference_dates attribute of the result.
import numpy as np

samples = ds[10:30]
assert len(samples.boundaries) == 20
assert len(samples.reference_dates) == 20
# Each boundary selects the rows of one sample, starting with ds[10]
for i, b in enumerate(samples.boundaries, start=10):
    sample = samples[b]
    assert np.array_equal(sample, ds[i])
The latitudes, longitudes, dates, timedeltas, etc.
attributes of the resulting array are the concatenation of the
corresponding attributes of the samples.
assert np.array_equal(samples.latitudes, np.concatenate([ds[i].latitudes for i in range(10,30)]))
assert np.array_equal(samples.longitudes, np.concatenate([ds[i].longitudes for i in range(10,30)]))
assert np.array_equal(samples.dates, np.concatenate([ds[i].dates for i in range(10,30)]))
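The relationship between concatenated attributes and boundaries can be sketched without the library. In the example below, plain Python lists stand in for the NumPy arrays, the values are made up, and the way the slice objects are derived from the sample lengths is an assumption about how such boundaries could be built, not a description of the actual implementation.

```python
from itertools import accumulate

# Made-up latitudes for three consecutive samples; the second window
# happens to contain no observations.
per_sample_latitudes = [[10.0, 11.0], [], [20.0, 21.0, 22.0]]

# Concatenate the per-sample values, as done for a sliced result.
concatenated = [value for latitudes in per_sample_latitudes for value in latitudes]

# One slice per sample, built from the running total of sample lengths.
ends = list(accumulate(len(latitudes) for latitudes in per_sample_latitudes))
starts = [0] + ends[:-1]
boundaries = [slice(start, end) for start, end in zip(starts, ends)]

# Each slice recovers the values of one sample from the concatenation.
for boundary, expected in zip(boundaries, per_sample_latitudes):
    assert concatenated[boundary] == expected
```

Because each boundary is an ordinary slice object, the same slice can index the data rows and any of the concatenated attributes.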
Warning
The two code snippets below are not equivalent:
samples = ds[10:30]
boundaries = samples.boundaries
latitudes = samples.latitudes[boundaries[1]]
and:
samples = ds[10:30]
boundaries = samples.boundaries
latitudes = samples[boundaries[1]].latitudes
Only the first construct will work.
Examples
The following examples show various ways to define the window and the frequency parameters when opening a tabular dataset.
In the first example, the window width (6h) matches the frequency (6h), so the whole dataset is covered:
ds = open_dataset(
    path,
    start=1979,
    end=2020,
    window="(-6h,0]",
    frequency="6h",
)
The schema below illustrates the window and frequency parameters in this case:
In the second example, the window width (6h) is narrower than the frequency (12h), so there are gaps between the windows:
ds = open_dataset(
    path,
    start=1979,
    end=2020,
    window="(-5h,+1h]",
    frequency="12h",
)
As illustrated in the schema below, there are gaps between the windows:
In the third example, the window width (8h) is wider than the frequency (6h), so there are overlaps between the windows:
ds = open_dataset(
    path,
    start=1979,
    end=2020,
    window="(-5h,+3h]",
    frequency="6h",
)
As illustrated in the schema below, there are overlaps between the windows:
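The relationship between window width and frequency in the three examples above can also be checked with a little arithmetic. The function coverage below is a hypothetical helper written for this illustration and is not part of the library.

```python
from datetime import timedelta


def coverage(window_start, window_end, frequency):
    """Compare the window width with the frequency.

    Returns a (kind, difference) pair, where kind is one of
    "contiguous", "gaps" or "overlaps".
    """
    width = window_end - window_start
    if width == frequency:
        return "contiguous", timedelta(0)
    if width < frequency:
        return "gaps", frequency - width
    return "overlaps", width - frequency


h = timedelta(hours=1)

first = coverage(-6 * h, 0 * h, 6 * h)    # window (-6h,0],   frequency 6h
second = coverage(-5 * h, 1 * h, 12 * h)  # window (-5h,+1h], frequency 12h
third = coverage(-5 * h, 3 * h, 6 * h)    # window (-5h,+3h], frequency 6h

print(first[0], second[0], third[0])  # contiguous gaps overlaps
```

In the second example each gap lasts 6 hours, and in the third example consecutive windows share 2 hours of observations.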