input.catalog.ValenspyEsmDatastore#

class ValenspyEsmDatastore(*args, **kwargs)[source]#

Bases: esm_datastore

Subclass of intake_esm.esm_datastore for ValEnsPy.

This extends the esm_datastore class with additional search functionality for time-based searching using the time_period column.

__init__(*args, **kwargs)[source]#

Intake Catalog representing an ESM Collection.

Methods

__init__(*args, **kwargs)

Intake Catalog representing an ESM Collection.

close()

Close open resources corresponding to this data source.

configure_new(**kwargs)

Create a new instance of this source with altered arguments

describe()

Description from the entry spec

discover()

Open resource and populate the source attributes.

filter(func)

Create a Catalog of a subset of entries based on a condition

force_reload()

Imperatively reload the data now

from_dict(entries, **kwargs)

Create Catalog from the given set of entries

get(**kwargs)

Create a new instance of this source with altered arguments

items()

Get an iterator over (key, source) tuples for the catalog entries.

keys()

Get keys for the catalog entries

keys_info()

Get keys for the catalog entries and their metadata

nunique()

Count distinct observations across dataframe columns in the catalog.

pop(key)

Remove entry from catalog and return it

read()

Load entire dataset into a container and return it

read_chunked()

Return iterator over container fragments of data source

read_partition(i)

Return a part of the data corresponding to i-th partition.

reload()

Reload catalog if sufficient time has passed

save(url[, storage_options])

Output this catalog to a file as YAML

search([require_all_on])

Search for entries in the catalog.

serialize(name[, directory, catalog_type, ...])

Serialize catalog to corresponding json and csv files.

to_dask(**kwargs)

Convert result to an xarray dataset.

to_dataset_dict([xarray_open_kwargs, ...])

Load catalog entries into a dictionary of xarray datasets.

to_datatree([levels])

Load catalog entries into a tree of xarray datasets.

to_spark()

Provide an equivalent data object in Apache Spark

unique()

Return unique values for given columns in the catalog.

values()

Get an iterator over the sources for catalog entries.

walk([sofar, prefix, depth])

Get all entries in this catalog and sub-catalogs

yaml()

Return YAML representation of this data-source

Attributes

auth

cat

classname

container

description

df

Return pandas DataFrame.

dtype

entry

gui

has_been_persisted

The base class does not interact with persistence

is_persisted

The base class does not interact with persistence

key_template

Return string template used to create catalog entry keys

kwargs

metadata

name

npartitions

on_server

partition_access

shape

version

_close()#

Subclasses should close all open resources

_create_derived_variables(datasets, skip_on_error)#
_entry = None#
_get_cache(urlpath)#

The base class does not interact with caches

_get_entries() dict[str, ESMDataSource]#
_get_entry(name)#
_get_partition(i)#

Subclasses should return a container object for this partition

This function will never be called with an out-of-range value for i.

_get_schema()#

Subclasses should return an instance of base.Schema

_ipython_display_()#

Display the entry as a rich object in an IPython session

_ipython_key_completions_()#
_load()#

Override this: load catalog entries

_load_metadata()#

load metadata only if needed

_make_entries_container()#

Subclasses may override this to return some other dict-like.

See RemoteCatalog below for the motivating example for this hook. This is typically useful for large Catalogs backed by dynamic resources such as databases.

The object returned by this method must implement:

  • __iter__() -> an iterator of entry names

  • __getitem__(key) -> an Entry

  • items() -> an iterator of (key, Entry) pairs

For best performance the object should also implement:

  • __len__() -> int

  • __contains__(key) -> boolean

If __len__ or __contains__ is not implemented, intake will fall back on iterating through the entire catalog to compute its length or check for containment, which may be expensive on large catalogs.
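As a minimal sketch of the required interface, the dict-like object can derive most of it from collections.abc.Mapping; the backing store here is a plain dict standing in for a dynamic resource such as a database, and the class name is hypothetical:

```python
from collections.abc import Mapping


class LazyEntries(Mapping):
    """Hypothetical dict-like entries container backed by a dynamic
    resource (a plain dict here, standing in for e.g. a database)."""

    def __init__(self, backend):
        self._backend = backend

    def __iter__(self):
        # Required: iterator of entry names
        return iter(self._backend)

    def __getitem__(self, key):
        # Required: return an Entry for the given name
        return self._backend[key]

    def __len__(self):
        # Recommended fast path: avoids a full iteration by intake
        return len(self._backend)

    # Mapping supplies items() and __contains__ from the methods above.


entries = LazyEntries({"ds1": object(), "ds2": object()})
print(len(entries), "ds1" in entries)  # 2 True
```

A subclass would return such an object from _make_entries_container() instead of the default dict.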

_repr_html_() str#

Return an HTML representation of the catalog object, mainly for IPython notebooks

_schema = None#
_validate_derivedcat() None#
_yaml()#
auth = None#
cat = None#
property classname#
close()#

Close open resources corresponding to this data source.

configure_new(**kwargs)#

Create a new instance of this source with altered arguments

Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.

Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.

container = 'xarray'#
describe()#

Description from the entry spec

description = None#
property df: DataFrame#

Return pandas DataFrame.

discover()#

Open resource and populate the source attributes.

dtype = None#
property entry#
filter(func)#

Create a Catalog of a subset of entries based on a condition

Warning

This function operates on CatalogEntry objects not DataSource objects.

Note

Note that, whatever specific class this is performed on, the returned instance is a Catalog. The entries are passed unmodified, so they will still reference the original catalog instance and include its details such as its directory.

Parameters:

func (function) – This should take a CatalogEntry and return True or False. Those items returning True will be included in the new Catalog, with the same entry names

Returns:

New catalog with Entries that still refer to their parents

Return type:

Catalog
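The predicate passed to filter receives CatalogEntry objects, not data sources. A minimal sketch of such a predicate, assuming entry metadata carries a realm field (that field name, and the FakeEntry stand-in used to exercise it, are illustrative assumptions, not part of the intake API):

```python
class FakeEntry:
    """Stand-in for a CatalogEntry, for illustration only."""

    def __init__(self, metadata):
        self._metadata = metadata

    def describe(self):
        # Real CatalogEntry objects expose their spec via describe()
        return {"metadata": self._metadata}


def keep_ocean(entry) -> bool:
    """Predicate for cat.filter: keep entries whose metadata marks
    them as ocean output. The 'realm' key is a hypothetical field."""
    return entry.describe().get("metadata", {}).get("realm") == "ocean"


print(keep_ocean(FakeEntry({"realm": "ocean"})))  # True
print(keep_ocean(FakeEntry({"realm": "atmos"})))  # False

# Assumed usage against a real catalog:
#     ocean_cat = cat.filter(keep_ocean)
```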

force_reload()#

Imperatively reload the data now

classmethod from_dict(entries, **kwargs)#

Create Catalog from the given set of entries

Parameters:
  • entries (dict-like) – A mapping of name:entry which supports dict-like functionality, e.g., is derived from collections.abc.Mapping.

  • kwargs (passed on to the constructor) – Things like metadata, name; see __init__.

Return type:

Catalog instance

get(**kwargs)#

Create a new instance of this source with altered arguments

Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.

Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.

property gui#
property has_been_persisted#

The base class does not interact with persistence

property is_persisted#

The base class does not interact with persistence

items()#

Get an iterator over (key, source) tuples for the catalog entries.

property key_template: str#

Return string template used to create catalog entry keys

Returns:

string template used to create catalog entry keys

Return type:

str

keys() list[str]#

Get keys for the catalog entries

Returns:

keys for the catalog entries

Return type:

list

keys_info() DataFrame#

Get keys for the catalog entries and their metadata

Returns:

keys for the catalog entries and their metadata

Return type:

pandas.DataFrame

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
key
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h
property kwargs#
metadata = {}#
name = 'esm_datastore'#
npartitions = 0#
nunique() Series#

Count distinct observations across dataframe columns in the catalog.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
on_server = False#
partition_access = False#
pop(key)#

Remove entry from catalog and return it

This relies on the _entries attribute being mutable, which it normally is. Note that if a catalog automatically reloads, any entry removed here may soon reappear.

Parameters:

key (str) – Key of the entry to remove from the catalog

read()#

Load entire dataset into a container and return it

read_chunked()#

Return iterator over container fragments of data source

read_partition(i)#

Return a part of the data corresponding to i-th partition.

By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.

reload()#

Reload catalog if sufficient time has passed

save(url, storage_options=None)#

Output this catalog to a file as YAML

Parameters:
  • url (str) – Location to save to, perhaps remote

  • storage_options (dict) – Extra arguments for the file-system

search(require_all_on: str | list[str] | None = None, **query)[source]#

Search for entries in the catalog.

The standard search function of the intake_esm.esm_datastore class, extended with time-based searching using the time_period column.

Parameters:
  • require_all_on (str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

  • **query – keyword arguments corresponding to user’s query to execute against the dataframe.

See also

intake_esm.esm_datastore.search
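The core of a time-based search is an interval-overlap test against the catalog's time_period column. The sketch below assumes that column holds "YYYY-YYYY" strings; the actual format ValEnsPy uses is not documented here, and the plain list of dicts stands in for the catalog's underlying pandas DataFrame:

```python
def overlaps(period: str, start: int, end: int) -> bool:
    """True if a 'YYYY-YYYY' period string overlaps [start, end].
    Two intervals overlap when each starts before the other ends."""
    p_start, p_end = (int(y) for y in period.split("-"))
    return p_start <= end and start <= p_end


# Stand-in for the catalog's DataFrame rows (illustrative values)
rows = [
    {"source_id": "A", "time_period": "1850-1949"},
    {"source_id": "B", "time_period": "1950-2000"},
    {"source_id": "C", "time_period": "2001-2100"},
]

# Entries B and C both intersect the query window 1990-2010
hits = [r["source_id"] for r in rows if overlaps(r["time_period"], 1990, 2010)]
print(hits)  # ['B', 'C']
```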

serialize(name: Annotated[str, Strict(strict=True)], directory: Annotated[Path, PathType(path_type=dir)] | Annotated[str, Strict(strict=True)] | None = None, catalog_type: str = 'dict', to_csv_kwargs: dict[Any, Any] | None = None, json_dump_kwargs: dict[Any, Any] | None = None, storage_options: dict[str, Any] | None = None) None#

Serialize catalog to corresponding json and csv files.

Parameters:
  • name (str) – name to use when creating ESM catalog json file and csv catalog.

  • directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory

  • catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.

  • to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv method.

  • json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump function.

  • storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
shape = None#
to_dask(**kwargs) Dataset#

Convert result to an xarray dataset.

This is only possible if the search returned exactly one result.

Parameters:

kwargs (dict) – Parameters forwarded to to_dataset_dict.

Return type:

Dataset

to_dataset_dict(xarray_open_kwargs: dict[str, Any] | None = None, xarray_combine_by_coords_kwargs: dict[str, Any] | None = None, preprocess: Callable | None = None, storage_options: dict[Annotated[str, Strict(strict=True)], Any] | None = None, progressbar: Annotated[bool, Strict(strict=True)] | None = None, aggregate: Annotated[bool, Strict(strict=True)] | None = None, skip_on_error: Annotated[bool, Strict(strict=True)] = False, **kwargs) dict[str, Dataset]#

Load catalog entries into a dictionary of xarray datasets.

Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.

Parameters:
  • xarray_open_kwargs (dict) – Keyword arguments to pass to open_dataset function

  • xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to combine_by_coords function.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – fsspec Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

  • skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.

Returns:

dsets – A dictionary of xarray Dataset.

Return type:

dict

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
to_datatree(levels: list[str] = None, **kwargs)[source]#

Load catalog entries into a tree of xarray datasets.

Parameters:
  • xarray_open_kwargs (dict) – Keyword arguments to pass to open_dataset function

  • xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to combine_by_coords function.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

  • skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.

  • levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.

Returns:

dsets – A tree of xarray Dataset.

Return type:

DataTree

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
to_spark()#

Provide an equivalent data object in Apache Spark

The mapping of python-oriented data containers to Spark ones will be imperfect, and only a small number of drivers are expected to be able to produce Spark objects. The standard arguments may be translated, unsupported or ignored, depending on the specific driver.

This method requires the package intake-spark

unique() Series#

Return unique values for given columns in the catalog.

values()#

Get an iterator over the sources for catalog entries.

property version#
walk(sofar=None, prefix=None, depth=2)#

Get all entries in this catalog and sub-catalogs

Parameters:
  • sofar (dict or None) – Within recursion, use this dict for output

  • prefix (list of str or None) – Names of levels already visited

  • depth (int) – Number of levels to descend; needed to truncate circular references and for cleaner output

Returns:

Dict where the keys are the entry names in dotted syntax, and the values are entry instances.
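The dotted-key flattening walk() performs can be sketched with nested dicts standing in for catalogs and sub-catalogs (the real method operates on Catalog and entry objects, so this is an illustration of the traversal only):

```python
def walk_sketch(cat, sofar=None, prefix=None, depth=2):
    """Sketch of walk(): flatten nested 'catalogs' into a dict keyed
    by dotted entry names, descending at most `depth` levels."""
    out = {} if sofar is None else sofar
    pre = [] if prefix is None else prefix
    for name, entry in cat.items():
        if isinstance(entry, dict) and depth > 1:
            # Recurse into a 'sub-catalog', remembering visited levels
            walk_sketch(entry, out, pre + [name], depth - 1)
        else:
            # Leaf entry: join the visited level names with dots
            out[".".join(pre + [name])] = entry
    return out


tree = {"ocean": {"sst": "entry1", "sss": "entry2"}, "readme": "entry3"}
print(sorted(walk_sketch(tree)))  # ['ocean.sss', 'ocean.sst', 'readme']
```

The depth limit is what truncates circular references in real catalogs.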

yaml()#

Return YAML representation of this data-source

The output may be roughly appropriate for inclusion in a YAML catalog. This is a best-effort implementation.