input.catalog.ValenspyEsmDatastore#
- class ValenspyEsmDatastore(*args, **kwargs)[source]#
Bases: esm_datastore
Subclass of intake_esm.esm_datastore for ValEnsPy.
This extends the esm_datastore class with additional search functionality for time-based searching using the time_period column.
Methods
__init__(*args, **kwargs) – Intake Catalog representing an ESM Collection.
close() – Close open resources corresponding to this data source.
configure_new(**kwargs) – Create a new instance of this source with altered arguments.
describe() – Description from the entry spec.
discover() – Open resource and populate the source attributes.
filter(func) – Create a Catalog of a subset of entries based on a condition.
force_reload() – Imperative reload data now.
from_dict(entries, **kwargs) – Create Catalog from the given set of entries.
get(**kwargs) – Create a new instance of this source with altered arguments.
items() – Get an iterator over (key, source) tuples for the catalog entries.
keys() – Get keys for the catalog entries.
keys_info() – Get keys for the catalog entries and their metadata.
nunique() – Count distinct observations across dataframe columns in the catalog.
pop(key) – Remove entry from catalog and return it.
read() – Load entire dataset into a container and return it.
read_chunked() – Return iterator over container fragments of data source.
read_partition(i) – Return a part of the data corresponding to the i-th partition.
reload() – Reload catalog if sufficient time has passed.
save(url[, storage_options]) – Output this catalog to a file as YAML.
search([require_all_on]) – Search for entries in the catalog.
serialize(name[, directory, catalog_type, ...]) – Serialize catalog to corresponding json and csv files.
to_dask(**kwargs) – Convert result to an xarray dataset.
to_dataset_dict([xarray_open_kwargs, ...]) – Load catalog entries into a dictionary of xarray datasets.
to_datatree([levels]) – Load catalog entries into a tree of xarray datasets.
to_spark() – Provide an equivalent data object in Apache Spark.
unique() – Return unique values for given columns in the catalog.
values() – Get an iterator over the sources for catalog entries.
walk([sofar, prefix, depth]) – Get all entries in this catalog and sub-catalogs.
yaml() – Return YAML representation of this data-source.
Attributes
df – Return pandas DataFrame.
has_been_persisted – The base class does not interact with persistence.
is_persisted – The base class does not interact with persistence.
key_template – Return string template used to create catalog entry keys.
- _close()#
Subclasses should close all open resources
- _create_derived_variables(datasets, skip_on_error)#
- _entry = None#
- _get_cache(urlpath)#
The base class does not interact with caches
- _get_entries() dict[str, ESMDataSource] #
- _get_entry(name)#
- _get_partition(i)#
Subclasses should return a container object for this partition
This function will never be called with an out-of-range value for i.
- _get_schema()#
Subclasses should return an instance of base.Schema
- _ipython_display_()#
Display the entry as a rich object in an IPython session
- _ipython_key_completions_()#
- _load()#
Override this: load catalog entries
- _load_metadata()#
load metadata only if needed
- _make_entries_container()#
Subclasses may override this to return some other dict-like.
See RemoteCatalog below for the motivating example for this hook. This is typically useful for large Catalogs backed by dynamic resources such as databases.
The object returned by this method must implement:
__iter__() -> an iterator of entry names
__getitem__(key) -> an Entry
items() -> an iterator of (key, Entry) pairs
For best performance the object should also implement:
__len__() -> int
__contains__(key) -> boolean
If __len__ or __contains__ are not implemented, intake will fall back on iterating through the entire catalog to compute its length or check for containment, which may be expensive on large catalogs.
- _repr_html_() str #
Return an html representation for the catalog object. Mainly for IPython notebook
- _schema = None#
- _validate_derivedcat() None #
- _yaml()#
- auth = None#
- cat = None#
- property classname#
- close()#
Close open resources corresponding to this data source.
- configure_new(**kwargs)#
Create a new instance of this source with altered arguments
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
- container = 'xarray'#
- describe()#
Description from the entry spec
- description = None#
- property df: DataFrame#
Return pandas DataFrame.
- discover()#
Open resource and populate the source attributes.
- dtype = None#
- property entry#
- filter(func)#
Create a Catalog of a subset of entries based on a condition
Warning
This function operates on CatalogEntry objects not DataSource objects.
Note
Note that, whatever specific class this is performed on, the returned instance is a Catalog. The entries are passed unmodified, so they will still reference the original catalog instance and include its details, such as directory.
- Parameters:
func (function) – This should take a CatalogEntry and return True or False. Those items returning True will be included in the new Catalog, with the same entry names
- Returns:
New catalog with Entries that still refer to their parents
- Return type:
Catalog
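The predicate semantics can be sketched in plain Python; the toy name-to-entry mapping below is hypothetical (a real catalog holds CatalogEntry objects, not dicts):

```python
# Conceptual sketch of Catalog.filter: keep entries whose predicate returns
# True, preserving the original entry names (toy dicts stand in for entries).
entries = {
    "ocn.20C": {"component": "ocn"},
    "atm.20C": {"component": "atm"},
}

def filter_entries(entries, func):
    # As with filter(), items returning True are kept under the same names.
    return {name: e for name, e in entries.items() if func(e)}

kept = filter_entries(entries, lambda e: e["component"] == "ocn")
print(sorted(kept))  # only the ocean entry passes the predicate
```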
- force_reload()#
Imperative reload data now
- classmethod from_dict(entries, **kwargs)#
Create Catalog from the given set of entries
- Parameters:
entries (dict-like) – A mapping of name:entry which supports dict-like functionality, e.g., is derived from collections.abc.Mapping.
kwargs (passed on to the constructor) – Things like metadata, name; see __init__.
- Return type:
Catalog instance
- get(**kwargs)#
Create a new instance of this source with altered arguments
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
- property gui#
- property has_been_persisted#
The base class does not interact with persistence
- property is_persisted#
The base class does not interact with persistence
- items()#
Get an iterator over (key, source) tuples for the catalog entries.
- property key_template: str#
Return string template used to create catalog entry keys
- Returns:
string template used to create catalog entry keys
- Return type:
str
- keys() list[str] #
Get keys for the catalog entries
- Returns:
keys for the catalog entries
- Return type:
list
- keys_info() DataFrame #
Get keys for the catalog entries and their metadata
- Returns:
keys for the catalog entries and their metadata
- Return type:
pandas.DataFrame
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
key
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h
- property kwargs#
- metadata = {}#
- name = 'esm_datastore'#
- npartitions = 0#
- nunique() Series #
Count distinct observations across dataframe columns in the catalog.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
- on_server = False#
- partition_access = False#
- pop(key)#
Remove entry from catalog and return it
This relies on the _entries attribute being mutable, which it normally is. Note that if a catalog automatically reloads, any entry removed here may soon reappear
- Parameters:
key (str) – Key to give the entry in the cat
- read()#
Load entire dataset into a container and return it
- read_chunked()#
Return iterator over container fragments of data source
- read_partition(i)#
Return a part of the data corresponding to i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- reload()#
Reload catalog if sufficient time has passed
- save(url, storage_options=None)#
Output this catalog to a file as YAML
- Parameters:
url (str) – Location to save to, perhaps remote
storage_options (dict) – Extra arguments for the file-system
- search(require_all_on: str | list[str] | None = None, **query)[source]#
Search for entries in the catalog.
Standard search function of the intake_esm.esm_datastore class, extended with time-based searching based on the time_period column.
- Parameters:
require_all_on (str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
**query – keyword arguments corresponding to user’s query to execute against the dataframe.
See also
intake_esm.esm_datastore.search
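The time-based filtering is, conceptually, an overlap test between each row's time_period and a requested window. The pandas sketch below illustrates that idea only; the 'start-end' column format and the variable names are assumptions, not the actual ValenspyEsmDatastore query syntax:

```python
import pandas as pd

# Conceptual sketch only: overlap filtering on a time_period column of the
# assumed form 'start-end' (years). The real search() interface may differ.
df = pd.DataFrame({
    "source_id": ["A", "B", "C"],
    "time_period": ["1950-2000", "1990-2020", "2030-2100"],
})
start, end = 1980, 2010
bounds = df["time_period"].str.split("-", expand=True).astype(int)
# A row overlaps the window when it starts no later than the window's end
# and ends no earlier than the window's start.
overlaps = (bounds[0] <= end) & (bounds[1] >= start)
matched = df.loc[overlaps, "source_id"].tolist()
print(matched)  # A and B overlap 1980-2010
```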
- serialize(name: Annotated[str, Strict(strict=True)], directory: Annotated[Path, PathType(path_type=dir)] | Annotated[str, Strict(strict=True)] | None = None, catalog_type: str = 'dict', to_csv_kwargs: dict[Any, Any] | None = None, json_dump_kwargs: dict[Any, Any] | None = None, storage_options: dict[str, Any] | None = None) None #
Serialize catalog to corresponding json and csv files.
- Parameters:
name (str) – name to use when creating ESM catalog json file and csv catalog.
directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory
catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv method.
json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump function.
storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
- shape = None#
- to_dask(**kwargs) Dataset #
Convert result to an xarray dataset.
This is only possible if the search returned exactly one result.
- Parameters:
kwargs (dict) – Parameters forwarded to to_dataset_dict.
- Return type:
Dataset
- to_dataset_dict(xarray_open_kwargs: dict[str, Any] | None = None, xarray_combine_by_coords_kwargs: dict[str, Any] | None = None, preprocess: Callable | None = None, storage_options: dict[Annotated[str, Strict(strict=True)], Any] | None = None, progressbar: Annotated[bool, Strict(strict=True)] | None = None, aggregate: Annotated[bool, Strict(strict=True)] | None = None, skip_on_error: Annotated[bool, Strict(strict=True)] = False, **kwargs) dict[str, Dataset] #
Load catalog entries into a dictionary of xarray datasets.
Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.
- Parameters:
xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
- Returns:
dsets – A dictionary of xarray Dataset objects.
- Return type:
dict
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
- to_datatree(levels: list[str] = None, **kwargs)[source]#
Load catalog entries into a tree of xarray datasets.
- Parameters:
xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.
- Returns:
dsets – A tree of xarray Dataset objects.
- Return type:
DataTree
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
- to_spark()#
Provide an equivalent data object in Apache Spark
The mapping of python-oriented data containers to Spark ones will be imperfect, and only a small number of drivers are expected to be able to produce Spark objects. The standard arguments may be translated, unsupported or ignored, depending on the specific driver.
This method requires the package intake-spark
- unique() Series #
Return unique values for given columns in the catalog.
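A rough illustration of what unique() computes, using a plain DataFrame in place of the catalog (toy columns; the per-column collection into a Series of lists is an assumption for illustration):

```python
import pandas as pd

# Sketch of unique() semantics: unique values per catalog column, collected
# into a pandas Series of lists (toy catalog columns, not a real datastore).
df = pd.DataFrame({
    "source_id": ["A", "A", "B"],
    "table_id": ["Amon", "Omon", "Amon"],
})
uniques = pd.Series({col: df[col].unique().tolist() for col in df.columns})
print(uniques["source_id"])  # unique sources in first-seen order
```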
- values()#
Get an iterator over the sources for catalog entries.
- property version#
- walk(sofar=None, prefix=None, depth=2)#
Get all entries in this catalog and sub-catalogs
- Parameters:
sofar (dict or None) – Within recursion, use this dict for output
prefix (list of str or None) – Names of levels already visited
depth (int) – Number of levels to descend; needed to truncate circular references and for cleaner output
- Returns:
Dict where the keys are the entry names in dotted syntax, and the values are entry instances.
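The dotted-key flattening can be sketched in plain Python; the nested toy dicts below stand in for sub-catalogs and sources (hypothetical data, not the intake implementation):

```python
# Conceptual sketch of walk(): flatten nested catalogs into a dict keyed by
# dotted entry names, descending at most 'depth' levels.
tree = {"cmip6": {"ocn": "src1", "atm": "src2"}, "era5": "src3"}

def walk(node, sofar=None, prefix=None, depth=2):
    sofar = {} if sofar is None else sofar
    prefix = [] if prefix is None else prefix
    for name, child in node.items():
        path = prefix + [name]
        if isinstance(child, dict) and depth > 1:
            # Recurse into a sub-catalog, truncating at the depth limit.
            walk(child, sofar, path, depth - 1)
        else:
            sofar[".".join(path)] = child
    return sofar

flat = walk(tree)
print(flat)  # dotted keys: cmip6.ocn, cmip6.atm, era5
```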
- yaml()#
Return YAML representation of this data-source
The output may be roughly appropriate for inclusion in a YAML catalog. This is a best-effort implementation