input.manager.InputManager

class InputManager(machine: str, datasets_info: dict = None, datasets_subset: str | list = None, description=None, input_convertors: dict = {'ALARO_K': <valenspy.input.converter.InputConverter object>, 'CCLM': <valenspy.input.converter.InputConverter object>, 'CLIMATE_GRID': <valenspy.input.converter.InputConverter object>, 'EOBS': <valenspy.input.converter.InputConverter object>, 'ERA5': <valenspy.input.converter.InputConverter object>, 'ERA5-Land': <valenspy.input.converter.InputConverter object>, 'MAR': <valenspy.input.converter.InputConverter object>, 'RADCLIM': <valenspy.input.converter.InputConverter object>}, esmcat_data: dict = {'aggregation_control': {'aggregations': [{'attribute_name': 'time_period_start', 'options': {'dim': 'time'}, 'type': 'join_existing'}, {'attribute_name': 'variable_id', 'type': 'union'}], 'groupby_attrs': ['source_id', 'source_type', 'domain_id', 'experiment_id', 'version', 'resolution', 'frequency', 'driving_source_id', 'institution_id', 'realization', 'post_processing'], 'variable_column_name': 'variable_id'}, 'assets': {'column_name': 'path', 'format': 'netcdf'}, 'attributes': [], 'esmcat_version': '0.1.0', 'id': 'test'}, xarray_open_kwargs: dict = {}, xarray_combine_by_coords_kwargs: dict = {}, intake_esm_kwargs: dict = {'sep': '/'})

Bases: object

A class to find, manage, preprocess and load input data for ValEnsPy.

The InputManager class consists of a ValEnsPy-specific intake-esm catalog (ValenspyEsmDatastore) and a CatalogBuilder. The CatalogBuilder is used to create the catalog, a DataFrame (df) with dataset information per file, using minimal information about the datasets and their path structure. This catalog is then used to create an esm_datastore (ValenspyEsmDatastore), which can be used to search and load the datasets. The InputManager class also provides a preprocessing function, based on the input convertors, to convert the datasets to the ValEnsPy compliant format.

__init__(machine: str, datasets_info: dict = None, datasets_subset: str | list = None, description=None, input_convertors: dict = {'ALARO_K': <valenspy.input.converter.InputConverter object>, 'CCLM': <valenspy.input.converter.InputConverter object>, 'CLIMATE_GRID': <valenspy.input.converter.InputConverter object>, 'EOBS': <valenspy.input.converter.InputConverter object>, 'ERA5': <valenspy.input.converter.InputConverter object>, 'ERA5-Land': <valenspy.input.converter.InputConverter object>, 'MAR': <valenspy.input.converter.InputConverter object>, 'RADCLIM': <valenspy.input.converter.InputConverter object>}, esmcat_data: dict = {'aggregation_control': {'aggregations': [{'attribute_name': 'time_period_start', 'options': {'dim': 'time'}, 'type': 'join_existing'}, {'attribute_name': 'variable_id', 'type': 'union'}], 'groupby_attrs': ['source_id', 'source_type', 'domain_id', 'experiment_id', 'version', 'resolution', 'frequency', 'driving_source_id', 'institution_id', 'realization', 'post_processing'], 'variable_column_name': 'variable_id'}, 'assets': {'column_name': 'path', 'format': 'netcdf'}, 'attributes': [], 'esmcat_version': '0.1.0', 'id': 'test'}, xarray_open_kwargs: dict = {}, xarray_combine_by_coords_kwargs: dict = {}, intake_esm_kwargs: dict = {'sep': '/'})

Initialize an InputManager.

Parameters:

machine (str) – The name of the catalog. If datasets_info is not passed, it is used to load the dataset information from the built-in dataset_paths.yaml file.
datasets_info (dict | str, optional) –
A dictionary containing the datasets and the dataset information needed to build the catalog. This can be a dictionary or a path to a YAML file. Default is None, in which case the built-in dataset information for the given machine is used, if it exists. The dictionary should contain dataset names as keys and their dataset information as values. The dataset information should contain the following keys:

- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format: <identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
- meta_data: A dictionary containing metadata for the dataset.
datasets_subset (str | list, optional) – The name of the dataset(s) to load. If None, all datasets are loaded.
description (str, optional) – A description of the catalog. Default is None. This is used to create the description in the intake-esm catalog.
input_convertors (dict, optional) – A dictionary containing input convertors for the datasets. The keys are dataset names and the values are functions that convert the dataset to ValEnsPy compliant format. Default is INPUT_CONVERTORS, the set of predefined input convertors.
esmcat_data (dict, optional) – A dictionary containing the description of esmcat data to create the intake-esm catalog. Default is esmcat_default_data. See intake_esm.esm_datastore.from_dict for more information.
xarray_open_kwargs (dict, optional) – A dictionary containing default arguments to pass to xarray_open_kwargs in intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
xarray_combine_by_coords_kwargs (dict, optional) – A dictionary containing default arguments to pass to xarray_combine_by_coords_kwargs in intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
intake_esm_kwargs (dict, optional) – A dictionary containing additional arguments for the creation of the intake_esm catalog. Default is {'sep': '/'}. See intake_esm.esm_datastore for more information.
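To illustrate the datasets_info layout described above, here is a minimal sketch of such a dictionary with a small check for the expected keys. The dataset name, root path, pattern, metadata values and the helper find_missing_keys are hypothetical and not part of ValEnsPy; how ValEnsPy itself validates this information is not shown in this page.

```python
# Hypothetical datasets_info dictionary; every name, path and value is made up.
datasets_info = {
    "MY_DATASET": {
        "root": "/data/my_dataset",
        "pattern": "<frequency>/<variable_id>/<source_id>_<year>.nc",
        "meta_data": {"source_type": "observation"},
    }
}

# The three keys listed in the parameter description.
EXPECTED_KEYS = {"root", "pattern", "meta_data"}


def find_missing_keys(info):
    """Map each dataset name to a sorted list of the expected keys it lacks.

    Illustrative helper only; datasets with all expected keys are left out.
    """
    problems = {}
    for name, dataset_info in info.items():
        missing = sorted(EXPECTED_KEYS - set(dataset_info))
        if missing:
            problems[name] = missing
    return problems


problems = find_missing_keys(datasets_info)
```

A dictionary like this would be passed as datasets_info; the same structure could also be stored in a YAML file and passed as a path.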

Methods

`__init__`(machine[, datasets_info, ...])	Initialize an InputManager.
`add_input_convertor`(dataset_name, ...)	Add an input convertor to the InputManager.
`update_catalog_from_dataset_info`(...[, metadata])	Update the catalog with a new dataset.
`update_catalog_from_yaml`(yaml_path)	Update the catalog from a YAML file.

Attributes

`intake_to_xarray_kwargs`	Easy access to the kwargs that can be passed to `intake_esm.esm_datastore.to_dataset_dict`, `intake_esm.esm_datastore.to_dask` and/or `intake_esm.esm_datastore.to_datatree`.
`preprocess`	A preprocessor function to convert the input dataset to ValEnsPy compliant data.
`skipped_files`	The files that were skipped during the catalog creation.

_update_catalog(dataset_name, dataset_info_dict)

Update the catalog (df with dataset information per file) with a new dataset.

The catalog_builder is used to parse the dataset information and update the catalog.

Parameters:

dataset_info_dict (dict) – A dictionary containing dataset information. The keys are dataset names and the values are dictionaries with the following keys:

- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset.
- meta_data: A dictionary containing metadata for the dataset.

add_input_convertor(dataset_name, input_convertor)

Add an input convertor to the InputManager.

property intake_to_xarray_kwargs

Easy access to the kwargs that can be passed to intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.

Three types of kwargs are created:

- preprocess: The preprocessor function which applies the input convertor to the dataset if an input convertor exists (i.e. source_id is in INPUT_CONVERTORS).
- xarray_open_kwargs: The kwargs to be passed to xarray.open_dataset.
- xarray_combine_by_coords_kwargs: The kwargs to be passed to xarray.combine_by_coords.
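The bundle of three kwargs can be pictured as a plain dictionary. The sketch below assumes that the property returns a dict keyed by the three names listed above; the helper build_intake_to_xarray_kwargs and the identity preprocessor are hypothetical stand-ins, not ValEnsPy code.

```python
def build_intake_to_xarray_kwargs(preprocess, xarray_open_kwargs=None,
                                  xarray_combine_by_coords_kwargs=None):
    """Bundle the three kinds of kwargs into one dictionary (hypothetical helper)."""
    return {
        "preprocess": preprocess,
        # Copy the dictionaries so callers cannot mutate stored defaults by accident.
        "xarray_open_kwargs": dict(xarray_open_kwargs or {}),
        "xarray_combine_by_coords_kwargs": dict(xarray_combine_by_coords_kwargs or {}),
    }


def identity(ds):
    """Stand-in preprocessor that returns the dataset unchanged."""
    return ds


kwargs = build_intake_to_xarray_kwargs(identity, xarray_open_kwargs={"chunks": {}})

# With a real intake-esm catalog, such a bundle could then be unpacked, e.g.:
# datasets = catalog.to_dataset_dict(**kwargs)
```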

property preprocess

A preprocessor function to convert the input dataset to ValEnsPy compliant data.

This function applies the input convertor to the dataset if an input convertor exists (i.e. source_id is in this manager's input convertors).
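The dispatch described above can be sketched as follows. This is an illustration under stated assumptions, not the ValEnsPy implementation: it assumes the source_id is read from the dataset's attrs, that convertors are plain callables, and it uses a SimpleNamespace as a stand-in for an xarray dataset. The names make_preprocess and mark_converted are hypothetical.

```python
from types import SimpleNamespace


def make_preprocess(input_convertors):
    """Return a preprocessor that applies the convertor registered for a dataset.

    Assumption: the dataset exposes its source_id via ds.attrs["source_id"].
    """
    def preprocess(ds):
        convertor = input_convertors.get(ds.attrs.get("source_id"))
        if convertor is None:
            # No convertor registered for this source_id: leave the data untouched.
            return ds
        return convertor(ds)
    return preprocess


def mark_converted(ds):
    """Hypothetical convertor: returns a copy flagged as converted."""
    return SimpleNamespace(attrs={**ds.attrs, "converted": True})


preprocess = make_preprocess({"ERA5": mark_converted})

known = preprocess(SimpleNamespace(attrs={"source_id": "ERA5"}))
unknown = SimpleNamespace(attrs={"source_id": "OTHER"})
untouched = preprocess(unknown)
```

Datasets without a registered convertor pass through unchanged, so a single preprocessor can be used for a mixed set of datasets.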

property skipped_files

The files that were skipped during the catalog creation.

update_catalog_from_dataset_info(dataset_name, dataset_root_dir, dataset_pattern, metadata={})

Update the catalog with a new dataset.

For the dataset, parse the dataset information, validate it, add it to the catalog and update the esm_datastore.

Parameters:

dataset_name (str) – The name of the dataset.
dataset_root_dir (str) – The root directory of the dataset.
dataset_pattern (str) – The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format: <identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
metadata (dict, optional) – Additional metadata to include in the catalog. Default is an empty dictionary.
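To make the pattern format more concrete, the sketch below shows one way a pattern with <name> placeholders could be turned into a regular expression and matched against a relative path. It is an assumption about how such a pattern may be interpreted, not the parsing done by the CatalogBuilder; the functions pattern_to_regex and parse_path, the example pattern and the example path are hypothetical. It also assumes each placeholder name occurs only once in the pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn '<name>' placeholders into named groups and escape the fixed parts."""
    # re.split with one capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"<(\w+)>", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # fixed text such as '/', '_fixed_' or '.nc'
        else:
            # Lazy match within one path component (assumption of this sketch).
            regex += f"(?P<{part}>[^/]+?)"
    return re.compile(f"^{regex}$")


def parse_path(pattern, relative_path):
    """Return the identifiers found in relative_path, or None if it does not match."""
    match = pattern_to_regex(pattern).match(relative_path)
    return match.groupdict() if match else None


pattern = "<frequency>/<source_id>_fixed_<variable_id>/<domain_id>_<year>.nc"
identifiers = parse_path(pattern, "daily/ERA5_fixed_tas/europe_2000.nc")
no_match = parse_path(pattern, "daily/readme.txt")
```

In this sketch a path that does not fit the pattern yields None.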

update_catalog_from_yaml(yaml_path)

Update the catalog from a YAML file.

For each dataset, parse the dataset information, validate it, add it to the catalog and update the esm_datastore.

Parameters:

yaml_path (Path) –

The path to the YAML file containing the dataset information: a dictionary of dataset names and their dataset information. The dataset information should contain the following keys:

- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format: <identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
- meta_data: A dictionary containing metadata for the dataset.