input.manager.InputManager
- class InputManager(machine: str, datasets_info: dict = None, description=None, input_convertors: dict = {'ALARO_K': <valenspy.input.converter.InputConverter object>, 'CCLM': <valenspy.input.converter.InputConverter object>, 'CLIMATE_GRID': <valenspy.input.converter.InputConverter object>, 'EOBS': <valenspy.input.converter.InputConverter object>, 'ERA5': <valenspy.input.converter.InputConverter object>, 'ERA5-Land': <valenspy.input.converter.InputConverter object>, 'MAR': <valenspy.input.converter.InputConverter object>, 'RADCLIM': <valenspy.input.converter.InputConverter object>}, esmcat_data: dict = {'aggregation_control': {'aggregations': [{'attribute_name': 'time_period_start', 'options': {'dim': 'time'}, 'type': 'join_existing'}, {'attribute_name': 'variable_id', 'type': 'union'}], 'groupby_attrs': ['source_id', 'source_type', 'domain_id', 'experiment_id', 'version', 'resolution', 'frequency', 'driving_source_id', 'institution_id', 'realization', 'post_processing'], 'variable_column_name': 'variable_id'}, 'assets': {'column_name': 'path', 'format': 'netcdf'}, 'attributes': [], 'esmcat_version': '0.1.0', 'id': 'test'}, xarray_open_kwargs: dict = {}, xarray_combine_by_coords_kwargs: dict = {}, intake_esm_kwargs: dict = {'sep': '/'})
Bases: object
A class to find, manage, preprocess and load input data for ValEnsPy.
The InputManager class consists of a ValEnsPy-specific intake-esm catalog (ValenspyEsmDatastore) and a CatalogBuilder. The CatalogBuilder creates the catalog, a df with dataset information per file, from minimal information about the datasets and their path structure. This catalog is then used to create an esm_datastore (ValenspyEsmDatastore), which can be used to search and load the datasets. The InputManager class also provides a preprocessing function, based on the input convertors, that converts the datasets to ValEnsPy compliant format.
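As a rough illustration of how a per-file catalog turns into loadable datasets, the sketch below groups hypothetical catalog rows by a shortened set of groupby_attrs and joins the values with the '/' separator from the default intake_esm_kwargs. The rows, the paths and the three-attribute subset are assumptions made for this example; intake-esm performs the real grouping internally.

```python
from collections import defaultdict

# Shortened for readability; the default esmcat_data lists eleven groupby_attrs.
GROUPBY_ATTRS = ["source_id", "domain_id", "frequency"]
SEP = "/"  # same separator as the default intake_esm_kwargs={'sep': '/'}

# Hypothetical catalog rows, one per file (all values are made up).
rows = [
    {"source_id": "ERA5", "domain_id": "europe", "frequency": "daily",
     "variable_id": "tas", "path": "/data/ERA5/tas_1995.nc"},
    {"source_id": "ERA5", "domain_id": "europe", "frequency": "daily",
     "variable_id": "tas", "path": "/data/ERA5/tas_1996.nc"},
    {"source_id": "ERA5", "domain_id": "europe", "frequency": "daily",
     "variable_id": "pr", "path": "/data/ERA5/pr_1995.nc"},
    {"source_id": "EOBS", "domain_id": "europe", "frequency": "daily",
     "variable_id": "tas", "path": "/data/EOBS/tas_1995.nc"},
]


def group_files(rows):
    """Map each dataset key to the files that would be combined under it."""
    groups = defaultdict(list)
    for row in rows:
        key = SEP.join(row[attr] for attr in GROUPBY_ATTRS)
        groups[key].append(row["path"])
    return dict(groups)


groups = group_files(rows)
for key, paths in groups.items():
    print(key, len(paths))
```

Files that share a key are the ones the default esmcat_data would concatenate along time (join_existing on time_period_start) and merge across variables (union on variable_id).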
- __init__(machine: str, datasets_info: dict = None, description=None, input_convertors: dict = {'ALARO_K': <valenspy.input.converter.InputConverter object>, 'CCLM': <valenspy.input.converter.InputConverter object>, 'CLIMATE_GRID': <valenspy.input.converter.InputConverter object>, 'EOBS': <valenspy.input.converter.InputConverter object>, 'ERA5': <valenspy.input.converter.InputConverter object>, 'ERA5-Land': <valenspy.input.converter.InputConverter object>, 'MAR': <valenspy.input.converter.InputConverter object>, 'RADCLIM': <valenspy.input.converter.InputConverter object>}, esmcat_data: dict = {'aggregation_control': {'aggregations': [{'attribute_name': 'time_period_start', 'options': {'dim': 'time'}, 'type': 'join_existing'}, {'attribute_name': 'variable_id', 'type': 'union'}], 'groupby_attrs': ['source_id', 'source_type', 'domain_id', 'experiment_id', 'version', 'resolution', 'frequency', 'driving_source_id', 'institution_id', 'realization', 'post_processing'], 'variable_column_name': 'variable_id'}, 'assets': {'column_name': 'path', 'format': 'netcdf'}, 'attributes': [], 'esmcat_version': '0.1.0', 'id': 'test'}, xarray_open_kwargs: dict = {}, xarray_combine_by_coords_kwargs: dict = {}, intake_esm_kwargs: dict = {'sep': '/'})
Initialize an InputManager.
- Parameters:
machine (str) – The name of the catalog. If datasets_info is not passed, it is used to load the dataset information from the built-in dataset_paths.yaml file.
datasets_info (dict | str, optional) –
A dictionary containing datasets and their dataset information needed to build the catalog. This can be a dictionary or a path to a YAML file. Default is None. If None, the built-in dataset information for the provided machine is used, if it exists. The dictionary should contain dataset names as keys and their dataset information as values. The dataset information should contain the following keys:
- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format:
<identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
- meta_data: A dictionary containing metadata for the dataset.
description (str, optional) – A description of the catalog. Default is None. This is used to create the description in the intake-esm catalog.
input_convertors (dict, optional) – A dictionary containing input convertors for the datasets. The keys are dataset names and the values are functions that convert the dataset to ValEnsPy compliant format. Default is INPUT_CONVERTORS, the set of predefined input convertors.
esmcat_data (dict, optional) – A dictionary containing the description of esmcat data to create the intake-esm catalog. Default is esmcat_default_data. See intake_esm.esm_datastore.from_dict for more information.
xarray_open_kwargs (dict, optional) – A dictionary containing default arguments to pass to xarray_open_kwargs in intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
xarray_combine_by_coords_kwargs (dict, optional) – A dictionary containing default arguments to pass to xarray_combine_by_coords_kwargs in intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
intake_esm_kwargs (dict, optional) – A dictionary containing additional arguments for the creation of the intake_esm catalog. Default is {'sep': '/'}. See intake_esm.esm_datastore for more information.
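The shape of datasets_info described above can be sketched as a plain dictionary. The dataset name, paths, placeholder names and metadata values below are hypothetical, and the missing_keys helper is only an illustration of the documented keys, not the validation that ValEnsPy itself performs.

```python
# The three keys the parameter description lists for each dataset.
DOCUMENTED_KEYS = ("root", "pattern", "meta_data")

# Hypothetical example: one dataset with made-up paths and metadata.
datasets_info = {
    "ERA5": {
        "root": "/path/to/ERA5",
        "pattern": "<frequency>/<variable_id>/era5_<year>.nc",
        "meta_data": {"source_type": "reanalysis", "domain_id": "europe"},
    },
}


def missing_keys(datasets_info):
    """Return {dataset_name: [missing documented keys]} for incomplete entries."""
    problems = {}
    for name, info in datasets_info.items():
        missing = [key for key in DOCUMENTED_KEYS if key not in info]
        if missing:
            problems[name] = missing
    return problems


print(missing_keys(datasets_info))
```

A dictionary of this shape (or a YAML file with the same structure) is what the datasets_info parameter expects.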
Methods
__init__(machine[, datasets_info, ...]) – Initialize an InputManager.
add_input_convertor(dataset_name, ...) – Add an input convertor to the InputManager.
update_catalog_from_dataset_info(...[, metadata]) – Update the catalog with a new dataset.
update_catalog_from_yaml(yaml_path) – Update the catalog from a YAML file.
Attributes
intake_to_xarray_kwargs – Easy access to the kwargs that can be passed to intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
preprocess – A preprocessor function to convert the input dataset to ValEnsPy compliant data.
skipped_files – The files that were skipped during the catalog creation.
- _update_catalog(dataset_name, dataset_info_dict)
Update the catalog (df with dataset information per file) with a new dataset.
The catalog_builder is used to parse the dataset information and update the catalog.
- Parameters:
dataset_info_dict (dict) – A dictionary containing dataset information. The keys are dataset names and the values are dictionaries with the following keys:
- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset.
- meta_data: A dictionary containing metadata for the dataset.
- add_input_convertor(dataset_name, input_convertor)
Add an input convertor to the InputManager.
- property intake_to_xarray_kwargs
Easy access to the kwargs that can be passed to intake_esm.esm_datastore.to_dataset_dict, intake_esm.esm_datastore.to_dask and/or intake_esm.esm_datastore.to_datatree.
Three types of kwargs are created:
- preprocess: The preprocessor function, which applies the input convertor to the dataset if an input convertor exists (i.e. source_id is in INPUT_CONVERTORS).
- xarray_open_kwargs: The kwargs to be passed to xarray.open_dataset.
- xarray_combine_by_coords_kwargs: The kwargs to be passed to xarray.combine_by_coords.
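A minimal sketch of how such a bundle of three kwargs could be assembled is given below. The build_intake_to_xarray_kwargs function and the identity stand-in are hypothetical names used for illustration; only the three keyword names come from the description above, and the property's real implementation may differ.

```python
def identity(ds):
    """Stand-in for the manager's preprocess function (hypothetical)."""
    return ds


def build_intake_to_xarray_kwargs(preprocess, xarray_open_kwargs=None,
                                  xarray_combine_by_coords_kwargs=None):
    """Bundle the three kwargs under the names intake-esm expects."""
    return {
        "preprocess": preprocess,
        # Copy the dicts so later changes by the caller do not leak in.
        "xarray_open_kwargs": dict(xarray_open_kwargs or {}),
        "xarray_combine_by_coords_kwargs": dict(xarray_combine_by_coords_kwargs or {}),
    }


kwargs = build_intake_to_xarray_kwargs(
    identity, xarray_open_kwargs={"chunks": {"time": 365}}
)
print(sorted(kwargs))
# With a real catalog, a bundle like this could be unpacked into a call such
# as catalog.to_dataset_dict(**kwargs); `catalog` is not defined in this sketch.
```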
- property preprocess
A preprocessor function to convert the input dataset to ValEnsPy compliant data.
This function applies the input convertor to the dataset if an input convertor exists (i.e. source_id is in this manager's input convertors).
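The dispatch described here can be sketched with plain dictionaries standing in for xarray datasets. How the real preprocessor determines the source_id is not stated on this page; reading it from the dataset attributes is an assumption of this sketch, as are the make_preprocess and rename_t2m names and the t2m-to-tas renaming.

```python
from typing import Callable, Dict


def make_preprocess(input_convertors: Dict[str, Callable[[dict], dict]]):
    """Build a preprocess function that dispatches on source_id."""
    def preprocess(ds: dict) -> dict:
        # Assumption: the source_id is available in the dataset attributes.
        source_id = ds.get("attrs", {}).get("source_id")
        convertor = input_convertors.get(source_id)
        # Datasets without a registered convertor pass through unchanged.
        return convertor(ds) if convertor is not None else ds
    return preprocess


def rename_t2m(ds: dict) -> dict:
    """Hypothetical convertor: rename the variable 't2m' to 'tas'."""
    data_vars = {("tas" if k == "t2m" else k): v for k, v in ds["data_vars"].items()}
    return {**ds, "data_vars": data_vars}


convertors = {}
convertors["ERA5"] = rename_t2m  # comparable in spirit to add_input_convertor
preprocess = make_preprocess(convertors)

era5 = {"attrs": {"source_id": "ERA5"}, "data_vars": {"t2m": [280.1, 281.3]}}
other = {"attrs": {"source_id": "UNKNOWN"}, "data_vars": {"t2m": [279.0]}}
print(list(preprocess(era5)["data_vars"]))
print(list(preprocess(other)["data_vars"]))
```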
- property skipped_files
The files that were skipped during the catalog creation.
- update_catalog_from_dataset_info(dataset_name, dataset_root_dir, dataset_pattern, metadata={})
Update the catalog with a new dataset.
For the dataset, parse the dataset information, validate it, add it to the catalog and update the esm_datastore.
- Parameters:
dataset_name (str) – The name of the dataset.
dataset_root_dir (str) – The root directory of the dataset.
dataset_pattern (str) – The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format: <identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
metadata (dict, optional) – Additional metadata to include in the catalog. Default is an empty dictionary.
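To make the pattern format more concrete, the sketch below turns a pattern with <name> placeholders into a regular expression with named groups and applies it to a relative path. This is an illustration only and not the CatalogBuilder's actual parsing logic; the pattern_to_regex function, the placeholder names and the example path are all hypothetical, and placeholder values are assumed to contain neither '/' nor '_'.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '<name>' placeholders into named regex groups (illustrative only)."""
    # With one capture group, re.split alternates: literal, name, literal, ...
    parts = re.split(r"<(\w+)>", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # fixed text between placeholders
        else:
            regex += f"(?P<{part}>[^/_]+)"  # placeholder value
    return re.compile(regex)


pattern = "<institution_id>/<frequency>/<source_id>_fixed_part_<variable_id>/<domain_id>_<year>.nc"
relative_path = "RMIB/daily/ERA5_fixed_part_tas/EUR_1995.nc"

match = pattern_to_regex(pattern).fullmatch(relative_path)
info = match.groupdict() if match else None
print(info)
```

Each placeholder name must be unique in this sketch, because Python does not allow a named group to be repeated within one regular expression.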
- update_catalog_from_yaml(yaml_path)
Update the catalog from a YAML file.
For each dataset, parse the dataset information, validate it, add it to the catalog and update the esm_datastore.
- Parameters:
yaml_path (Path) –
The path to the YAML file containing the datasets information: a dictionary of dataset names and their dataset information. The dataset information should contain the following keys:
- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format:
<identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
- meta_data: A dictionary containing metadata for the dataset.