input.esm_catalog_builder.CatalogBuilder#

class CatalogBuilder(catalog_id, datasets_info: dict | str = None)[source]#

Bases: object

__init__(catalog_id, datasets_info: dict | str = None)[source]#

Initialize the CatalogBuilder with a catalog ID and dataset information.

Parameters:

catalog_id (str) – The ID of the catalog. If datasets_info is not provided, this is used to load the pre-existing dataset info from ValEnsPy, if it exists.
datasets_info (dict | str, optional) –
The datasets and their dataset information needed to build the catalog, given either as a dictionary or as a path to a YAML file. Default is None, in which case the built-in dataset info for the provided catalog_id is used if it exists. The dictionary should contain dataset names as keys and their dataset information as values. Each dataset information entry should contain the following keys:

- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format:

<identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc

- meta_data: A dictionary containing metadata for the dataset.
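To illustrate the structure described above, a datasets_info dictionary might look like the following sketch. The dataset name, root path, identifier names, and metadata values are hypothetical and only show the expected shape; they are not taken from ValEnsPy.

```python
# Hypothetical example of the datasets_info structure described above.
# The dataset name, root path, identifier names and metadata values are made up.
datasets_info = {
    "MY_DATASET": {
        "root": "/data/my_dataset",
        "pattern": (
            "<institution_id>/<source_id>/"
            "<frequency>_fixed_part_<variable_id>/<experiment_id>_<year>.nc"
        ),
        "meta_data": {"domain_id": "EUR-11"},
    }
}

# Each dataset entry is expected to carry these three keys.
EXPECTED_KEYS = {"root", "pattern", "meta_data"}

# Map each dataset name to the keys it lacks (empty when all entries are complete).
missing = {
    name: sorted(EXPECTED_KEYS - info.keys())
    for name, info in datasets_info.items()
    if EXPECTED_KEYS - info.keys()
}
print(missing)
```

A dictionary of this shape would be passed as the datasets_info argument of CatalogBuilder; the same structure stored in a YAML file could be passed by path instead.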

Methods

`__init__`(catalog_id[, datasets_info])	Initialize the CatalogBuilder with a catalog ID and dataset information.
`add_dataset`(dataset_name, dataset_info)	Update the dataset information for a specific dataset.
`create_df`()	Create a catalog by scanning dataset paths and extracting metadata.

_process_dataset_for_catalog(dataset_name, dataset_info)[source]#

Process all files in a dataset and extract metadata for each file.

Given a dataset name and its information, this function parses every file in the dataset and returns, for each file, the metadata parsed from the file name together with the dataset-level metadata.

Parameters:

dataset_name (str) – The name of the dataset to process.
dataset_info (dict) – The dataset information containing the root directory, regex pattern, and metadata.

Returns:

A list of dictionaries containing metadata for each file in the dataset.

Return type:

list
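The following is a minimal sketch of how such per-file parsing could work, not the ValEnsPy implementation. It assumes that `<name>` placeholders in the pattern become named regex groups and that each record merges the dataset-level metadata with the values parsed from the path. The helpers `pattern_to_regex` and `process_files`, the record keys `dataset` and `path`, and the example paths are all hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '<name>' placeholders into named regex groups (hypothetical helper)."""
    # re.split with a capturing group alternates: literal, name, literal, name, ...
    parts = re.split(r"<(\w+)>", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))  # fixed text, matched literally
        else:
            # Non-greedy match within one path component. Placeholder names
            # must be unique, since a regex cannot reuse a group name.
            pieces.append(f"(?P<{part}>[^/]+?)")
    return re.compile("^" + "".join(pieces) + "$")


def process_files(relative_paths, dataset_name, dataset_info):
    """Return one metadata dict per path that matches the dataset pattern."""
    regex = pattern_to_regex(dataset_info["pattern"])
    records = []
    for rel_path in relative_paths:
        match = regex.match(rel_path)
        if match is None:
            continue  # skip files that do not follow the pattern
        record = {
            "dataset": dataset_name,
            **dataset_info["meta_data"],
            **match.groupdict(),
        }
        record["path"] = f"{dataset_info['root']}/{rel_path}"
        records.append(record)
    return records


# Made-up dataset info and file listing, for illustration only.
info = {
    "root": "/data/my_dataset",
    "pattern": "<source_id>/<frequency>_fixed_part_<variable_id>/<experiment_id>_<year>.nc",
    "meta_data": {"domain_id": "EUR-11"},
}
paths = ["ERA5/day_fixed_part_tas/historical_1990.nc", "README.txt"]
records = process_files(paths, "MY_DATASET", info)
print(records)
```

In this sketch the file listing is passed in explicitly so the example runs without touching the file system; a real builder would obtain it by scanning the root directory. Values that themselves contain the separator used in the pattern (for example an underscore) make the match ambiguous, which is one reason the fixed parts of the pattern matter.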

_validate_dataset_info()[source]#

Validate the dataset information to ensure all required identifiers are present.
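One way such a check could be sketched is shown below. It assumes that an identifier counts as present when it appears either as a `<name>` placeholder in the pattern or as a key in meta_data. The set of required identifiers and the helper `missing_identifiers` are hypothetical; the identifiers ValEnsPy actually requires are not listed on this page.

```python
import re

# Hypothetical set of required identifiers; the real list is defined by
# ValEnsPy and is not documented on this page.
REQUIRED_IDENTIFIERS = {"source_id", "variable_id"}


def missing_identifiers(dataset_info: dict, required=REQUIRED_IDENTIFIERS) -> set:
    """Return required identifiers found in neither the pattern nor meta_data."""
    in_pattern = set(re.findall(r"<(\w+)>", dataset_info["pattern"]))
    in_meta = set(dataset_info.get("meta_data", {}))
    return set(required) - in_pattern - in_meta


# Made-up entries: the first is complete, the second lacks both identifiers.
complete = {
    "root": "/data/x",
    "pattern": "<variable_id>/<year>.nc",
    "meta_data": {"source_id": "X"},
}
incomplete = {"root": "/data/y", "pattern": "<year>.nc", "meta_data": {}}

print(missing_identifiers(complete))
print(sorted(missing_identifiers(incomplete)))
```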

add_dataset(dataset_name, dataset_info)[source]#

Update the dataset information for a specific dataset.

Parameters:

dataset_name (str) – The name of the dataset to update.
dataset_info (dict) – The new dataset information to update.

create_df()[source]#

Create a catalog by scanning dataset paths and extracting metadata.