input.esm_catalog_builder.CatalogBuilder

class CatalogBuilder(catalog_id, datasets_info: dict | str = None)

Bases: object

__init__(catalog_id, datasets_info: dict | str = None)

Initialize the CatalogBuilder with a catalog ID and dataset information.

Parameters:
  • catalog_id (str) – The ID of the catalog. If datasets_info is not provided, this ID is used to load the pre-existing dataset information shipped with ValEnsPy, if it exists.

  • datasets_info (dict | str, optional) –

    The datasets and their dataset information needed to build the catalog, given either as a dictionary or as a path to a YAML file. Default is None, in which case the built-in dataset information for the provided catalog_id is used, if it exists. The dictionary should contain dataset names as keys and their dataset information as values. Each dataset information entry should contain the following keys:

    • root: The root directory of the dataset.

    • pattern: The regex pattern for matching files in the dataset. This is the path relative to the root, in the following format:

      <identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc

    • meta_data: A dictionary containing metadata for the dataset.
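As an illustration of the structure described above, the following is a minimal sketch of a datasets_info dictionary. The dataset name, root path, identifier names and metadata values are all hypothetical, and the import path in the trailing comment is an assumption inferred from the module name on this page, not something this page confirms.

```python
# Hypothetical dataset description: one dataset name mapped to its
# dataset information (root, pattern, meta_data).
datasets_info = {
    "MY_DATASET": {
        # Root directory of the dataset (illustrative path).
        "root": "/data/my_dataset",
        # Path relative to the root; <...> marks an identifier placeholder.
        # The identifier names used here are made up for this example.
        "pattern": (
            "<source_id>/<experiment_id>/"
            "<frequency>_fixed_part_<variable_id>/<member_id>_<year>.nc"
        ),
        # Dataset-level metadata (illustrative key and value).
        "meta_data": {"domain_id": "EUR-11"},
    }
}

# Simple check that every dataset entry carries the three documented keys.
DOCUMENTED_KEYS = {"root", "pattern", "meta_data"}
missing_keys = {
    name: sorted(DOCUMENTED_KEYS - info.keys())
    for name, info in datasets_info.items()
    if DOCUMENTED_KEYS - info.keys()
}

# Assumed usage (import path not confirmed by this page):
# from valenspy.input.esm_catalog_builder import CatalogBuilder
# builder = CatalogBuilder("my_catalog", datasets_info=datasets_info)
```

The same structure could be stored in a YAML file and passed by path instead of as a dictionary.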

Methods

__init__(catalog_id[, datasets_info])

Initialize the CatalogBuilder with a catalog ID and dataset information.

add_dataset(dataset_name, dataset_info)

Update the dataset information for a specific dataset.

create_df()

Create a catalog by scanning dataset paths and extracting metadata.

_process_dataset_for_catalog(dataset_name, dataset_info)

Process all files in a dataset and extract metadata for each file.

Given a dataset name and its dataset information, this method parses every file in the dataset and returns, for each file, the metadata parsed from the file name together with the dataset-level metadata.

Parameters:
  • dataset_name (str) – The name of the dataset to process.

  • dataset_info (dict) – The dataset information containing the root directory, regex pattern, and metadata.

Returns:

A list of dictionaries containing metadata for each file in the dataset.

Return type:

list
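To make the idea of parsing metadata from file names concrete, here is a minimal sketch that turns a placeholder pattern into a regular expression with named groups and merges the parsed values with the dataset-level metadata. It is not ValEnsPy's implementation: the helper names pattern_to_regex and process_paths are hypothetical, the example works on a list of relative paths rather than scanning the file system, and it assumes that identifier values contain neither "/" nor "_" and that each placeholder name occurs only once in the pattern.

```python
import re
from pathlib import PurePosixPath


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert '<name>' placeholders into named capture groups (sketch)."""
    # re.split with one capture group alternates literal text and names:
    # [literal, name, literal, name, ..., literal]
    parts = re.split(r"<(\w+)>", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)          # fixed text, matched literally
        else:
            regex += f"(?P<{part}>[^/_]+)"    # identifier value
    return re.compile(regex)


def process_paths(dataset_name, dataset_info, relative_paths):
    """Return one metadata dictionary per path that matches the pattern."""
    regex = pattern_to_regex(dataset_info["pattern"])
    root = PurePosixPath(dataset_info["root"])
    records = []
    for relative_path in relative_paths:
        match = regex.fullmatch(relative_path)
        if match is None:
            continue  # skip files that do not follow the pattern
        records.append(
            {
                "dataset": dataset_name,
                **dataset_info["meta_data"],   # dataset-level metadata
                **match.groupdict(),           # metadata from the file name
                "path": str(root / relative_path),
            }
        )
    return records


# Hypothetical dataset information and file listing.
info = {
    "root": "/data/demo",
    "pattern": "<source_id>/<frequency>_fixed_part_<variable_id>/<member_id>_<year>.nc",
    "meta_data": {"domain_id": "EUR-11"},
}
rows = process_paths(
    "DEMO",
    info,
    ["modelA/day_fixed_part_tas/r1_1995.nc", "modelA/README.txt"],
)
```

Only the first path follows the pattern, so rows holds a single dictionary that combines the parsed identifiers with the dataset-level metadata.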

_validate_dataset_info()

Validate the dataset information to ensure all required identifiers are present.
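A check of this kind could look like the sketch below. This page does not state which identifiers are required or where they may come from, so the REQUIRED_IDENTIFIERS set, the helper name missing_identifiers, and the rule that an identifier may be supplied by either the pattern or meta_data are all assumptions made for illustration.

```python
import re

# Hypothetical set of required identifiers; the real set is not given here.
REQUIRED_IDENTIFIERS = {"source_id", "variable_id"}


def missing_identifiers(dataset_info, required=REQUIRED_IDENTIFIERS):
    """List required identifiers found in neither the pattern nor meta_data."""
    in_pattern = set(re.findall(r"<(\w+)>", dataset_info["pattern"]))
    in_meta_data = set(dataset_info.get("meta_data", {}))
    return sorted(set(required) - in_pattern - in_meta_data)


# source_id appears nowhere, so it is reported as missing.
incomplete = {"root": "/data/demo", "pattern": "<variable_id>_<year>.nc", "meta_data": {}}
problems = missing_identifiers(incomplete)

# Supplying source_id through meta_data satisfies the check.
complete = {**incomplete, "meta_data": {"source_id": "modelA"}}
no_problems = missing_identifiers(complete)
```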

add_dataset(dataset_name, dataset_info)

Update the dataset information for a specific dataset.

Parameters:
  • dataset_name (str) – The name of the dataset to update.

  • dataset_info (dict) – The new dataset information for this dataset.

create_df()

Create a catalog by scanning dataset paths and extracting metadata.