input.esm_catalog_builder.CatalogBuilder
- class CatalogBuilder(catalog_id, datasets_info: dict | str = None)
Bases: object
- __init__(catalog_id, datasets_info: dict | str = None)
Initialize the CatalogBuilder with a catalog ID and dataset information.
- Parameters:
catalog_id (str) – The ID of the catalog. If datasets_info is not provided, this ID is used to load the built-in dataset information from ValEnsPy, if it exists.
datasets_info (dict | str, optional) –
The datasets and their dataset information needed to build the catalog, given either as a dictionary or as a path to a YAML file. Default is None, in which case the built-in dataset information for the provided catalog_id is used if it exists. The dictionary should contain dataset names as keys and their dataset information as values. Each dataset information entry should contain the following keys:
- root: The root directory of the dataset.
- pattern: The regex pattern for matching files in the dataset. This is the relative path starting from the root, in the following format:
<identifier_name>/<identifier_name>/<identifier_name>_fixed_part_<variable_id>/<another_identifier>_<year>.nc
- meta_data: A dictionary containing metadata for the dataset.
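As an illustration of the structure described above, the following is a minimal sketch of a datasets_info dictionary together with a small check that each entry carries the three documented keys. The dataset name, root path, identifier names, and metadata values are hypothetical, and the helper missing_keys is not part of CatalogBuilder.

```python
# Hypothetical dataset entry: the dataset name, root path, identifier names,
# and metadata values are illustrative assumptions, not ValEnsPy built-ins.
datasets_info = {
    "EXAMPLE_DATASET": {
        "root": "/data/example_dataset",
        "pattern": (
            "<institution_id>/<frequency>_fixed_part_<variable_id>/"
            "<source_id>_<year>.nc"
        ),
        "meta_data": {"source_type": "observations"},
    }
}

# The three keys documented for each dataset information entry.
REQUIRED_KEYS = {"root", "pattern", "meta_data"}


def missing_keys(info: dict) -> dict:
    """Return, per dataset, the documented keys that are absent (hypothetical helper)."""
    result = {}
    for name, entry in info.items():
        absent = sorted(REQUIRED_KEYS - set(entry))
        if absent:
            result[name] = absent
    return result


problems = missing_keys(datasets_info)
print(problems)  # an empty dict means every entry has root, pattern, and meta_data
```

A dictionary of this shape, or a YAML file with the same nesting, is what the datasets_info argument is described as accepting.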
Methods
__init__(catalog_id[, datasets_info]) – Initialize the CatalogBuilder with a catalog ID and dataset information.
add_dataset(dataset_name, dataset_info) – Update the dataset information for a specific dataset.
Create a catalog by scanning dataset paths and extracting metadata.
- _process_dataset_for_catalog(dataset_name, dataset_info)
Process all files in a dataset and extract metadata for each file.
Given a dataset name and its information, this function parses every file in the dataset and returns, for each file, the metadata parsed from the file name combined with the dataset-level metadata.
- Parameters:
dataset_name (str) – The name of the dataset to process.
dataset_info (dict) – The dataset information containing the root directory, regex pattern, and metadata.
- Returns:
A list of dictionaries containing metadata for each file in the dataset.
- Return type:
list
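The idea of combining metadata parsed from a file path with dataset-level metadata can be sketched as follows. This is not the ValEnsPy implementation: the helpers pattern_to_regex and process_paths are hypothetical, the sketch assumes identifier values contain neither "/" nor "_", it assumes each identifier name appears only once in the pattern, and it takes a list of relative paths rather than scanning the file system so that it stays self-contained.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '<name>' placeholders into named regex groups (hypothetical helper)."""
    # re.split with one capture group alternates literal text (even indices)
    # and placeholder names (odd indices).
    parts = re.split(r"<(\w+)>", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))
        else:
            # Assumption: identifier values contain no '/' and no '_'.
            pieces.append(f"(?P<{part}>[^/_]+)")
    return re.compile("^" + "".join(pieces) + "$")


def process_paths(dataset_name: str, dataset_info: dict, relative_paths: list) -> list:
    """Return one metadata dictionary per relative path that matches the pattern."""
    regex = pattern_to_regex(dataset_info["pattern"])
    records = []
    for rel in relative_paths:
        match = regex.match(rel)
        if match is None:
            continue  # skip files that do not follow the pattern
        record = {
            "dataset_name": dataset_name,
            **dataset_info["meta_data"],  # dataset-level metadata
            **match.groupdict(),          # metadata parsed from the path
        }
        record["path"] = dataset_info["root"].rstrip("/") + "/" + rel
        records.append(record)
    return records


# Hypothetical dataset information and file names, for illustration only.
example_info = {
    "root": "/data/example_dataset",
    "pattern": "<institution_id>/<frequency>_fixed_part_<variable_id>/<source_id>_<year>.nc",
    "meta_data": {"source_type": "observations"},
}
records = process_paths(
    "EXAMPLE_DATASET",
    example_info,
    ["ABC/day_fixed_part_tas/model1_2001.nc", "README.txt"],
)
```

Here records holds a single dictionary for the first path, since README.txt does not match the pattern.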
- _validate_dataset_info()
Validate the dataset information to ensure all required identifiers are present.
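One way such a check could work is sketched below, under stated assumptions: the set of required identifiers is not listed in this section, so REQUIRED_IDENTIFIERS is hypothetical, as is the helper find_missing_identifiers. The sketch treats an identifier as present if it appears either as a placeholder in the pattern or as a key in meta_data.

```python
import re

# Hypothetical required identifiers; the actual set used by ValEnsPy is not
# stated in this section.
REQUIRED_IDENTIFIERS = {"variable_id", "frequency", "source_id"}


def find_missing_identifiers(dataset_info: dict, required=REQUIRED_IDENTIFIERS) -> list:
    """List required identifiers found neither in the pattern nor in meta_data."""
    in_pattern = set(re.findall(r"<(\w+)>", dataset_info["pattern"]))
    in_meta = set(dataset_info.get("meta_data", {}))
    return sorted(set(required) - in_pattern - in_meta)


# Illustrative dataset information that lacks a source_id.
example_info = {
    "root": "/data/example_dataset",
    "pattern": "<frequency>_fixed_part_<variable_id>/<year>.nc",
    "meta_data": {"source_type": "observations"},
}
missing = find_missing_identifiers(example_info)
print(missing)
```

Supplying the missing identifier through meta_data, for example "source_id": "model1", would make the list empty in this sketch.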