Allow whole config loading to scripts instead of partial loading. #81

Open
Tracked by #41
axiomcura opened this issue Aug 31, 2023 · 1 comment


This issue describes the problems with partially loading configuration into scripts. Partial loading requires extra variables to be created so that each piece of configuration can be passed to a script separately:

# aggregate module
rule aggregate:
    input:
        sql_files=get_data_path(
            input_type=config["aggregate_configs"]["params"]["input_data"],
            use_converted=DATA_CONFIGS["use_converted_plate_data"],
        ),
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        aggregate_profile=get_data_path(input_type="aggregated"),
        cell_counts=get_data_path(input_type="cell_counts"),
    log:
        "logs/aggregate_{basename}.log",
    conda:
        "../envs/cytominer_env.yaml"
    params:
        single_cell_config=config["single_cell_configs"],
        aggregate_config=config["aggregate_configs"],
    script:
        "../scripts/aggregate_cells.py"

To access the configs with partial loading, each one must be declared as a separate variable inside the script:

# aggregate script
aggregate_output = str(snakemake.output["aggregate_profile"])
single_cells_config = snakemake.params["single_cell_config"]

This becomes a bigger problem when more complex scripts require more configurations to be loaded. It increases the number of variables not only within the module but also within the scripts, making it difficult to understand how the configs are being used.
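As a rough sketch of what whole-config loading could look like inside a script: Snakemake exposes the full workflow config to scripts as `snakemake.config`, so a single lookup can replace the per-section `params` entries. The `SimpleNamespace` below is only a stand-in so the snippet runs outside Snakemake, and the config values (`operation`, `strata`, the output path) are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for the `snakemake` object that Snakemake injects into scripts run
# via the `script:` directive. All values below are made up for illustration.
snakemake = SimpleNamespace(
    config={
        "single_cell_configs": {"params": {"strata": ["Image_Metadata_Plate"]}},
        "aggregate_configs": {
            "params": {"input_data": "single_cell", "operation": "median"}
        },
    },
    output={"aggregate_profile": "results/aggregated.csv"},
)

# One object holds every section, so the rule needs no extra `params` entries.
config = snakemake.config
aggregate_params = config["aggregate_configs"]["params"]
single_cell_params = config["single_cell_configs"]["params"]
aggregate_output = str(snakemake.output["aggregate_profile"])
```

With this approach the rule in the module would no longer need `single_cell_config` or `aggregate_config` under `params`, and the script shows exactly which config sections it reads.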

This issue will be part of #41


axiomcura commented Sep 1, 2023

Update

It seems that some module config paths are passed to scripts, which then have to read and parse the configuration files themselves. Snakemake already parses configuration files, so it is redundant to parse YAML files inside scripts. Here's an example:

annotate.smk

rule annotate:
    """
    Generates an annotated profile with the given metadata and stores it
    in the `results/` directory.

    Utilizes pycytominer's annotate module:
    https://github.com/cytomining/pycytominer/blob/master/pycytominer/annotate.py

    :input profiles: single-cell or aggregate profiles.
    :input barcode: file containing unique barcodes that map to a specific plate.
    :input metadata: metadata file associated with single-cell morphology dataset.

    :config: workflow config pointing to annotate configs.

    :output annotated: annotated profile.
    """
    input:
        profile=get_data_path(
            input_type=config["annotate_configs"]["params"]["input_data"],
            use_converted=DATA_CONFIGS["use_converted_plate_data"],
        ),
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        get_data_path(input_type="annotated"),
    conda:
        "../envs/cytominer_env.yaml"
    log:
        "logs/annotate_{basename}.log",
    params:
        annotate_config=config["config_paths"]["annotate"],
    script:
        "../scripts/__annotate.py"

annotate.py

    # loading in annotate configs
    logging.info(f"Loading Annotation configuration from: {config}")

    annotate_path_obj = pathlib.Path(config)
    if not annotate_path_obj.is_file():
        e_msg = "Unable to find Annotation configuration file"
        logging.error(e_msg)
        raise FileNotFoundError(e_msg)

    annotate_config_path = annotate_path_obj.absolute()
    with open(annotate_config_path, "r") as yaml_contents:
        annotate_configs = yaml.safe_load(yaml_contents)["annotate_configs"]["params"]
        logging.info("Annotation configuration loaded")

This code can easily be removed by declaring the config paths at the workflow level with the configfile directive:

# cp_process.smk workflow
configfile: "path/to/general_config.yaml"
configfile: "path/to/workflow_config.yaml"

This will remove the redundant code that exists within all scripts, making them much easier to read.
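A minimal sketch of how the annotate script might look once the config is declared at the workflow level: the file check and YAML parsing go away, and the script only pulls its section out of the already-parsed dictionary. The helper name `get_annotate_params` and the example values are hypothetical; the `annotate_configs` and `params` keys come from the script shown above.

```python
import logging


def get_annotate_params(workflow_config: dict) -> dict:
    """Return the annotate parameters from an already-parsed workflow config."""
    try:
        params = workflow_config["annotate_configs"]["params"]
    except KeyError as err:
        e_msg = f"Missing key in workflow config: {err}"
        logging.error(e_msg)
        raise KeyError(e_msg) from err
    logging.info("Annotation configuration loaded")
    return params


# Inside a script run by Snakemake this would be:
#     annotate_configs = get_annotate_params(snakemake.config)
# A plain dict with made-up values stands in here so the snippet runs on its own.
example_config = {"annotate_configs": {"params": {"input_data": "aggregated"}}}
annotate_configs = get_annotate_params(example_config)
```

The only error case left to handle is a missing key, since locating and reading the file is handled by Snakemake before the script starts.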
