
add filter and rerun feature to pipeline #67

Merged: 15 commits, merged Jul 11, 2024
Conversation

@xiangchenjhu (Collaborator) commented Jul 5, 2024

Goal:

  • Add a re-run mechanism based on updated filter thresholds.
  • Provide an interface for switching between DFT calculators (ORCA or ASE).

Implementation:

  1. Store filter thresholds in threshold.json.

  2. Add an "ste" column for xTB energy difference calculations, initializing all the required columns at the start.

  3. Introduce the DFTCalculator class for using different DFT calculators.

  4. Allow the pipeline to run in two modes:

  • Initial run: Generates ligands from CSV files.
  • Re-run: Generates ligands from Parquet files.

(The mode is selected by whether input_dir is specified on the command line.)

  5. Unify the logging format.
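For illustration, the two-mode dispatch described in point 4 could be sketched like this (the function name is hypothetical, not the actual bluephos API):

```python
# Hypothetical sketch of the two-mode dispatch; names are illustrative.
def select_mode(input_dir):
    """Re-run from parquet files when input_dir is given; otherwise do an
    initial run that generates ligands from CSV files."""
    return "rerun" if input_dir else "initial"
```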

Experiment:
Test both modes with different thresholds on a MacBook and a Kubernetes cluster using the /test dataset.

@xiangchenjhu xiangchenjhu marked this pull request as ready for review July 5, 2024 16:32
@xiangchenjhu xiangchenjhu requested a review from a team as a code owner July 5, 2024 16:32
@amitschang (Member) left a comment

I have some comments inline, please take a look, thanks!

bluephos/bluephos_pipeline.py (outdated, resolved)
Comment on lines 85 to 88
for _, row in df.iterrows():
    if row["z"] is not None and abs(row["z"]) < t_nn:
        if row["ste"] is None or abs(row["ste"]) < t_ste:
            if row["energy diff"] is None:
@amitschang (Member):

Instead of iterating on rows here, could we first perform the filter, then yield per-row df after that? Might be clearer, e.g.

mask = (df["z"].abs() < t_nn) & (df["ste"].abs() < t_ste) & (df["energy diff"].isna())
for _, row in df[mask].iterrows():
    yield row.to_frame().transpose()

I don't think you really even need the blank frame to insert into

@xiangchenjhu (Collaborator, Author):

Thanks! Updated the code to perform the filter first and then yield each row as a single-row DataFrame, improving clarity and removing the need for a blank frame.
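For illustration, the filter-then-yield pattern adopted here might look like the following (the data and thresholds are made up for the example; column names follow the snippet under review):

```python
import pandas as pd

# Toy data mimicking the pipeline frame; values are invented for the example.
df = pd.DataFrame({
    "z": [0.2, 3.0, 0.5],
    "ste": [0.1, 0.1, 2.5],
    "energy diff": [None, None, None],
})
t_nn, t_ste = 1.5, 1.5  # example thresholds

# Filter first, then yield each surviving row as a single-row DataFrame.
mask = (df["z"].abs() < t_nn) & (df["ste"].abs() < t_ste) & df["energy diff"].isna()
single_row_frames = [row.to_frame().transpose() for _, row in df[mask].iterrows()]
print(len(single_row_frames))  # only the first row passes all three filters
```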

Comment on lines 138 to 139
"out_dir": out_dir,
"input_dir": input_dir,
@amitschang (Member):

I don't think either out_dir or input_dir need to be in the context?

@xiangchenjhu (Collaborator, Author):

Yes, they don't need to be here since they are not used in the tasks.

ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
@amitschang (Member):

out_dir should already be an argument from the cli.get_argparser so should stick with the one that comes from there.

@amitschang (Member):

did you intend to keep this?

@xiangchenjhu (Collaborator, Author):

No, that was an oversight; it should be removed.

ap.add_argument("--input_dir", required=False, help="Directory containing input parquet files")
ap.add_argument("--threshold_file", required=False, help="JSON file containing t_nn, t_ste, and t_ed threshold")
ap.add_argument(
    "--package", required=False, default="orca", choices=["orca", "ase"], help="DFT package to use (default: orca)"
)
@amitschang (Member):

the name package seems too general. Maybe dft-package or dft-method or something

Comment on lines 68 to 72
"structure": None,
"z": None,
"xyz": None,
"ste": None,
"energy diff": None,
@amitschang (Member):

do these actually need to be here from the start?

@xiangchenjhu (Collaborator, Author) commented Jul 9, 2024:

Yes, I think all those columns should be initialized here since the following tasks will not create new columns, and we are starting with an empty DataFrame.

We could also reuse the input DataFrame (which is already initialized) instead of creating an empty one, ensuring that all necessary columns are initialized. However, that approach would require deleting some invalid entries, which may introduce overhead (in this task, ligand pairs are not allowed to remain in the DataFrame).
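For illustration only, the up-front column initialization being discussed could be sketched as follows (the column list is inferred from snippets elsewhere in this thread and may not match the code exactly):

```python
import pandas as pd

# Columns inferred from the snippets quoted in this review; illustrative only.
COLUMNS = [
    "halide_identifier", "halide_SMILES", "acid_identifier", "acid_SMILES",
    "structure", "z", "xyz", "ste", "energy diff",
]

def initialize_dataframe():
    # Start from an empty frame with every column present, so downstream
    # tasks only fill values and never create new columns.
    return pd.DataFrame(columns=COLUMNS)

df = initialize_dataframe()
print(df.shape)  # (0, 9)
```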

Comment on lines 123 to 127
if not input_dir
else [
    OptimizeGeometriesTask,
    DFTTask,
]
@amitschang (Member):

interesting way of doing it. I wonder if it would be generally advisable to include the NNTask as well, as it may be likely that when changing thresholds and desiring a rerun that the NN parameters might also be subject to change.

@xiangchenjhu (Collaborator, Author):

Thanks, this is an important scenario.
If the NN model or parameters change, we need to rerun it.
Currently, the simplest approach is to make NNTask mandatory and always rerun it, since its runtime is negligible compared to xTB and DFT.
A more precise solution might be a command-line flag indicating whether to run NNTask.
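The flag-based approach mentioned here could be sketched like this (NNTask, OptimizeGeometriesTask, and DFTTask are task names from this PR, but the flag and wiring below are hypothetical):

```python
def select_rerun_tasks(rerun_nn):
    # Re-runs always include geometry optimization and DFT; prepend NNTask
    # only when the (hypothetical) --rerun_nn flag is set.
    tasks = ["OptimizeGeometriesTask", "DFTTask"]
    if rerun_nn:
        tasks = ["NNTask"] + tasks
    return tasks

print(select_rerun_tasks(True))  # ['NNTask', 'OptimizeGeometriesTask', 'DFTTask']
```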

Comment on lines +49 to +51
if ste is None or abs(ste) >= t_ste or energy_diff is not None:
    logger.info(f"Skipping DFT on molecule {mol_id} based on z or t_ste conditions.")
    return row
@amitschang (Member):
we may want to create an issue to come back to this: since we expect a funnel, we should presume there will be many cases where this filter skips DFT, and each time we have to take a task reserving 40 CPUs, this might end up underutilizing the resources more than we'd like. We might want to branch at the previous task and only send passing entries to DFT (e.g. do the filter higher up, with fewer cores)

@xiangchenjhu (Collaborator, Author) commented Jul 8, 2024:

Thanks for pointing this out. Yes, we can create an issue to re-evaluate this (#70).
I have some thoughts about it: logically, it would be ideal and elegant to allow only ligands that pass the filter to enter the next task (and we can do it).
In practice, for the DFT task, the ideal number of nodes is around 20. Therefore, we need to allocate the full 40 CPUs whenever the number of valid ligands from the previous xTB step is greater than 2. Even if many invalid ligands "leak" into the DFT step, the real impact on resource utilization will be negligible: as long as the valid ligands require more CPUs than the server provides in total, the impact of the leaked ligands can be neglected.

input_dir (str): Directory containing input parquet files.
t_nn (float): Threshold for 'z' score.
t_ste (float): Threshold for 'ste'.
t_ed (float): Threshold for 'energy diff'.
@amitschang (Member):

It seems t_ed is not actually filtered on anywhere in the pipeline. This is something they will use at the end to decide on wet-lab experiments?

@xiangchenjhu (Collaborator, Author):

Yes, t_ed is not currently used since it belongs to the final step. I keep the score in the results to avoid recalculating it in the future. We may need to ask Alexander about how to use it.

@amitschang (Member):

yeah, that is the core deliverable. Clearly we need the energy diff in the output, but the threshold seems outside the pipeline: for example, after a run they can consume the results and filter them any way they like to decide what to manufacture

xyz_value = remove_second_row(row["xyz"])
logger.info(f"Starting DFT calculation for {base_name}...")
energy_diff = dft_calculator.extract_results(temp_dir, base_name, xyz_value)
row["energy diff"] = energy_diff
@amitschang (Member):

since this column is getting a new name anyway, could we make it a bit more portable by both adding "dft" to the name and removing the space? Maybe dft_ediff or dft_energy_diff.

The portability comes from use in SQL (no need for quoting) and dotted-notation DataFrame access.
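For illustration, the rename suggested here is a one-liner in pandas (the example frame is made up):

```python
import pandas as pd

df = pd.DataFrame({"energy diff": [0.12]})
# Rename to a quoting-free column name that supports dotted access.
df = df.rename(columns={"energy diff": "dft_energy_diff"})
print(df.dft_energy_diff.iloc[0])  # 0.12
```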

@xiangchenjhu xiangchenjhu requested a review from amitschang July 9, 2024 16:38
@amitschang (Member) left a comment

thanks for the updates! I think something might have been missed, and I had some follow-up comments as well, please check 😄

Comment on lines 169 to 170
# Load thresholds from the provided JSON file if available
if args.threshold_file:
@amitschang (Member):

one reason is that we can already configure the pipeline from a file. I realize now that we hadn't integrated this well with the context and CLI part, so I opened a PR for that (ssec-jhu/dplutils#96), but the thresholds could already be configured via --set-context t_nn=0.5, for example, until that feature lands
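As a sketch, loading thresholds from the JSON file could look like this (the key layout follows the threshold.json example quoted later in this thread; the function name is hypothetical):

```python
import io
import json

def load_thresholds(fp):
    # Pull the nested "thresholds" mapping out of the config file;
    # fall back to an empty dict if the key is absent.
    return json.load(fp).get("thresholds", {})

cfg = io.StringIO('{"thresholds": {"t_nn": 1.5, "t_ste": 1.5, "t_ed": 0.3}}')
print(load_thresholds(cfg))  # {'t_nn': 1.5, 't_ste': 1.5, 't_ed': 0.3}
```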

Comment on lines 48 to 55
ligand_pair_df = initialize_dataframe()

ligand_pair_df.at[0, "halide_identifier"] = halide["halide_identifier"]
ligand_pair_df.at[0, "halide_SMILES"] = halide["halide_SMILES"]
ligand_pair_df.at[0, "acid_identifier"] = acid["acid_identifier"]
ligand_pair_df.at[0, "acid_SMILES"] = acid["acid_SMILES"]

yield ligand_pair_df
@amitschang (Member):

I don't see much advantage of this over the previous version, and initialize_dataframe is only used here; seems simpler and sufficient to remove that function and keep this as-is?

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Yes, there's currently no need for this initialization function.

from bluephos.tasks.dft import DFTTask


def initialize_dataframe():
@amitschang (Member):

is this function really needed?

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Same as above

ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
@amitschang (Member):

did you intend to keep this?

"_comment": "Threshold parameters for the BluePhos Discovery Pipeline",
"thresholds": {"t_nn": 1.5, "t_ste": 1.5, "t_ed": 0.3},
"_descriptions": {
"t_nn": "Threshold for the neural network (NN) score. Determines the maximum allowed absolute value for the NN score to consider a candidate.",
@amitschang (Member):

might be a good place to document this in the help string of argparse instead, or in the module docstring; or even in the module docstring and set the argparse description from that. I was even thinking that might be a good default in the cli utilities.

@xiangchenjhu (Collaborator, Author):

Thank you for the suggestion. After consideration, I moved the threshold definition into the CLI for better portability and flexibility.

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Thanks for the comments/suggestions!

@xiangchenjhu xiangchenjhu requested a review from amitschang July 10, 2024 16:25
@amitschang (Member) left a comment

Thanks!

@xiangchenjhu xiangchenjhu merged commit b59be9d into main Jul 11, 2024
6 checks passed