
add filter and rerun feature to pipeline #67

Merged: 15 commits, merged Jul 11, 2024
Conversation

@xiangchenjhu (Collaborator) commented Jul 5, 2024

Goal:

  • Add a re-run mechanism based on updated filter thresholds.
  • Provide an interface for switching between DFT calculators (ORCA or ASE).

Implementation:

  1. Store filter thresholds in threshold.json.

  2. Add an "ste" column for xTB energy difference calculations, initializing all the required columns at the start.

  3. Introduce the DFTCalculator class for using different DFT calculators.

  4. Allow the pipeline to run in two modes:

  • Initial run: Generates ligands from CSV files.
  • Re-run: Generates ligands from Parquet files.

(The mode is selected by whether input_dir is specified on the command line.)

  5. Unify the logging format.
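For illustration, the two-mode dispatch described in point 4 could be sketched like this (the function name is hypothetical, not the actual bluephos API):

```python
# Hypothetical sketch of the two-mode dispatch; names are illustrative.
def select_mode(input_dir):
    """Re-run from parquet files when input_dir is given; otherwise do an
    initial run that generates ligands from CSV files."""
    return "rerun" if input_dir else "initial"
```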

Experiment:
Test both modes with different thresholds on a MacBook and a Kubernetes cluster using the /test dataset.

@xiangchenjhu xiangchenjhu marked this pull request as ready for review July 5, 2024 16:32
@xiangchenjhu xiangchenjhu requested a review from a team as a code owner July 5, 2024 16:32
@amitschang (Member) left a comment

I have some comments inline, please take a look, thanks!

bluephos/bluephos_pipeline.py (outdated, resolved)
Comment on lines 85 to 88
for _, row in df.iterrows():
    if row["z"] is not None and abs(row["z"]) < t_nn:
        if row["ste"] is None or abs(row["ste"]) < t_ste:
            if row["energy diff"] is None:
@amitschang (Member):

Instead of iterating on rows here, could we first perform the filter, then yield per-row df after that? Might be clearer, e.g.

mask = (df["z"].abs() < t_nn) & (df["ste"].abs() < t_ste) & (df["energy diff"].isna())
for _, row in df[mask].iterrows():
    yield row.to_frame().transpose()

I don't think you really even need the blank frame to insert into

@xiangchenjhu (Collaborator, Author):

Thanks! Updated the code to perform the filter first and then yield each row as a single-row DataFrame, improving clarity and removing the need for a blank frame.
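For illustration, the filter-then-yield pattern adopted here might look like the following (the data and thresholds are made up for the example; column names follow the snippet under review):

```python
import pandas as pd

# Toy data mimicking the pipeline frame; values are invented for the example.
df = pd.DataFrame({
    "z": [0.2, 3.0, 0.5],
    "ste": [0.1, 0.1, 2.5],
    "energy diff": [None, None, None],
})
t_nn, t_ste = 1.5, 1.5  # example thresholds

# Filter first, then yield each surviving row as a single-row DataFrame.
mask = (df["z"].abs() < t_nn) & (df["ste"].abs() < t_ste) & df["energy diff"].isna()
single_row_frames = [row.to_frame().transpose() for _, row in df[mask].iterrows()]
print(len(single_row_frames))  # only the first row passes all three filters
```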

Comment on lines 138 to 139
"out_dir": out_dir,
"input_dir": input_dir,
@amitschang (Member):

I don't think either out_dir or input_dir need to be in the context?

@xiangchenjhu (Collaborator, Author):

Yes, they don't need to be here since they are not used in the tasks.

ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
@amitschang (Member):

out_dir should already be an argument from the cli.get_argparser so should stick with the one that comes from there.

@amitschang (Member):

did you intend to keep this?

@xiangchenjhu (Collaborator, Author):

No, that was an oversight; it should be removed.

ap.add_argument("--input_dir", required=False, help="Directory containing input parquet files")
ap.add_argument("--threshold_file", required=False, help="JSON file containing t_nn, t_ste, and t_ed threshold")
ap.add_argument(
    "--package", required=False, default="orca", choices=["orca", "ase"], help="DFT package to use (default: orca)"
)
@amitschang (Member):

the name package seems too general. Maybe dft-package or dft-method or something

Comment on lines 68 to 72
"structure": None,
"z": None,
"xyz": None,
"ste": None,
"energy diff": None,
@amitschang (Member):

do these actually need to be here from the start?

@xiangchenjhu (Collaborator, Author) commented Jul 9, 2024:

Yes, I think all those columns should be initialized here since the following tasks will not create new columns, and we are starting with an empty DataFrame.

We could also reuse the input DataFrame (which is already initialized) instead of creating an empty one, ensuring that all necessary columns are initialized. However, that approach would require deleting some invalid entries, which may introduce overhead (in this task, ligand pairs are not allowed to remain in the DataFrame).
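For illustration only, the up-front column initialization being discussed could be sketched as follows (the column list is inferred from snippets elsewhere in this thread and may not match the code exactly):

```python
import pandas as pd

# Columns inferred from the snippets quoted in this review; illustrative only.
COLUMNS = [
    "halide_identifier", "halide_SMILES", "acid_identifier", "acid_SMILES",
    "structure", "z", "xyz", "ste", "energy diff",
]

def initialize_dataframe():
    # Start from an empty frame with every column present, so downstream
    # tasks only fill values and never create new columns.
    return pd.DataFrame(columns=COLUMNS)

df = initialize_dataframe()
print(df.shape)  # (0, 9)
```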

Comment on lines 123 to 127
if not input_dir
else [
    OptimizeGeometriesTask,
    DFTTask,
]
@amitschang (Member):

interesting way of doing it. I wonder if it would be generally advisable to include the NNTask as well, as it may be likely that when changing thresholds and desiring a rerun that the NN parameters might also be subject to change.

@xiangchenjhu (Collaborator, Author):

Thanks, this is an important scenario.
If the NN model or parameters change, we need to rerun it.
Currently, the simplest approach is to make NNTask mandatory and always rerun it, since its runtime is negligible compared to xTB and DFT.
A more precise solution might be a command-line flag indicating whether to run NNTask.
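The flag-based approach mentioned here could be sketched like this (NNTask, OptimizeGeometriesTask, and DFTTask are task names from this PR, but the flag and wiring below are hypothetical):

```python
def select_rerun_tasks(rerun_nn):
    # Re-runs always include geometry optimization and DFT; prepend NNTask
    # only when the (hypothetical) --rerun_nn flag is set.
    tasks = ["OptimizeGeometriesTask", "DFTTask"]
    if rerun_nn:
        tasks = ["NNTask"] + tasks
    return tasks

print(select_rerun_tasks(True))  # ['NNTask', 'OptimizeGeometriesTask', 'DFTTask']
```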

Comment on lines +49 to +51
if ste is None or abs(ste) >= t_ste or energy_diff is not None:
    logger.info(f"Skipping DFT on molecule {mol_id} based on z or t_ste conditions.")
    return row
@amitschang (Member):
we may want to create an issue to come back to this: since we expect a funnel, we should presume there will be many cases where this filter skips DFT, and each time we have to take a task reserving 40 CPUs, this might end up underutilizing the resources more than we'd like. We might want to branch at the previous task and only send passing entries to DFT (e.g. do the filter higher up, with fewer cores)

@xiangchenjhu (Collaborator, Author) commented Jul 8, 2024:

Thanks for pointing this out. Yes, we can create an issue to re-evaluate this (#70).
I have some thoughts about it: logically, it would be ideal and elegant to allow only ligands that pass the filter to enter the next task (and we can do it).
In practice, for the DFT task, the ideal number of nodes is around 20. Therefore, we need to allocate the full 40 CPUs whenever the number of valid ligands from the previous xTB step is greater than 2. Even if many invalid ligands "leak" into the DFT step, the real impact on resource utilization will be negligible: as long as the valid ligands require more CPUs than the server provides in total, the impact of the leaked ligands can be neglected.

input_dir (str): Directory containing input parquet files.
t_nn (float): Threshold for 'z' score.
t_ste (float): Threshold for 'ste'.
t_ed (float): Threshold for 'energy diff'.
@amitschang (Member):

It seems t_ed is not actually filtered on anywhere in the pipeline. This is something they will use at the end to decide on wet-lab experiments?

@xiangchenjhu (Collaborator, Author):

Yes, t_ed is not currently used since it belongs to the final step. I keep the score in the results to avoid recalculating it in the future. We may need to ask Alexander about how to use it.

@amitschang (Member):

yeah, that is the core deliverable. Clearly we need the energy diff in the output, but the threshold seems outside the pipeline: for example, after a run they can consume the results and filter them any way they like to decide what to manufacture

xyz_value = remove_second_row(row["xyz"])
logger.info(f"Starting DFT calculation for {base_name}...")
energy_diff = dft_calculator.extract_results(temp_dir, base_name, xyz_value)
row["energy diff"] = energy_diff
@amitschang (Member):

since this column is getting a new name anyway, could we make it a bit more portable by both adding "dft" to the name and removing the space? Maybe dft_ediff or dft_energy_diff.

The portability comes from use in SQL (no need for quoting) and dotted-notation DataFrame access.
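For illustration, the rename suggested here is a one-liner in pandas (the example frame is made up):

```python
import pandas as pd

df = pd.DataFrame({"energy diff": [0.12]})
# Rename to a quoting-free column name that supports dotted access.
df = df.rename(columns={"energy diff": "dft_energy_diff"})
print(df.dft_energy_diff.iloc[0])  # 0.12
```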

@xiangchenjhu xiangchenjhu requested a review from amitschang July 9, 2024 16:38
@amitschang (Member) left a comment

thanks for the updates! I think something might have been missed, and I had some follow-up comments as well, please check 😄

Comment on lines 169 to 170
# Load thresholds from the provided JSON file if available
if args.threshold_file:
@amitschang (Member):

one reason is that we can already configure the pipeline from a file. I realize now that we hadn't integrated this well with the context and CLI part, so I opened a PR for that (ssec-jhu/dplutils#96), but the thresholds could already be configured via --set-context t_nn=0.5, for example, until that feature lands
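As a sketch, loading thresholds from the JSON file could look like this (the key layout follows the threshold.json example quoted later in this thread; the function name is hypothetical):

```python
import io
import json

def load_thresholds(fp):
    # Pull the nested "thresholds" mapping out of the config file;
    # fall back to an empty dict if the key is absent.
    return json.load(fp).get("thresholds", {})

cfg = io.StringIO('{"thresholds": {"t_nn": 1.5, "t_ste": 1.5, "t_ed": 0.3}}')
print(load_thresholds(cfg))  # {'t_nn': 1.5, 't_ste': 1.5, 't_ed': 0.3}
```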

Comment on lines 48 to 55
ligand_pair_df = initialize_dataframe()

ligand_pair_df.at[0, "halide_identifier"] = halide["halide_identifier"]
ligand_pair_df.at[0, "halide_SMILES"] = halide["halide_SMILES"]
ligand_pair_df.at[0, "acid_identifier"] = acid["acid_identifier"]
ligand_pair_df.at[0, "acid_SMILES"] = acid["acid_SMILES"]

yield ligand_pair_df
@amitschang (Member):

I don't see much advantage of this over the previous version, and initialize_dataframe is only used here; seems simpler and sufficient to remove that function and keep this as-is?

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Yes, there's currently no need for this initialization function.

from bluephos.tasks.dft import DFTTask


def initialize_dataframe():
@amitschang (Member):

is this function really needed?

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Same as above

ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
@amitschang (Member):

did you intend to keep this?

"_comment": "Threshold parameters for the BluePhos Discovery Pipeline",
"thresholds": {"t_nn": 1.5, "t_ste": 1.5, "t_ed": 0.3},
"_descriptions": {
"t_nn": "Threshold for the neural network (NN) score. Determines the maximum allowed absolute value for the NN score to consider a candidate.",
@amitschang (Member):

might be a good place to document this in the help string of argparse instead, or in the module docstring; or even in the module docstring and set the argparse description from that. I was even thinking that might be a good default in the cli utilities.

@xiangchenjhu (Collaborator, Author):

Thank you for the suggestion. After consideration, I moved the threshold definition into the CLI for better portability and flexibility.

@xiangchenjhu (Collaborator, Author) commented Jul 10, 2024:

Thanks for the comments/suggestions!

@xiangchenjhu xiangchenjhu requested a review from amitschang July 10, 2024 16:25
@amitschang (Member) left a comment

Thanks!

@xiangchenjhu xiangchenjhu merged commit b59be9d into main Jul 11, 2024
6 checks passed