Add filter and rerun feature to pipeline #67
Conversation
I have some comments inline, please take a look, thanks!
bluephos/bluephos_pipeline.py
for _, row in df.iterrows():
    if row["z"] is not None and abs(row["z"]) < t_nn:
        if row["ste"] is None or abs(row["ste"]) < t_ste:
            if row["energy diff"] is None:
Instead of iterating over rows here, could we first perform the filter, then yield per-row DataFrames after that? It might be clearer, e.g.
filter = (df["z"] < t_nn) & (df["ste"] < t_ste) & (df["energy diff"].isna())
for _, row in df[filter].iterrows():
    yield row.to_frame().transpose()
I don't think you really even need the blank frame to insert into
Thanks! Updated the code to perform the filter first and then yield each row as a single-row DataFrame, which improves clarity and avoids the need for a blank frame.
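A runnable sketch of that filter-first generator (column names come from this PR; the sample data and threshold values below are made up for illustration, and the mask also preserves the original None handling for "ste"):

```python
import pandas as pd

def yield_passing_rows(df, t_nn, t_ste):
    """Yield each candidate that passes the thresholds as a single-row DataFrame."""
    passing = (
        (df["z"].abs() < t_nn)                            # NN score within threshold
        & (df["ste"].isna() | (df["ste"].abs() < t_ste))  # xTB gap missing or within threshold
        & (df["energy diff"].isna())                      # DFT result not yet computed
    )
    for _, row in df[passing].iterrows():
        yield row.to_frame().transpose()

# Illustrative data: two of the three candidates pass the filter
df = pd.DataFrame(
    {
        "z": [0.2, 2.0, -0.4],
        "ste": [0.1, 0.1, None],
        "energy diff": [None, None, None],
    }
)
frames = list(yield_passing_rows(df, t_nn=1.5, t_ste=1.5))
```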
bluephos/bluephos_pipeline.py
"out_dir": out_dir,
"input_dir": input_dir,
I don't think either out_dir or input_dir need to be in the context?
Yes, they don't need to be here since they are not used in the tasks.
bluephos/bluephos_pipeline.py
ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
out_dir should already be an argument from cli.get_argparser, so we should stick with the one that comes from there.
did you intend to keep this?
No, it's a missed item.
bluephos/bluephos_pipeline.py
ap.add_argument("--input_dir", required=False, help="Directory containing input parquet files")
ap.add_argument("--threshold_file", required=False, help="JSON file containing t_nn, t_ste, and t_ed threshold")
ap.add_argument(
    "--package", required=False, default="orca", choices=["orca", "ase"], help="DFT package to use (default: orca)"
)
the name package seems too general. Maybe dft-package or dft-method or something.
"structure": None,
"z": None,
"xyz": None,
"ste": None,
"energy diff": None,
do these actually need to be here from the start?
Yes, I think all those columns should be initialized here, since the following tasks will not create new columns and we are starting with an empty DataFrame.
We could also utilize the input DataFrame (which was already initialized) to avoid creating an empty one, ensuring that all necessary columns are initialized. However, that approach would require deleting some invalid entries, which may introduce some overhead (in this task, I don't allow ligand pairs to exist in the DataFrame).
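A minimal sketch of that up-front initialization (column names are taken from this PR; the helper name and sample SMILES are hypothetical):

```python
import pandas as pd

# Every column later tasks will fill, declared once so no task needs to add columns.
COLUMNS = [
    "halide_identifier", "halide_SMILES",
    "acid_identifier", "acid_SMILES",
    "structure", "z", "xyz", "ste", "energy diff",
]

def make_pair_frame(halide, acid):
    """Build a single-row DataFrame for one ligand pair with all columns present."""
    row = dict.fromkeys(COLUMNS)  # every column starts as None
    row.update(
        halide_identifier=halide["halide_identifier"],
        halide_SMILES=halide["halide_SMILES"],
        acid_identifier=acid["acid_identifier"],
        acid_SMILES=acid["acid_SMILES"],
    )
    return pd.DataFrame([row], columns=COLUMNS)

pair = make_pair_frame(
    {"halide_identifier": "H1", "halide_SMILES": "Brc1ccccc1"},
    {"acid_identifier": "A1", "acid_SMILES": "OB(O)c1ccccc1"},
)
```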
bluephos/bluephos_pipeline.py
if not input_dir
else [
    OptimizeGeometriesTask,
    DFTTask,
]
interesting way of doing it. I wonder if it would be generally advisable to include the NNTask as well, as it may be likely that when changing thresholds and desiring a rerun, the NN parameters might also be subject to change.
Thanks, this is an important scenario.
If the NN model or parameters change, we need to rerun it.
Currently, the simplest way is to make the NNTask mandatory and always rerun it, since its runtime is negligible compared to xTB and DFT.
A more accurate solution might be to add a command-line flag to indicate if we need to run the NNTask.
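The flag idea could look roughly like this (task names come from the PR; the rerun_nn flag and the list-based selection are assumptions, not the merged implementation):

```python
def select_tasks(input_dir, rerun_nn=False):
    """Pick the task sequence: full pipeline, or a rerun that skips NNTask unless asked."""
    full = ["NNTask", "OptimizeGeometriesTask", "DFTTask"]
    if not input_dir:
        return full  # fresh run: everything, including the NN step
    rerun = ["OptimizeGeometriesTask", "DFTTask"]
    # Rerun mode: only prepend the NN step when the flag requests it
    return ["NNTask"] + rerun if rerun_nn else rerun
```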
if ste is None or abs(ste) >= t_ste or energy_diff is not None:
    logger.info(f"Skipping DFT on molecule {mol_id} based on z or t_ste conditions.")
    return row
we may want to create an issue to come back to this: since we expect a funnel, there will presumably be many cases where this filter skips DFT, and because each such case still takes a task reserving 40 CPUs, this might end up underutilizing the resources more than we'd like. We might want to branch at the previous task and only send passing entries to DFT (e.g. do the filter higher up, with fewer cores).
Thanks for pointing this out. Yes, we can create an issue to re-evaluate this (#70).
I have some thoughts about it: logically, it would be ideal and elegant to allow only ligands that pass the filter to enter the next task (and we can do that).
In practice, for the DFT task the ideal number of nodes is around 20, so we need to allocate the full 40 CPUs whenever the number of valid ligands from the previous xTB step is greater than 2. Even if many invalid ligands "leak" into the DFT step, the real impact on resource utilization will be negligible: as long as the "valid" ligands require more CPUs than the total available on the server, the impact of the leaked ligands can be neglected.
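The "branch at the previous task" idea could be sketched as a simple partition step (illustrative only; how this would be wired into the pipeline framework is out of scope here):

```python
import pandas as pd

def split_for_dft(df, t_ste):
    """Partition xTB results: rows passing the gap filter go to DFT, the rest bypass it."""
    passing = df["ste"].abs() < t_ste
    return df[passing], df[~passing]

# Illustrative data: one of three candidates fails the singlet-triplet gap filter
df = pd.DataFrame({"ste": [0.2, 3.0, -0.5]})
to_dft, skipped = split_for_dft(df, t_ste=1.5)
```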
bluephos/bluephos_pipeline.py
input_dir (str): Directory containing input parquet files.
t_nn (float): Threshold for 'z' score.
t_ste (float): Threshold for 'ste'.
t_ed (float): Threshold for 'energy diff'.
It seems t_ed is not actually filtered on anywhere in the pipeline. This is something they will use at the end to decide on wet-lab experiments?
Yes, t_ed is not currently used since it's part of the final step. I keep the score in the results to avoid recalculating it in the future. We may need to ask Alexander about how to use it.
yeah, that is the core deliverable. Clearly we need the energy diff in the output, but the threshold seems outside the pipeline. For example, after a run they can consume the results and filter them any way they like to decide what to manufacture.
bluephos/tasks/dft.py
xyz_value = remove_second_row(row["xyz"])
logger.info(f"Starting DFT calculation for {base_name}...")
energy_diff = dft_calculator.extract_results(temp_dir, base_name, xyz_value)
row["energy diff"] = energy_diff
since this name is changing anyway, could we make it a bit more portable by both adding "dft" to the name and removing the space? Maybe like dft_ediff or dft_energy_diff. The portability comes from use in SQL (no need for quoting) and dotted-notation dataframe usage.
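A quick illustration of the dotted-access point, assuming the suggested dft_energy_diff name (the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({"dft_energy_diff": [0.25, -0.10]})

# A valid Python identifier permits dotted attribute access on the DataFrame;
# a name with a space ("energy diff") would force df["energy diff"] in pandas
# and quoted identifiers in SQL.
first = df.dft_energy_diff.iloc[0]
```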
thanks for the updates! I think something might have been missed, but I also had some follow-up comments, so please check 😄
bluephos/bluephos_pipeline.py
# Load thresholds from the provided JSON file if available
if args.threshold_file:
one reason is that we can already configure the pipeline from a file. I realize now that we hadn't integrated this well with the context and CLI part, so I opened a PR for that (ssec-jhu/dplutils#96), but until that feature is in, thresholds can already be configured via --set-context t_nn=0.5, for example.
bluephos/bluephos_pipeline.py
ligand_pair_df = initialize_dataframe()

ligand_pair_df.at[0, "halide_identifier"] = halide["halide_identifier"]
ligand_pair_df.at[0, "halide_SMILES"] = halide["halide_SMILES"]
ligand_pair_df.at[0, "acid_identifier"] = acid["acid_identifier"]
ligand_pair_df.at[0, "acid_SMILES"] = acid["acid_SMILES"]

yield ligand_pair_df
I don't see much advantage of this over the previous version, and initialize_dataframe looks only used here - seems simpler and sufficient (?) to remove that function and just keep it as-is?
Yes, there's currently no need for this initialization function.
bluephos/bluephos_pipeline.py
from bluephos.tasks.dft import DFTTask


def initialize_dataframe():
is this function really needed?
Same as above
bluephos/bluephos_pipeline.py
ap.add_argument("--features", required=True, help="Element feature file")
ap.add_argument("--train", required=True, help="Train stats file")
ap.add_argument("--weights", required=True, help="Full energy model weights")
ap.add_argument("--out_dir", required=False, help="Directory for output parquet files")
did you intend to keep this?
bluephos/parameters/threshold.json
"_comment": "Threshold parameters for the BluePhos Discovery Pipeline",
"thresholds": {"t_nn": 1.5, "t_ste": 1.5, "t_ed": 0.3},
"_descriptions": {
    "t_nn": "Threshold for the neural network (NN) score. Determines the maximum allowed absolute value for the NN score to consider a candidate.",
this might be better documented in the argparse help strings or in the module docstring, or even in the module docstring with the argparse description set from it; I was thinking that might even be a good default in the CLI utilities.
Thank you for the suggestion. After consideration, I moved the threshold definition into the CLI for better portability and flexibility.
Thanks for the comments/suggestions!
Thanks!
Goal:
Add a re-run mechanism based on updated filter thresholds.
Provide an interface for switching between DFT calculators (ORCA or ASE).
Implementation:
Store filter thresholds in threshold.json.
Add an "ste" column for xTB energy difference calculations, initializing all the required columns at the start.
Introduce the DFTCalculator class for using different DFT calculators.
Allow the pipeline to run in two modes (which mode applies is determined by whether input_dir is specified on the command line).
Experiment:
Test both modes with different thresholds on a MacBook and a Kubernetes cluster using the /test dataset.