Handling duplicated SMILES with Libinvent #166

marco-chimfarm · 2024-11-25T11:05:53Z

marco-chimfarm
Nov 25, 2024

Hi everyone,

I am trying to use Libinvent to propose R-groups to decorate a scaffold, with maize as the scoring workflow.
However, the output is puzzling to me:

As you can see, the first few compounds all have the same structure; only the attachment point in the R-group SMILES differs.
Following this discussion, I am not using the Diversity Filter, as I understood that the zero-scoring of duplicates is tied to it.

Could you help me figure out what is happening?

halx · 2024-11-26T08:58:22Z

halx
Nov 26, 2024
Maintainer

What exactly is the question? I do not see any of your structures, just scaffold SMILES.

0 replies

marco-chimfarm · 2025-02-24T11:27:20Z

marco-chimfarm
Feb 24, 2025
Author

Hi Hannes, sorry about the long delay, I completely forgot I asked this question...

What I meant is:
Libinvent is producing the same molecule from different R-groups. In this case, it is creating a phenyl ring and moving the attachment point around it (see the "R-groups" column in my main post). Since the phenyl ring is symmetrical, the final product is the same for all of these R-groups.

I was wondering how these duplicates are handled in the REINVENT code, since they all get a score of 0 except for the first one. Could you explain why it has been set up like this? To me it looks like this will give the model "mixed signals" about what to optimise for, making the exploration less efficient.

Again, sorry for the delay in replying and thanks for your assistance.

4 replies

halx Feb 24, 2025
Maintainer

It would be good to see your config file.

Duplicates should only get a score of zero if you are using one of the diversity filters. You need to be careful with those, as almost all of them are (Bemis-Murcko) scaffold based and you can only escape a scaffold by adding another ring (system). Also, those filters include a SMILES component, i.e. every recurring instance of the same SMILES is scored as zero. So, seeing the same scaffold too many times (as per your configuration) or seeing the same SMILES again will result in a zero score.
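
To make the interplay of the scaffold bucket and the SMILES check concrete, here is a minimal Python sketch. It is not REINVENT's implementation: the function `apply_filter`, the in-memory bookkeeping, and the assumption that canonical SMILES and scaffold strings are already computed are all hypothetical. Only `bucket_size` and `minscore` are names borrowed from the REINVENT config, and their exact semantics in REINVENT may differ from what is shown.

```python
from collections import defaultdict


def apply_filter(records, bucket_size=25, minscore=0.4):
    """Hypothetical, simplified scaffold-bucket filter (not REINVENT code).

    records: iterable of (canonical_smiles, scaffold, raw_score) tuples.
    Returns the filtered scores in input order.
    """
    seen_smiles = set()
    bucket_fill = defaultdict(int)  # scaffold -> number of molecules stored
    filtered = []
    for smiles, scaffold, score in records:
        if smiles in seen_smiles:
            # SMILES component: a recurring SMILES is scored as zero.
            filtered.append(0.0)
            continue
        seen_smiles.add(smiles)
        if score >= minscore:
            if bucket_fill[scaffold] >= bucket_size:
                # Scaffold component: the bucket for this scaffold is full.
                score = 0.0
            else:
                bucket_fill[scaffold] += 1
        filtered.append(score)
    return filtered


# Invented example data: four molecules sharing a benzene scaffold.
records = [
    ("CC(C)c1ccccc1", "c1ccccc1", 0.9),
    ("CC(C)c1ccccc1", "c1ccccc1", 0.9),  # same SMILES again
    ("CCc1ccccc1", "c1ccccc1", 0.8),
    ("CCCc1ccccc1", "c1ccccc1", 0.7),    # third molecule for a bucket of 2
]
scores = apply_filter(records, bucket_size=2)
print(scores)
```

In this toy run the second record is zeroed because its SMILES was already seen, and the fourth because the scaffold bucket (size 2 here) is already full.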

marco-chimfarm Feb 24, 2025
Author

Thanks for your reply! I was aware of the diversity filter's interaction with duplicates, so I commented it out in the TOML config. Here is my REINVENT TOML config file:

# https://github.com/MolecularAI/REINVENT4/blob/main/configs/toml/PARAMS.md

run_type = "staged_learning"
device = "cpu"
tb_logdir = "tb_logs_RL"
json_out_config = "_RL.json"

[parameters]
prior_file = "REINVENT4/priors/linkinvent.prior"
agent_file = "REINVENT4/priors/linkinvent.prior"

smiles_file = "REINVENT/tests/test_simple_libinvent/scaffold.smi" 

summary_csv_prefix = "test_simple_libinvent_stage1"

use_checkpoint = false
unique_sequences = true
randomize_smiles = true

[learning_strategy]
type = "dap"
sigma = 128
rate = 0.0001

# [diversity_filter]
# type = "IdenticalMurckoScaffold"
# bucket_size = 25
# minscore = 0.4
# minsimilarity = 0.4
# penalty_multiplier = 0.5

[[stage]]
chkpt_file = 'RL.chkpt'
max_score = 1.0
min_steps = 3
max_steps = 20

[stage.scoring]
type = "custom_sum"
parallel = true

[[stage.scoring.component]]
[stage.scoring.component.NumHeavyAtoms]
[[stage.scoring.component.NumHeavyAtoms.endpoint]]
name = 'NumberOfHA'
weight = 0.5
transform.type = "reverse_sigmoid"
transform.high = 35
transform.low = 20
transform.k = 0.5

Edit: if it can help, I'm using REINVENT from within maize to dock the generated compounds.

halx Feb 24, 2025
Maintainer

Would you be able to share your scaffold(s)?

marco-chimfarm Feb 25, 2025
Author

Hi, I cannot share my scaffold, unfortunately. However, I was able to replicate the issue using Ibuprofen ([*]Cc1ccc(C(C(O)=O)C)cc1) as a scaffold (this time using REINVENT 4.4.22 as standalone software, without Maize).
I have attached the relevant output CSV file: test_simple_libinvent_stage1_1.csv. The config file is the same as the one I previously posted.

halx · 2025-02-26T09:50:30Z

halx
Feb 26, 2025
Maintainer

So, the idea behind zero-scoring duplicate SMILES is to promote a level of diversity. We have run internal tests with this switched off and found that it did not seem to benefit the learning rate but did lower diversity. You also need to keep in mind that the final aggregated total score is a single float value calculated as the average of the individual SMILES scores, so the effect may be rather minimal in practice. If you start sampling an excessive number of duplicates, you are running out of chemical space anyway and should probably stop RL.

0 replies

marco-chimfarm · 2025-02-27T11:41:47Z

marco-chimfarm
Feb 27, 2025
Author

Hi Hannes, thanks for your reply!
I have seen that the v4.5.11 changelog mentions that the issue "Diversity filter setup in config was ignored" was fixed in v4.4.30. Could that be related to this issue?

Just to clarify what you mean: Is the model being updated on the final "complete" (scaffold + R-group) SMILES using the average score of all "complete" SMILES in each step? I.e.:

"complete" SMILES	R-group	Score
CCCC	*CC	100
CCCC	*C(C)	0
CCCC	*(CC)	0

Will the final score that the model uses be 33, i.e. (100 + 0 + 0) / 3?

2 replies

halx Feb 27, 2025
Maintainer

Sorry, my mistake. It is slightly more complicated than that. The total score is actually never averaged; that happens to the loss, which is averaged before backpropagation. See equation 6 in our paper.
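
A small numeric sketch may help separate the two ideas. It assumes the DAP-style loss commonly described for REINVENT, in which each sequence gets its own squared difference between an augmented log-likelihood (prior log-likelihood plus sigma times score) and the agent log-likelihood. Whether this matches equation 6 term by term is an assumption, the helper `dap_losses` is hypothetical, and all likelihood values are invented for illustration; only `sigma = 128` is taken from the config posted above.

```python
def dap_losses(prior_ll, agent_ll, scores, sigma=128.0):
    """Per-sequence squared-error losses, assuming a DAP-style form.

    Hypothetical helper for illustration, not REINVENT code.
    """
    losses = []
    for p, a, s in zip(prior_ll, agent_ll, scores):
        augmented = p + sigma * s  # the score enters per sequence
        losses.append((augmented - a) ** 2)
    return losses


# Three sampled sequences that decode to the same molecule:
# the first keeps its score, the two duplicates are zero-scored.
prior_ll = [-20.0, -21.0, -22.0]  # invented prior log-likelihoods
agent_ll = [-19.0, -20.0, -23.0]  # invented agent log-likelihoods
scores = [1.0, 0.0, 0.0]

losses = dap_losses(prior_ll, agent_ll, scores)
mean_loss = sum(losses) / len(losses)  # averaging happens here, on the loss
print(losses, mean_loss)
```

Under these assumptions each sequence contributes its own loss term, so the scores are never pooled into a single value such as 33. A zero-scored duplicate has an augmented likelihood equal to its prior likelihood, so its term only pulls the agent back toward the prior for that particular sequence.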

halx Feb 27, 2025
Maintainer

The bug that was fixed was that the diversity filter (DF) was not used even when it was set in the config. Since you did not use a DF, it does not affect you.
