Skip to content

Check for ECFP collisions #330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 12, 2024
Merged

Check for ECFP collisions #330

merged 7 commits into from
Sep 12, 2024

Conversation

stewarthe6
Copy link
Collaborator

Sometimes different smiles result in the same ECFP features. Now you will receive a warning when that happens

Copy link
Collaborator

@paulsonak paulsonak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. do you need to featurize the smiles or can you call this after the features have been calculated for the model/etc?
  2. Might be worth adding a recommendation to the warning like, "we recommend trying a larger ECFP radius to reduce collisions"

Copy link
Collaborator

@mcloughlin2 mcloughlin2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Amanda, it would be helpful to suggest increasing the ECFP radius in the warning message. Otherwise the code looks fine.

@stewarthe6
Copy link
Collaborator Author

stewarthe6 commented Sep 11, 2024

  1. I think I already do? I have two functions. check_ecfp_collisions which featurizes smiles strings before calling check_ecfp_feature_collisions which does the actual check. I call check_ecfp_feature_collisions in featurizer.py after features have been calculated.
  2. I updated the warning to suggest increasing ecfp_radius.

I have two followup questions:

  1. Do we need to escape literal SMILES strings when they have backslashes?
  2. Instead of logging a warning, should this be a warnings.warn?

@paulsonak
Copy link
Collaborator

do we need to escape literal SMILES strings when they have backslashes?

When would this need to happen? I think in general, no. In specific pandas operations (str.replace for instance) you can specify regex=False but I don't think any of that applies here. We would be getting errors and failures all the time if this was needed right?

@mcloughlin2
Copy link
Collaborator

When you specify a SMILES string as a string literal (e.g.,
smiles = "[O]/C(C1=CC=CC1)=C([S])\N"), you should escape the backslashes; otherwise you may get a syntax error from Python complaining about a malformed escape sequence. If you’re reading the SMILES from a file, you don’t have to (and should not) escape them. The exception to the latter rule is if the SMILES string is going to be passed to some other piece of code (like a web app) that is going to interpret it as a string literal; then you have to double the backslashes. We have to do this when passing SMILES strings to the Allchemy REST API, for example.

@stewarthe6 stewarthe6 merged commit 9d21c4a into 1.7.0 Sep 12, 2024
8 of 11 checks passed
@stewarthe6 stewarthe6 deleted the feat_ecfp_collision branch September 12, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants