Skip to content

[ENG-5862] SPAM - Fix Wiki Spamming #11171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: feature/pbs-25-10
Choose a base branch
from

Conversation

antkryt
Copy link
Contributor

@antkryt antkryt commented Jun 5, 2025

Purpose

verify if spammy domains are detected

Changes

  • merge check_resource_for_domains_postcommit and check_resource_with_spam_services tasks to avoid race condition
  • compare note to value not enum
  • log detected domains to sentry

QA Notes

You can test it with domain xakw1.com on staging3. Currently project won't be banned with this domain, regardless of whether it's public or not.

Documentation

Side Effects

Ticket

https://openscience.atlassian.net/browse/ENG-5862

@antkryt
Copy link
Contributor Author

antkryt commented Jun 5, 2025

The issue described in the ticket is not entirely accurate. I found the domain "xawk1.com" on staging, and content containing this domain will not be banned even if the project is public (here is an example). And on the other side, if you repeat the steps from the ticket using some other domains, everything works as expected.

However, the spam check will always be triggered because the DomainReference is always created and you can verify it in django admin (the only place it’s created is in the function _check_resource_for_domains). Therefore, only two possibilities remain:

if notable_domain.note == NotableDomain.Note.EXCLUDE_FROM_ACCOUNT_CREATION_AND_CONTENT is not being triggered
or
resource.confirm_spam(save=True, domains=list(spammy_domains)) silently fails (changes are not saved in the database or racing condition and changes are overwritten with some other process).

Probably changes made by check_resource_for_domains_postcommit are overwritten by check_resource_with_spam_services. Both of them start almost at the same time and load same (non-spam) version of the resource at the beginning. Which task is completed last, such changes will be saved in the database (typically check_resource_with_spam_services ends last because we make request to external service)

Copy link
Contributor

@Johnetordoff Johnetordoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antkryt I'm a little confused exactly what this accomplishes, could you write some test cases to illustrate how you are changing the current behavior. I see that the tests are changed, but I don't see new test cases. A good test case here should show the spam content or spam domain of a wiki is checked when a project is made public.

@antkryt
Copy link
Contributor Author

antkryt commented Jun 10, 2025

@Johnetordoff I'm not sure that it's possible to write a test case to illustrate what I'm changing here, so I'll try to explain better.

Both check_resource_for_domains_postcommit and check_resource_with_spam_services are doing the same thing: start after response is sent to client (run_postcommit decorator) -> load node with spam_status=UNKNOWN -> process -> save changes to db.

As you can see, tasks don't know anything about each other and changes that are made (it's two different parallel processes). So, if check_resource_with_spam_services finishes after check_resource_for_domains_postcommit, then changes made by check_resource_for_domains_postcommit will be overwritten (and vice versa). Locally everything works because SPAM_SERVICES_ENABLED is false and only check_resource_for_domains_postcommit is running.

Also, I've changed signature of the def _check_resource_for_domains(guid, content) to def _check_resource_for_domains(resource, content) and removed confirm_spam() call from it, that's why I've updated some tests to make them work as before

As for checking spam during privacy change tests, we have bunch of those (see osf_tests/test_node.py::TestNodeSpam). I'll add a few tests to check if new check_resource_for_spam_postcommit function works properly

Copy link
Contributor

@Johnetordoff Johnetordoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not quite the behavior we are looking for, but correct techniques.


@mock.patch.object(settings, 'SPAM_SERVICES_ENABLED', True)
@mock.patch('osf.external.spam.tasks._check_resource_for_domains')
def test_check_resource_for_spam_postcommit_with_spammy_domains(self, mock_check_domains, project, user):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THe ticket instructions to reproduce the bug are:

  1. Create project
  2. Update wiki with spammy domain
  3. Make project public

These are the steps the tests should follow, you are testing behavior that is too specific we know checking for spam domans works, when a project is saved when public, but we need to check a project's wiki content too, not just it's desciption etc.

Copy link
Contributor

@Johnetordoff Johnetordoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good new tests, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants