Skip to content

Improve score by supporting extra_phrase for extra words in rules #4432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

alok1304
Copy link
Collaborator

@alok1304 alok1304 commented Jun 19, 2025

Follow up of:

Add new phrases like extra_phrase this is special for extra-words. This phrase is represented in the format [[n]], where n indicates the maximum number of extra-words allowed at that position in the rule.

If extra-words appear at the correct position and their count does not exceed the allowed limit n, then the score is increased to 100.

Reference #4420

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Signed-off-by: Alok Kumar alokkumarjipura9973@gmail.com

Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alok1304! Looking much better

See comments for your consideration. I've updated your PR description to mention that this is a follow up PR, since there is important context and reviews in the previous PR, we need to preserve this as required.

"""
Return True if any of the matches in ``license_matches`` List of LicenseMatch
has extra words are in the correct place.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check both a bit explicitly:

  1. For all the matches which have extra words, they are in correct location
  2. For all the matches which does not have extra words, they are correct detections

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a test accordingly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a test accordingly

where I should add a test and like how I implement for all license_matches

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alok1304 alok1304 force-pushed the improve-score-extra-words branch from b9c7c16 to a50b3db Compare June 24, 2025 06:27
@alok1304
Copy link
Collaborator Author

alok1304 commented Jun 24, 2025

I addet test for 3-seq where there is no detection of copyrights statements , Ref: https://github.com/xyzzy-022/xyzzy/blob/5a16eb998470241b33ad3caa6a4946d0448a16b6/LEGAL.md?plain=1#L97
this file when we scan we got extra-words so I added extra-phrase marker in that corresponding matched rule. Such that we can improve the score.

@alok1304 alok1304 force-pushed the improve-score-extra-words branch from 8a25b51 to 43c6bdb Compare June 24, 2025 07:44
alok1304 added 9 commits July 1, 2025 17:01
…_log`

Add test for is correct position of `extra-words` according to `extra-phrases` that is present in rules.

if we find `extra-words` are in the right place then we set score to `100`.
And also show in `detection_log` why we increasing the score to keep track of this.

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Add new phrases like `extra_phrase` this is special for extra-words.
This phrase is represented in the format [[n]], where n indicates the maximum number of extra-words allowed at that position in the rule.

If extra-words appear at the correct position and their count does not exceed the allowed limit `n`, then the score is increased to `100`.

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
due to `extra_phrase` in rules, this shows that rules containing `extra-words`

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
add a new `extra-phrase` for a rule i.e bsd-new

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
@alok1304 alok1304 force-pushed the improve-score-extra-words branch from b38f718 to 61284f4 Compare July 1, 2025 11:31
alok1304 added 2 commits July 1, 2025 19:09
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
…rds`

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
@alok1304
Copy link
Collaborator Author

alok1304 commented Jul 2, 2025

Actually, the previous test case is failing because of extra-phrase is removed from rules while loading, after that when scan same rules then matcher is 3-seq not any one of these"1-hash", "1-spdx-id", "2-aho" this is due to extra-phrase marker that are present in rules. This test case is failing https://github.com/aboutcode-org/scancode-toolkit/blob/develop/tests/licensedcode/test_detection_validate.py#L101

   def check_rule_or_license_can_be_detected_exactly(licensish):
        """
        Check that a rule or license can be detected exactly, either by thyself (or with an exact match
        to any rule for deprecated rules).
        """
        idx = cache.get_index()
        deadline = time() + 20  # ms
        matches = idx.match(query_string=licensish.text, _skip_hash_match=True, deadline=deadline)
        # ensure we can self-detect exactly
        expected = [licensish.identifier]
        results = [m.rule.identifier for m in matches]
    
        if results != expected:
            expected.append(f'file://{licensish.rule_file()}')
            assert results == expected
        icm = is_correct_detection(matches)
        if not icm:
            expected.append(f'file://{licensish.rule_file()}')
>           assert results == expected

E           AssertionError: assert ['bsd-new_newlib3.RULE'] == ['bsd-new_newlib3.RULE',\n 'file:///home/vsts/work/1/s/src/licensedcode/data/rules/bsd-new_newlib3.RULE']
E             Right contains one more item: 'file:///home/vsts/work/1/s/src/licensedcode/data/rules/bsd-new_newlib3.RULE'
E             Full diff:
E               [
E                'bsd-new_newlib3.RULE',
E             -  'file:///home/vsts/work/1/s/src/licensedcode/data/rules/bsd-new_newlib3.RULE',
E               ]

because is_correct_detection gives False because it not include 3-seq in matcher see: https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/licensedcode/detection.py#L1090
That's why the test case is failing. So, I added 3-seq see: https://github.com/aboutcode-org/scancode-toolkit/pull/4432/files#diff-5f6ccc1d8f9540cae715bddaf337246f4c69c48ca596b2086e05b18c541aef4aR1098

Now, all the test cases are passed :)

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
@@ -1087,7 +1095,7 @@ def is_correct_detection(license_matches):
]

return (
all(matcher in ("1-hash", "1-spdx-id", "2-aho") for matcher in matchers)
all(matcher in ("1-hash", "1-spdx-id", "2-aho", "3-seq") for matcher in matchers)
Copy link
Collaborator Author

@alok1304 alok1304 Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i add 3-seq because of extra-phrase is removed from rules while loading, after that when we scan same rules, now matcher is 3-seq not any one of these "1-hash", "1-spdx-id", "2-aho" this is due to extra-phrase marker that are present in rules. That's why this gives 3-seq. We already loaded the rules where extra-phrase is removed, but not in the actual rule file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants