Improve score by supporting `extra_phrase` for extra words in rules #4432

alok1304 · 2025-06-19T10:56:02Z

Follow-up to:

Improve score by supporting extra_phrase for extra words in rules #4424

Add a new kind of phrase, `extra_phrase`, dedicated to extra-words. It is written as `[[n]]`, where `n` is the maximum number of extra-words allowed at that position in the rule.

If extra-words appear at the correct position and their count does not exceed the allowed limit n, then the score is increased to 100.
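The behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes whitespace tokenization and a greedy left-to-right alignment, and the names `parse_rule`, `extra_words_fit` and `adjusted_score` are hypothetical.

```python
import re

# Matches an extra-phrase marker such as [[3]] (assumed to stand alone as a token).
MARKER = re.compile(r"\[\[(\d+)\]\]")


def parse_rule(rule_text):
    """
    Split a rule into its lowercase tokens and a mapping of
    {token position: max extra words allowed just before that position}.
    """
    tokens = []
    allowed = {}
    for part in rule_text.split():
        marker = MARKER.fullmatch(part)
        if marker:
            position = len(tokens)
            allowed[position] = allowed.get(position, 0) + int(marker.group(1))
        else:
            tokens.append(part.lower())
    return tokens, allowed


def extra_words_fit(rule_text, query_text):
    """
    Return True if every extra word in the query sits at a [[n]] marker
    and no marker receives more than n extra words.
    Greedy sketch: an extra word equal to the next rule token is not handled.
    """
    tokens, allowed = parse_rule(rule_text)
    query = query_text.lower().split()
    qi = 0
    for position, token in enumerate(tokens):
        extra = 0
        # Skip query words until the next rule token is found.
        while qi < len(query) and query[qi] != token:
            extra += 1
            qi += 1
        if qi == len(query):
            return False  # a rule token is missing from the query
        if extra > allowed.get(position, 0):
            return False  # extra words in the wrong place, or too many
        qi += 1
    trailing = len(query) - qi
    return trailing <= allowed.get(len(tokens), 0)


def adjusted_score(score, rule_text, query_text):
    """Raise the score to 100 only when the extra words fit the markers."""
    return 100.0 if extra_words_fit(rule_text, query_text) else score


rule = "Neither the name of [[3]] nor the names of its contributors"
ok = "Neither the name of the ACME Corp nor the names of its contributors"
misplaced = "Neither the name of nor the names of all its contributors"

print(extra_words_fit(rule, ok))         # True
print(extra_words_fit(rule, misplaced))  # False
```

In this sketch a position without a marker allows zero extra words, so a single misplaced word is enough to leave the original score untouched.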

Reference #4420

Tasks

Reviewed contribution guidelines
PR is descriptively titled 📑 and links the original issue above 🔗
Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
Run tests locally to check for errors.
Commits are in uniquely-named feature branch and has no merge conflicts 📁
Updated documentation pages (if applicable)
Updated CHANGELOG.rst (if applicable)

Signed-off-by: Alok Kumar alokkumarjipura9973@gmail.com

AyanSinhaMahapatra

Thanks @alok1304! Looking much better

See comments for your consideration. I've updated your PR description to mention that this is a follow-up PR: the previous PR contains important context and reviews that we need to preserve.

src/licensedcode/data/rules/bsd-new_578.RULE

src/licensedcode/detection.py

AyanSinhaMahapatra · 2025-06-23T14:50:35Z

src/licensedcode/detection.py

+    """
+    Return True if any of the matches in the ``license_matches`` list of
+    LicenseMatch has extra words that are in the correct place.
+    """


We need to check both cases explicitly:

For all the matches which have extra words, the extra words are in the correct location

For all the matches which do not have extra words, they are correct detections

And add a test accordingly
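One possible shape for such a combined check and its test, as a sketch only: the `Match` dataclass and its fields are hypothetical stand-ins for `LicenseMatch`, and "correct detection" is assumed here to mean a score of 100.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical stand-in for LicenseMatch, for illustration only.
    rule_id: str
    score: float
    has_extra_words: bool
    extra_words_in_place: bool = False


def is_acceptable(match):
    """
    A match with extra words must have them in the allowed positions;
    a match without extra words must be a correct detection
    (assumed here to mean a score of 100).
    """
    if match.has_extra_words:
        return match.extra_words_in_place
    return match.score == 100


def all_matches_acceptable(matches):
    """Return True only if every match passes its own check."""
    return bool(matches) and all(is_acceptable(m) for m in matches)


def test_all_matches_acceptable():
    good = [
        Match("bsd-new_578.RULE", 99.0, has_extra_words=True, extra_words_in_place=True),
        Match("mit_1.RULE", 100, has_extra_words=False),
    ]
    bad = good + [Match("gpl_2.RULE", 80.0, has_extra_words=False)]
    assert all_matches_acceptable(good)
    assert not all_matches_acceptable(bad)
```

Using `all` rather than `any` means one imperfect match without extra words is enough to block the score increase for the whole detection.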

> And add a test accordingly

Where should I add the test, and how should I implement this check for all license_matches?

Also, I added a test for is_extra_words_position_valid, see: https://github.com/aboutcode-org/scancode-toolkit/pull/4432/files#diff-d1520ccce311f8f4d4932ba68589bc23b098f8090a307b0b440edb4a846ae21cR1325

alok1304 · 2025-06-24T06:33:56Z

I added a test for 3-seq where no copyright statements are detected. Ref: https://github.com/xyzzy-022/xyzzy/blob/5a16eb998470241b33ad3caa6a4946d0448a16b6/LEGAL.md?plain=1#L97
When we scan this file we get extra-words, so I added an extra-phrase marker in the corresponding matched rule to improve the score.

…_log` Add a test that checks whether `extra-words` are in the correct position according to the `extra-phrases` present in rules. If the `extra-words` are in the right place, the score is set to `100`, and `detection_log` records why the score was increased so that this can be tracked. Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>

Add a new kind of phrase, `extra_phrase`, dedicated to extra-words. It is written as `[[n]]`, where `n` is the maximum number of extra-words allowed at that position in the rule. If extra-words appear at the correct position and their count does not exceed the allowed limit `n`, then the score is increased to `100`. Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>