Commit a410f6e

Merge pull request #2961 from nexB/add-license-detection
Combine license matches in new LicenseDetection

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
2 parents 9f91bf5 + 8f07fdf commit a410f6e

File tree

2,169 files changed

+514647
-166723
lines changed

CHANGELOG.rst

Lines changed: 121 additions & 47 deletions
@@ -1,19 +1,36 @@
11
Changelog
22
=========
33

4+
v33.0.0 (next next, roadmap)
45

6+
----------------------------
57

6-
v32.0.0 (next next, roadmap)
7-
----------------------------------
8-
9-
Package detection:
10-
~~~~~~~~~~~~~~~~~~
118

129
- We now support new package manifest formats:
1310

1411
- OpenWRT packages.
1512
- Yocto/BitBake .bb recipes.
1613

14+
15+
v32.0.0 (next, roadmap)
16+
-----------------------
17+
18+
Important API changes:
19+
~~~~~~~~~~~~~~~~~~~~~~
20+
21+
This is a major release with major API and output format changes and significant
22+
feature updates.
23+
24+
In particular, we changed the output format for licenses and packages, and
25+
we changed some of the command line options.
26+
27+
The output format version is now 3.0.0
28+
29+
30+
31+
Package detection:
32+
~~~~~~~~~~~~~~~~~~
33+
1734
- Update ``GemfileLockParser`` to track the gem which the Gemfile.lock is for,
1835
which we assign to the new ``GemfileLockParser.primary_gem`` field. Update
1936
``GemfileLockHandler.parse()`` to handle the case where there is a primary gem
@@ -39,48 +56,6 @@ Package detection:
3956

4057
https://github.com/nexB/scancode-toolkit/issues/3081
4158

42-
License detection:
43-
~~~~~~~~~~~~~~~~~~~
44-
45-
- There is a major update to license detection where we now combine one or more
46-
matches in a larger license detection. This removes a large number of false
47-
positive or ambiguous license detections.
48-
49-
- The data structure of the JSON output has changed for licenses. We now
50-
return match details once for each matched license expression rather than
51-
once for each license in a matched expression. There is a new top-level
52-
"license_references" attribute that contains the data details for each
53-
detected license only once. This data can contain the reference license text
54-
as an option.
55-
56-
- There is a new "scancode-reindex-licenses" command that replaces the
57-
"scancode --reindex-licenses" command line option which has been
58-
removed. This new command supports simpler reindexing using custom
59-
license texts and license rules contributed by plugins or stored in an
60-
additional directory. The "--reindex-licenses-for-all-languages" CLI option
61-
is also moved to the "scancode-reindex-licenses" command as an option
62-
"--all-languages".
63-
64-
- We can now detect licenses using custom license texts and license rules.
65-
These can be provided as a one off in a directory or packaged as a plugin
66-
for consistent reuse and deployment. There is an option "--additional-directory"
67-
with the "scancode-reindex-licenses" command and also a new "--only-builtin"
68-
option to only use the builtin licenses to build the cache.
69-
70-
- Scancode LICENSE and RULE files now also contain their data as YAML frontmatter,
71-
which previously used to be in their respective YAML files. This reduces number of
72-
files in those directories, 'rules' and 'licenses' to half. Git line history is
73-
preserved for the files.
74-
75-
- A new command line option "--get-license-data" is added to dump license data in
76-
JSON, YAML and HTML formats, and also generates a local index and a static website
77-
to view the data. This will essentially be an API/way to get scancode license data
78-
as opposed to just reading the files.
79-
80-
81-
Package detection:
82-
~~~~~~~~~~~~~~~~~~~~~
83-
8459
- Code for parsing a Maven POM, npm package.json, freebsd manifest and haxelib
8560
JSON have been separated into two functions: one that creates a PackageData
8661
object from the parsed Resource, and another that calls the previous function
@@ -89,6 +64,105 @@ Package detection:
8964
libraries.
9065

9166

67+
License detection:
68+
~~~~~~~~~~~~~~~~~~~
69+
70+
- This is a major update to license detection where we now combine one or more
71+
license matches in a larger license detection. This approach improves the
72+
accuracy of license detection and removes a large number of false positive
73+
or ambiguous license detections. For details, see
74+
https://github.com/nexB/scancode-toolkit/issues/2878
75+
76+
- The data structure of the JSON output has changed for licenses at file level:
77+
78+
- The ``licenses`` attribute is deleted.
79+
80+
- A new ``license_detections`` attribute contains license detections in that file.
81+
This object has three attributes: ``license_expression``, ``detection_log``
82+
and ``matches``. ``matches`` is a list of license matches and is roughly
83+
the same as ``licenses`` in the previous version with additional structure
84+
changes detailed below.
85+
86+
- A new attribute ``license_clues`` contains license matches with the
87+
same data structure as the ``matches`` attribute in ``license_detections``.
88+
This contains license matches that are mere clues and were not considered
89+
to be a proper conclusive license detection.
90+
91+
- The ``license_expressions`` list of license expressions is deleted and
92+
replaced by a ``detected_license_expression`` single expression.
93+
Similarly ``spdx_license_expressions`` was removed and replaced by
94+
``detected_license_expression_spdx``.
95+
96+
- See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-resource>`_
97+
for examples and details.
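
As a rough illustration of the file-level changes listed above, the sketch below walks one file entry shaped like the new output and summarizes its detections and clues. The helper name ``summarize_file_licenses`` and the sample data are hypothetical; only the attribute names ``license_detections``, ``license_expression``, ``detection_log``, ``matches``, ``license_clues`` and ``detected_license_expression`` come from this changelog. The type of ``detection_log`` and the content of each match are assumptions, so matches are left as empty placeholders.

```python
def summarize_file_licenses(file_entry):
    """Return a small summary of the license data of one scanned file entry."""
    detections = file_entry.get("license_detections") or []
    clues = file_entry.get("license_clues") or []
    return {
        # Single expression that replaces the old ``license_expressions`` list.
        "detected_license_expression": file_entry.get("detected_license_expression"),
        # One expression per detection.
        "detection_expressions": [d.get("license_expression") for d in detections],
        # Total number of matches across all detections.
        "match_count": sum(len(d.get("matches") or []) for d in detections),
        # Matches that are mere clues, not conclusive detections.
        "clue_count": len(clues),
    }


# Hypothetical sample data; ``detection_log`` is assumed to be a list and
# each match is an empty placeholder dict.
sample_file = {
    "path": "src/example.c",
    "detected_license_expression": "mit AND apache-2.0",
    "license_detections": [
        {"license_expression": "mit", "detection_log": [], "matches": [{}]},
        {"license_expression": "apache-2.0", "detection_log": [], "matches": [{}, {}]},
    ],
    "license_clues": [{}],
}

summary = summarize_file_licenses(sample_file)
```

Code that previously read the ``licenses`` list of a file would, under these assumptions, iterate over ``matches`` inside each detection instead.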
98+
99+
- The data structure of license attributes in ``package_data`` and the codebase
100+
level ``packages`` has been updated accordingly:
101+
102+
- There is a new ``license_detections`` attribute for the primary, top-level
103+
declared licenses of a package and an ``other_license_detections`` attribute
104+
for the other secondary detections.
105+
106+
- The ``license_expression`` is replaced by the ``declared_license_expression``
107+
and ``other_license_expression`` attributes with their SPDX counterparts
108+
``declared_license_expression_spdx`` and ``other_license_expression_spdx``.
109+
These expressions are parallel to detections.
110+
111+
- The ``declared_license`` attribute is renamed ``extracted_license_statement``
112+
and is now a YAML-encoded string.
113+
114+
See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-package>`_
115+
for examples and details.
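
The package-level renames above can be captured in a small lookup table. This is only a sketch for spotting data produced before the change: the names ``LEGACY_TO_NEW`` and ``find_legacy_fields`` are hypothetical, and the mapping lists only the replacements named in this changelog entry.

```python
# Legacy package attribute -> attributes that replace it, per the changelog.
LEGACY_TO_NEW = {
    "license_expression": (
        "declared_license_expression",
        "other_license_expression",
    ),
    "declared_license": ("extracted_license_statement",),
}


def find_legacy_fields(package):
    """Return the sorted legacy license attribute names found in a package dict."""
    return sorted(key for key in LEGACY_TO_NEW if key in package)


# Hypothetical sample data for illustration only.
old_style = {"name": "foo", "license_expression": "mit", "declared_license": "MIT"}
new_style = {
    "name": "foo",
    "declared_license_expression": "mit",
    "declared_license_expression_spdx": "MIT",
    "extracted_license_statement": "MIT",
}

legacy_in_old = find_legacy_fields(old_style)
legacy_in_new = find_legacy_fields(new_style)
```

A consumer could run such a check before processing to decide whether a scan result predates the new package license attributes.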
116+
117+
- The license matches structure has changed: we used to report one match for each
118+
license ``key`` of a matched license expression. We now report instead one
119+
single match for each matched license expression, and list the license keys
120+
as a ``licenses`` attribute. This avoids data duplication.
121+
Inside each match, we list each match and matched rule attributes directly
122+
avoiding nesting. See `license updates doc <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#licensematch-result-data>`_
123+
for examples and details.
124+
125+
- There is a new ``--licenses-reference`` command line option to report
126+
reference license metadata and texts once for each license matched across the
127+
scan; we now have two codebase level attributes: ``license_references`` and
128+
``rule_references`` that list unique detected licenses and license rules.
129+
See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#comparision-before-after-license-references>`_
130+
for examples and details.
131+
132+
- We replaced the ``scancode --reindex-licenses`` command line option with a
133+
new separate command named ``scancode-reindex-licenses``.
134+
135+
- The ``--reindex-licenses-for-all-languages`` CLI option is also moved to
136+
the ``scancode-reindex-licenses`` command as an option ``--all-languages``.
137+
138+
- We can now detect licenses using custom license texts and license rules
139+
stored in a directory or packaged as a plugin for consistent reuse and deployment.
140+
141+
- There is an ``--additional-directory`` option with the ``scancode-reindex-licenses``
142+
command to add the licenses from a directory.
143+
144+
- There is also a ``--only-builtin`` option to use only builtin licenses
145+
ignoring any additional license plugins.
146+
147+
- See https://github.com/nexB/scancode-toolkit/issues/480 for more details.
148+
149+
- We combined the license data file and text file of each license in a single
150+
file with a .LICENSE extension. The .yml data file is now included at the
151+
top of each .LICENSE file as "YAML frontmatter". The same applies to license
152+
rules and their .RULE and .yml files. This halves the number of data files
153+
from about 60,000 to 30,000. Git line history is preserved for the combined
154+
text + yml files.
155+
156+
- See https://github.com/nexB/scancode-toolkit/issues/3049
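
To make the combined file layout concrete, here is a minimal sketch that splits such a file into its data header and its text. It assumes the frontmatter is enclosed between two ``---`` lines, which is the usual frontmatter convention but is not stated in this changelog; the function name ``split_frontmatter`` and the sample content are hypothetical, and the header is returned as raw text rather than parsed YAML.

```python
def split_frontmatter(text, delimiter="---"):
    """Split text into (frontmatter, body).

    Return an empty frontmatter and the unchanged text when the text does
    not start with a delimiter line or when no closing delimiter is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != delimiter:
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == delimiter:
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return frontmatter, body
    return "", text


# Hypothetical file content; the field names are for illustration only.
sample = (
    "---\n"
    "key: example-license\n"
    "name: Example License\n"
    "---\n"
    "\n"
    "Permission is granted to use this example.\n"
)

frontmatter, body = split_frontmatter(sample)
```

The raw ``frontmatter`` string could then be handed to a YAML parser of choice to recover the data that used to live in the separate .yml file.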
157+
158+
- There is a new ``--get-license-data`` scancode command line option to export
159+
license data in JSON, YAML and HTML, with indexes and a static website for use
160+
in the licensedb web site. This becomes the API way to get scancode license
161+
data.
162+
163+
See https://github.com/nexB/scancode-toolkit/issues/2738
164+
165+
92166
v31.2.1 - 2022-10-05
93167
----------------------------------
94168

docs/source/explanations/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@
77
:maxdepth: 2
88

99
overview
10+
license-detection-reference
1011

1112
..
1213
[ToAdd]
