An app to detect licenses from the provided input license text #450
base: main
Changes from 61 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# | ||
# http://nexb.com and https://github.com/nexB/scancode.io | ||
# The ScanCode.io software is licensed under the Apache License version 2.0. | ||
# Data generated with ScanCode.io is provided as-is without warranties. | ||
# ScanCode is a trademark of nexB Inc. | ||
# | ||
# You may not use this software except in compliance with the License. | ||
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software distributed | ||
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR | ||
# CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations under the License. | ||
# | ||
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from | ||
# ScanCode.io should be considered or used as legal advice. Consult an Attorney | ||
# for any legal advice. | ||
# | ||
# ScanCode.io is a free software code scanning tool from nexB Inc. and others. | ||
# Visit https://github.com/nexB/scancode.io for support and download. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# | ||
# http://nexb.com and https://github.com/nexB/scancode.io | ||
# The ScanCode.io software is licensed under the Apache License version 2.0. | ||
# Data generated with ScanCode.io is provided as-is without warranties. | ||
# ScanCode is a trademark of nexB Inc. | ||
# | ||
# You may not use this software except in compliance with the License. | ||
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software distributed | ||
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR | ||
# CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations under the License. | ||
# | ||
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from | ||
# ScanCode.io should be considered or used as legal advice. Consult an Attorney | ||
# for any legal advice. | ||
# | ||
# ScanCode.io is a free software code scanning tool from nexB Inc. and others. | ||
# Visit https://github.com/nexB/scancode.io for support and download. | ||
|
||
from django.apps import AppConfig | ||
|
||
|
||
class ScantextConfig(AppConfig): | ||
name = "scantext" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# | ||
# http://nexb.com and https://github.com/nexB/scancode.io | ||
# The ScanCode.io software is licensed under the Apache License version 2.0. | ||
# Data generated with ScanCode.io is provided as-is without warranties. | ||
# ScanCode is a trademark of nexB Inc. | ||
# | ||
# You may not use this software except in compliance with the License. | ||
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software distributed | ||
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR | ||
# CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations under the License. | ||
# | ||
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from | ||
# ScanCode.io should be considered or used as legal advice. Consult an Attorney | ||
# for any legal advice. | ||
# | ||
# ScanCode.io is a free software code scanning tool from nexB Inc. and others. | ||
# Visit https://github.com/nexB/scancode.io for support and download. | ||
|
||
from django import forms | ||
|
||
|
||
class LicenseScanForm(forms.Form): | ||
input_text = forms.CharField( | ||
strip=False, | ||
widget=forms.Textarea( | ||
attrs={ | ||
"rows": 15, | ||
"class": "textarea has-fixed-size", | ||
"placeholder": "Paste your license text here.", | ||
} | ||
), | ||
required=False, | ||
) | ||
input_file = forms.FileField( | ||
required=False, | ||
widget=forms.ClearableFileInput( | ||
attrs={"class": "file-input", "multiple": False}, | ||
), | ||
) |
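Both form fields are declared with `required=False`, so whatever consumes the form has to decide which input to scan and what to do when neither is given. A minimal sketch of that decision in plain Python, assuming pasted text takes precedence over an upload; the helper name `resolve_scan_input`, the precedence rule, and the error message are hypothetical and not part of this PR:

```python
def resolve_scan_input(input_text, uploaded_bytes, encoding="utf-8"):
    """Return the text to scan, or raise ValueError when both inputs are empty.

    Hypothetical helper: pasted text wins over the uploaded file.
    """
    if input_text and input_text.strip():
        # Return the text unchanged, since the form keeps whitespace (strip=False).
        return input_text
    if uploaded_bytes:
        # Replace undecodable bytes instead of failing on unexpected encodings.
        return uploaded_bytes.decode(encoding, errors="replace")
    raise ValueError("Provide license text or upload a file.")
```

In a Django form the same check would more likely live in `clean()`, so the error is shown next to the fields.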
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
# | ||
# Copyright (c) nexB Inc. and others. All rights reserved. | ||
# ScanCode is a trademark of nexB Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text. | ||
# See https://github.com/nexB/scancode-toolkit for support or download. | ||
# See https://aboutcode.org for more information about nexB OSS projects. | ||
# | ||
|
||
import attr | ||
from licensedcode import query | ||
from licensedcode.spans import Span | ||
from licensedcode.stopwords import STOPWORDS | ||
from licensedcode.tokenize import index_tokenizer | ||
from licensedcode.tokenize import matched_query_text_tokenizer | ||
|
||
TRACE = False | ||
TRACE_MATCHED_TEXT = False | ||
TRACE_MATCHED_TEXT_DETAILS = False | ||
|
||
|
||
def logger_debug(*args): | ||
pass | ||
|
||
|
||
if TRACE or TRACE_MATCHED_TEXT or TRACE_MATCHED_TEXT_DETAILS: | ||
|
||
use_print = True | ||
if use_print: | ||
prn = print | ||
else: | ||
import logging | ||
import sys | ||
|
||
logger = logging.getLogger(__name__) | ||
# logging.basicConfig(level=logging.DEBUG, stream=sys.stdout) | ||
logging.basicConfig(stream=sys.stdout) | ||
logger.setLevel(logging.DEBUG) | ||
prn = logger.debug | ||
|
||
def logger_debug(*args): | ||
return prn(" ".join(a if isinstance(a, str) else repr(a) for a in args)) | ||
|
||
def _debug_print_matched_query_text(match, extras=5): | ||
""" | ||
Print a matched query text including `extras` tokens before and after | ||
the match. Used for debugging license matches. | ||
""" | ||
# Create a fake new match with extra tokens before and after | ||
new_match = match.combine(match) | ||
new_qstart = max([0, match.qstart - extras]) | ||
new_qend = min([match.qend + extras, len(match.query.tokens)]) | ||
new_qspan = Span(new_qstart, new_qend) | ||
new_match.qspan = new_qspan | ||
|
||
logger_debug(new_match) | ||
logger_debug(" MATCHED QUERY TEXT with extras") | ||
qt = new_match.matched_text(whole_lines=False) | ||
logger_debug(qt) | ||
|
||
|
||
@attr.s(slots=True) | ||
class Token: | ||
""" | ||
Used to represent a token in collected query-side matched texts and SPDX | ||
identifiers. | ||
|
||
``match_ids`` is a list of LicenseMatch ids to accommodate overlapping matches. | ||
For example, say we have these two matched text portions: | ||
QueryText: this is licensed under GPL or MIT | ||
Match1: this is licensed under GPL | ||
Match2: licensed under GPL or MIT | ||
|
||
Each Token would be assigned one or more LicenseMatch: | ||
this: Match1 : yellow | ||
is: Match1 : yellow | ||
licensed: Match1, Match2 : orange (mixing yellow and pink colors) | ||
under: Match1, Match2 : orange (mixing yellow and pink colors) | ||
GPL: Match1, Match2 : orange (mixing yellow and pink colors) | ||
or: Match2 : pink | ||
MIT: Match2 : pink | ||
""" | ||
|
||
# original text value for this token. | ||
value = attr.ib() | ||
|
||
# line number, one-based | ||
line_num = attr.ib() | ||
|
||
# absolute position for known tokens, zero-based. -1 for unknown tokens | ||
pos = attr.ib(default=-1) | ||
|
||
# True if text/alpha False if this is punctuation or spaces | ||
is_text = attr.ib(default=False) | ||
|
||
# True if part of a match | ||
is_matched = attr.ib(default=False) | ||
|
||
# True if this is a known token | ||
is_known = attr.ib(default=False) | ||
|
||
# List of LicenseMatch ids that match this token | ||
match_ids = attr.ib(attr.Factory(list)) | ||
|
||
|
||
def tokenize_matched_text( | ||
location, | ||
query_string, | ||
dictionary, | ||
start_line=1, | ||
trace=TRACE_MATCHED_TEXT_DETAILS, | ||
): | ||
""" | ||
Yield Token objects with pos and line number collected from the file at | ||
`location` or the `query_string` string. `dictionary` is the index mapping | ||
of tokens to token ids. | ||
""" | ||
pos = 0 | ||
qls = query.query_lines( | ||
location=location, | ||
query_string=query_string, | ||
strip=False, | ||
start_line=start_line, | ||
) | ||
for line_num, line in qls: | ||
if trace: | ||
logger_debug( | ||
" tokenize_matched_text:", "line_num:", line_num, "line:", line | ||
) | ||
|
||
for is_text, token_str in matched_query_text_tokenizer(line): | ||
if trace: | ||
logger_debug(" is_text:", is_text, "token_str:", repr(token_str)) | ||
|
||
# Determine whether a token is known in the license index. This | ||
# is essential as we need to realign the query-time tokenization | ||
# with the full text to report proper matches. | ||
if is_text and token_str and token_str.strip(): | ||
|
||
# we retokenize using the query tokenizer: | ||
# 1. to look up known tokens in the index dictionary | ||
|
||
# 2. to ensure the number of tokens is the same in both | ||
# tokenizers (though, of course, the case will differ as the | ||
# regular query tokenizer ignores case and punctuation). | ||
qtokenized = list(index_tokenizer(token_str)) | ||
if not qtokenized: | ||
|
||
yield Token( | ||
value=token_str, | ||
line_num=line_num, | ||
is_text=is_text, | ||
is_known=False, | ||
pos=-1, | ||
) | ||
|
||
elif len(qtokenized) == 1: | ||
is_known = qtokenized[0] in dictionary | ||
if is_known: | ||
p = pos | ||
pos += 1 | ||
else: | ||
p = -1 | ||
|
||
yield Token( | ||
value=token_str, | ||
line_num=line_num, | ||
is_text=is_text, | ||
is_known=is_known, | ||
pos=p, | ||
) | ||
else: | ||
# we have two or more tokens from the original query mapped | ||
# to a single matched text tokenizer token. | ||
for qtoken in qtokenized: | ||
is_known = qtoken in dictionary | ||
if is_known: | ||
p = pos | ||
pos += 1 | ||
else: | ||
p = -1 | ||
|
||
yield Token( | ||
value=qtoken, | ||
line_num=line_num, | ||
is_text=is_text, | ||
is_known=is_known, | ||
pos=p, | ||
) | ||
else: | ||
|
||
yield Token( | ||
value=token_str, | ||
line_num=line_num, | ||
is_text=False, | ||
is_known=False, | ||
pos=-1, | ||
) |
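The key idea in `tokenize_matched_text` is that only known tokens consume a position, so `pos` lines up with the positions used by match spans, and a token can then collect the ids of every match that covers it. A simplified, self-contained sketch of that idea follows. It does not use `licensedcode`: the regex tokenizer, the `dataclass`-based `Token`, and the `tag_matches` helper are stand-ins invented for illustration, and the match position sets are made up to mirror the GPL/MIT example in the `Token` docstring.

```python
import re
from dataclasses import dataclass, field

# Stand-in tokenizer (assumption): word runs are text, everything else is not.
_CHUNKS = re.compile(r"(\w+)|(\W+)")


@dataclass
class Token:
    value: str
    line_num: int
    pos: int = -1
    is_text: bool = False
    is_known: bool = False
    match_ids: list = field(default_factory=list)


def tokenize(text, dictionary, start_line=1):
    """Yield Tokens; only known text tokens are given a position."""
    pos = 0
    for line_num, line in enumerate(text.splitlines(keepends=True), start_line):
        for chunk in _CHUNKS.finditer(line):
            word = chunk.group(1)
            is_known = word is not None and word.lower() in dictionary
            yield Token(
                value=chunk.group(0),
                line_num=line_num,
                pos=pos if is_known else -1,
                is_text=word is not None,
                is_known=is_known,
            )
            if is_known:
                pos += 1


def tag_matches(tokens, positions_by_match_id):
    """Append each match id whose position set contains the token's pos."""
    for token in tokens:
        for match_id, positions in positions_by_match_id.items():
            if token.pos in positions:
                token.match_ids.append(match_id)
    return tokens


dictionary = {"this", "is", "licensed", "under", "gpl", "or", "mit"}
tokens = list(tokenize("this is licensed under GPL or MIT", dictionary))
# Match 1 covers positions 0-4, match 2 covers positions 2-6.
tagged = tag_matches(tokens, {1: set(range(0, 5)), 2: set(range(2, 7))})
by_word = {t.value: t.match_ids for t in tagged if t.is_text}
```

Here `by_word` maps "this" and "is" to match 1 only, "licensed", "under", and "GPL" to both matches, and "or" and "MIT" to match 2 only. Unknown words and whitespace keep `pos == -1` and are never tagged.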
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
<div class="modal license-details-modal"> | ||
<div class="modal-background"></div> | ||
<div class="modal-card" style="margin-top: 10vh"> | ||
<header class="modal-card-head"> | ||
<p class="modal-card-title">{{ license.license_expression }}</p> | ||
<button class="delete license-details-close-modal" aria-label="close"></button> | ||
</header> | ||
<section class="modal-card-body is-4by4"> | ||
<table class="table is-striped is-hoverable is-fullwidth is-size-6"> | ||
<tbody> | ||
<tr> | ||
<td><strong>Score</strong></td> | ||
<td>{{ license.score }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Matched Line(s)</strong></td> | ||
<td>{% if license.start_line == license.end_line %} {{ license.start_line }} {% else %} {{ license.start_line }} - {{ license.end_line }} {% endif %}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Rule Identifier</strong></td> | ||
<td> | ||
{% if license.rule_text_url %} | ||
<a href="{{ license.rule_text_url }}" target="_blank">{{ license.rule_identifier }}</a> | ||
{% else %} | ||
{{ license.rule_identifier }} | ||
{% endif %} | ||
</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Matcher</strong></td> | ||
<td>{{ license.matcher }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Match Coverage</strong></td> | ||
<td>{{ license.match_coverage }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Matched Length</strong></td> | ||
<td>{{ license.matched_length }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Key(s)</strong></td> | ||
<td> | ||
{% for key in license.licenses %} | ||
<a href="{{ key.reference_url }}" target="_blank"><span class="mr-2">{{ key.key }}</span></a> | ||
{% endfor %} | ||
</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Rule Relevance</strong></td> | ||
<td>{{ license.rule_relevance }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Rule Length</strong></td> | ||
<td>{{ license.rule_length }}</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
</section> | ||
<footer class="modal-card-foot"> | ||
<button class="button is-outlined has-text-weight-semibold"> | ||
{% include 'scantext/includes/license_report.html' with license=license %} | ||
</button> | ||
</footer> | ||
</div> | ||
</div> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
<a class="has-text-danger" href="https://github.com/nexB/scancode.io/issues/new?labels=bug&title=License+detection+error+as+`{{ license.license_expression|pprint }}` | ||
&body=Detection+level+details%0A```python%0A{%0A%20%20%20%20score+:+{{ license.score }}+%0A%20%20%20%20start_line+:+{{ license.start_line }}+%0A%20%20%20%20end_line+:+{{ license.end_line }}+%0A%20%20%20%20matched_length+:+{{ license.matched_length }}+%0A%20%20%20%20match_coverage+:+{{ license.match_coverage }}+%0A%20%20%20%20rule_identifier+:+{{ license.rule_identifier }}%0A}%0A```+%0A%0AMatched+Text%0A```%0A{{ license.matched_text|urlencode }}%0A```+%0A%0AInput+Text%0A```%0A{{ license.matched_text|urlencode }}%0A```" target="_blank">Report on GitHub</a> | ||
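The link above assembles the issue query string by hand, with `%0A` and `+` written into the template. A possible alternative is to build the URL in Python and let `urllib.parse.urlencode` do the escaping, so newlines, `&`, and `#` in the matched text cannot break the link. The sketch below is only an illustration: `build_report_url`, `DETAIL_KEYS`, and the sample values are hypothetical. The template fills the "Input Text" section with `license.matched_text`; passing a separate `input_text` argument here is an assumption about the intent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL copied from the template; everything else here is illustrative.
NEW_ISSUE_URL = "https://github.com/nexB/scancode.io/issues/new"
FENCE = "`" * 3  # Markdown code fence, built here to keep the example readable
DETAIL_KEYS = (
    "score", "start_line", "end_line",
    "matched_length", "match_coverage", "rule_identifier",
)


def build_report_url(match, input_text):
    """Return a pre-filled GitHub new-issue URL for a license match dict."""
    details = "\n".join(f"    {key} : {match[key]}" for key in DETAIL_KEYS)
    body = "\n".join([
        "Detection level details",
        f"{FENCE}python", "{", details, "}", FENCE,
        "",
        "Matched Text", FENCE, match["matched_text"], FENCE,
        "",
        "Input Text", FENCE, input_text, FENCE,
    ])
    title = f"License detection error as `{match['license_expression']}`"
    # urlencode percent-encodes newlines, '&', '#', and backticks.
    query = urlencode({"labels": "bug", "title": title, "body": body})
    return f"{NEW_ISSUE_URL}?{query}"


# Made-up sample values, for demonstration only.
example = {
    "license_expression": "mit OR gpl-2.0",
    "score": 99.0, "start_line": 1, "end_line": 3,
    "matched_length": 42, "match_coverage": 100.0,
    "rule_identifier": "mit_or_gpl.RULE",
    "matched_text": "MIT & GPL\n#2",
}
url = build_report_url(example, "full pasted text")
decoded = parse_qs(urlsplit(url).query)  # round-trips title and body intact
```

The view could pass the resulting URL to the template as a context variable, which would keep the template free of hand-written escapes.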
@tdruez here it is.

This is not a "design discussion", this is code...

😕 Discussed only in the meet.

Reporting license details should be done at

The project idea says