Commit 4939a9b

Improve copyright detection more
Empty lines can break the continuity of a notice.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
1 parent b56b961 commit 4939a9b

9 files changed: +34 -28 lines changed

src/cluecode/copyrights.py

Lines changed: 19 additions & 5 deletions
@@ -381,19 +381,30 @@ def get_tokens(numbered_lines, splitter=re.compile(r'[\t =;]+').split):
 
     We perform a simple tokenization on spaces, tabs and some punctuation: =;
     """
+    last_line = ""
     for start_line, line in numbered_lines:
         pos = 0
 
         if TRACE_TOK:
             logger_debug(' get_tokens: bare line: ' + repr(line))
 
-        # if not line.strip():
-        #     yield Token(value="\n", label="EMPTY_LINE", start_line=start_line, pos=pos)
-        #     pos += 1
-        #     continue
+        if not line.strip():
+            stripped = last_line.lower().strip(string.punctuation)
+            if (
+                stripped.startswith("copyright")
+                or stripped.endswith(("by", "copyright","0", "1", "2", "3", "4", "5", "6", "7", "8", "9"))
+            ):
+                continue
+            else:
+                yield Token(value="\n", label="EMPTY_LINE", start_line=start_line, pos=pos)
+                pos += 1
+                last_line = ""
+                continue
 
         line = prepare_text_line(line)
 
+        last_line = line
+
         if TRACE_TOK:
             logger_debug(' get_tokens: preped line: ' + repr(line))
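
The blank-line rule in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: the helper names `blank_line_continues_notice` and `mark_empty_lines` are hypothetical, plain strings stand in for `Token` objects, and `line.strip()` stands in for `prepare_text_line()`, whose behavior is not shown in this diff.

```python
import string

# Suffixes that suggest a notice carries on past a blank line
# (same values as the tuple in the hunk above).
CONTINUATION_SUFFIXES = ("by", "copyright") + tuple("0123456789")


def blank_line_continues_notice(last_line):
    """Return True if a blank line following `last_line` should be skipped."""
    stripped = last_line.lower().strip(string.punctuation)
    return stripped.startswith("copyright") or stripped.endswith(CONTINUATION_SUFFIXES)


def mark_empty_lines(lines):
    """Yield non-blank lines, plus an "EMPTY_LINE" marker for each blank line
    that is not skipped by the continuation rule."""
    last_line = ""
    for line in lines:
        if not line.strip():
            if blank_line_continues_notice(last_line):
                # Skip the blank line so the notice stays in one piece.
                continue
            yield "EMPTY_LINE"
            last_line = ""
            continue
        # Assumption: strip() approximates what prepare_text_line() returns.
        last_line = line.strip()
        yield line


sample = ["Copyright 2020", "", "Acme Inc.", "", "Some other text"]
print(list(mark_empty_lines(sample)))
```

In this sample the blank line after `Copyright 2020` is dropped because the previous line starts with "copyright", while the blank line after `Acme Inc.` is kept as a marker.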

@@ -922,6 +933,7 @@ def build_detection_from_node(
     (r'DeclareUnicodeCharacter$', 'JUNK'),
     (r'^Language-Team$', 'JUNK'),
     (r'^Last-Translator$', 'JUNK'),
+    (r'^Translated$', 'JUNK'),
     (r'^OMAP730$', 'JUNK'),
     (r'^Law\.$', 'JUNK'),
     (r'^dylid$', 'JUNK'),
@@ -1601,7 +1613,7 @@ def build_detection_from_node(
     (r'^Branched$', 'NN'),
 
     (r'^Improved$', 'NN'),
-    (r'^Designe[dr]$', 'NN'),
+    (r'^Designed$', 'NN'),
     (r'^Organised$', 'NN'),
     (r'^Re-organised$', 'NN'),
     (r'^Swap$', 'NN'),
@@ -1900,6 +1912,8 @@ def build_detection_from_node(
 
     # Various rare company names/suffix
     (r'^FedICT$', 'COMPANY'),
+    (r'^10gen$', 'COMPANY'),
+
 
     # Division, District
     (r'^(District|Division)\)?[,\.]?$', 'COMP'),
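
The effect of the three pattern changes above can be checked with plain `re`. This is a rough sketch under stated assumptions: `label_for` is a hypothetical helper, the table holds only the three entries touched by this commit, and returning the label of the first matching pattern is an assumption about how the full table is consumed.

```python
import re

# Only the entries touched by this commit; the real table is far longer.
PATTERNS = [
    (r'^Translated$', 'JUNK'),
    (r'^Designed$', 'NN'),
    (r'^10gen$', 'COMPANY'),
]


def label_for(token, patterns=PATTERNS, default=None):
    """Return the label of the first pattern that matches `token`, else `default`."""
    for pattern, label in patterns:
        if re.match(pattern, token):
            return label
    return default


for token in ("Translated", "Designed", "Designer", "10gen"):
    print(token, label_for(token))

# The replaced pattern also matched "Designer"; the new one does not.
print(bool(re.match(r'^Designe[dr]$', 'Designer')))
print(bool(re.match(r'^Designed$', 'Designer')))
```

In the full table `Designer` may still be picked up by some other pattern; the sketch only shows that it no longer matches this particular `NN` entry.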

src/licensedcode/data/licenses/ko-man-page.LICENSE

Lines changed: 1 addition & 5 deletions
@@ -6,10 +6,6 @@ category: Permissive
 owner: Korean Manpage Project
 homepage_url: https://git.centos.org/rpms/man-pages-ko/blob/c7/f/SOURCES/Man_Page_Copyright
 spdx_license_key: LicenseRef-scancode-ko-man-page
-ignorable_copyrights:
-    - Copyright Man Page
-ignorable_holders:
-    - Man Page
 ---
 
 Translated Copyright Man Page
@@ -26,4 +22,4 @@ The copyrights of all translated manpages in the Korean Manpage Project are inhe
 
 Exception
 
-It is possible the documents on this site may contain false information due to a technical error or mistranslation. However, Korean Manpage Project does not guarantee anything even in this case. If there is false information, please let administrator know or report the error to the appropriate place on the homepage. The documents of this site are subject to change, delete, or move without notice due to error correction of the documents.
+It is possible the documents on this site may contain false information due to a technical error or mistranslation. However, Korean Manpage Project does not guarantee anything even in this case. If there is false information, please let administrator know or report the error to the appropriate place on the homepage. The documents of this site are subject to change, delete, or move without notice due to error correction of the documents.

src/licensedcode/data/licenses/mediatek-proprietary-2008.LICENSE

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ owner: MediaTek
 homepage_url: https://github.com/LIFECorp/mediatek/blob/master/external/mp3dec/mp3dec_exp.h
 spdx_license_key: LicenseRef-scancode-mediatek-proprietary-2008
 ignorable_copyrights:
-    - MediaTek Inc. (c) 2008 BY OPENING
+    - MediaTek Inc. (c) 2008
 ignorable_holders:
-    - MediaTek Inc. BY OPENING
+    - MediaTek Inc.
 ---
 
 This software is protected by Copyright and the information contained
@@ -40,4 +40,4 @@ THE TRANSACTION CONTEMPLATED HEREUNDER SHALL BE CONSTRUED IN ACCORDANCE
 WITH THE LAWS OF THE STATE OF CALIFORNIA, USA, EXCLUDING ITS CONFLICT OF
 LAWS PRINCIPLES. ANY DISPUTES, CONTROVERSIES OR CLAIMS ARISING THEREOF AND
 RELATED THERETO SHALL BE SETTLED BY ARBITRATION IN SAN FRANCISCO, CA, UNDER
-THE RULES OF THE INTERNATIONAL CHAMBER OF COMMERCE (ICC).
+THE RULES OF THE INTERNATIONAL CHAMBER OF COMMERCE (ICC).

src/licensedcode/data/licenses/qhull.LICENSE

Lines changed: 1 addition & 1 deletion
@@ -38,4 +38,4 @@ This software includes Qhull from The Geometry Center. Qhull is copyrighted as n
 
 4. When distributing modified versions of Qhull, or other software products that include Qhull, you must provide notice that the original source code may be obtained as noted above.
 
-5. There is no warranty or other guarantee of fitness for Qhull, it is provided solely "as is". Bug reports or fixes may be sent to qhull_bug@qhull.org; the authors may or may not act on them as they desire.
+5. There is no warranty or other guarantee of fitness for Qhull, it is provided solely "as is". Bug reports or fixes may be sent to qhull_bug@qhull.org; the authors may or may not act on them as they desire.

src/licensedcode/data/licenses/vhfpl-1.1.LICENSE

Lines changed: 3 additions & 3 deletions
@@ -10,10 +10,10 @@ notes: The original name of Cenon was vhf interservice GmbH at http://www.vhf-in
 spdx_license_key: LicenseRef-scancode-vhfpl-1.1
 ignorable_copyrights:
     - Copyright (c) 2003 vhf interservice GmbH service@vhf.de
-    - Copyright (c) 2003/2004 vhf interservice GmbH, Im Marxle 3
+    - Copyright (c) 2003/2004 vhf interservice GmbH, Im Marxle
 ignorable_holders:
     - vhf interservice GmbH
-    - vhf interservice GmbH, Im Marxle 3
+    - vhf interservice GmbH, Im Marxle
 ignorable_emails:
     - service@vhf.de
 ---
@@ -120,4 +120,4 @@ INCLUDING THE WARRANTY OF DESIGN, MERCHANTABILITY AND FITNESS FOR A
 PARTICULAR PURPOSE.
 
 ----------------------------------------------------------------------------
-Copyright (C) 2003 vhf interservice GmbH service@vhf.de
+Copyright (C) 2003 vhf interservice GmbH service@vhf.de

src/licensedcode/data/rules/sgi-freeb-2.0_3.RULE

Lines changed: 2 additions & 2 deletions
@@ -4,12 +4,12 @@ is_license_reference: yes
 ignorable_copyrights:
     - Copyright (c) September 18, 2008 Silicon Graphics, Inc.
 ignorable_holders:
-    - September 18, Silicon Graphics, Inc.
+    - September Silicon Graphics, Inc.
 ignorable_urls:
     - http://oss.sgi.com/projects/FreeB/
 ---
 
 SGI FREE SOFTWARE LICENSE B (Version 2.0, Sept. 18, 2008)
 * Copyright (C) September 18, 2008 Silicon Graphics, Inc. All Rights Reserved.
 * This document is licensed under the SGI Free Software B License Version
-* 2.0. For details, see http://oss.sgi.com/projects/FreeB/ .
+* 2.0. For details, see http://oss.sgi.com/projects/FreeB/ .

src/licensedcode/data/rules/uoi-ncsa_and_other.RULE

Lines changed: 1 addition & 5 deletions
@@ -2,10 +2,6 @@
 license_expression: uoi-ncsa AND other-permissive AND other-copyleft
 is_license_notice: yes
 minimum_coverage: 95
-ignorable_copyrights:
-    - Copyright ----- The LLVM project
-ignorable_holders:
-    - The LLVM project
 ignorable_urls:
     - http://www.opensource.org/licenses/UoI-NCSA.php
     - http://www.opensource.org/licenses/mit-license.php
@@ -121,4 +117,4 @@ you or your employer own the rights to a patent and would like to contribute
 code to LLVM that relies on it, we require that the copyright owner sign an
 agreement that allows any other user of LLVM to freely use your patent. Please
 contact the `oversight group <mailto:llvm-oversight@cs.uiuc.edu>`_ for more
-details.
+details.

tests/packagedcode/data/debian/copyright/debian-slim-2021-04-07/usr/share/doc/libcrypt1/copyright-detailed.expected.yml

Lines changed: 2 additions & 2 deletions
@@ -498,10 +498,10 @@ copyright: |
 Copyright Maarten Bosmans FSF
 Copyright Guido U. Draheim, Maarten Bosmans FSF
 Copyright Mike Frysinger FSF
-Copyright Scott James Remnant, Dan Nicholson GPL
+Copyright Scott James Remnant, Dan Nicholson
 Copyright Tim Toolan FSF
 Copyright Philip Withnall FSF
-Copyright Steven G. Johnson, Daniel Richard G. GPL
+Copyright Steven G. Johnson, Daniel Richard G.
 Copyright Francesco Salvestrini FSF
 Copyright Andrew Collier FSF
 Copyright 2002, 2003, 2004 SuSE Linux AG, Germany

tests/summarycode/data/tallies/end-2-end/bug-1141.expected.json

Lines changed: 2 additions & 2 deletions
@@ -68,11 +68,11 @@
 "count": 2
 },
 {
-"value": "Copyright (c) - Members of the Gmerlin project",
+"value": "Copyright (c) Members of the Gmerlin project",
 "count": 1
 },
 {
-"value": "Copyright (c) - Members of the Gmerlin project gmerlin-general@lists.sourceforge.net http://gmerlin.sourceforge.net",
+"value": "Copyright (c) Members of the Gmerlin project gmerlin-general@lists.sourceforge.net http://gmerlin.sourceforge.net",
 "count": 1
 }
 ],
