
Commit 1275d07

Ignore large data files and bump scancode-toolkit (#1508)
* Ignore scanning large data files

  Ignore scanning large data files which are larger than 1 MB to avoid
  crashing scans on memory spikes. Also rollback #1504

  Reference: aboutcode-org/scancode-toolkit#3711

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Bump scancode-toolkit to version v32.3.1

  Also remove platform constraints from rust-inspector and go-inspector.

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Increase size limit to skip scanning data file

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add a scancodeio setting SCANCODEIO_SCAN_MAX_FILE_SIZE

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Use scancode with conda bugfix

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address feedback

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Bump scancode-toolkit to v32.3.2

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add scan_max_file_size to project settings

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Update CHANGELOG and docs on project settings

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add scan_max_file_size to project settings UI

  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

---------

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Co-authored-by: tdruez <tdruez@nexb.com>
1 parent 70deb20 commit 1275d07

15 files changed: +141 -16 lines changed

CHANGELOG.rst

Lines changed: 10 additions & 0 deletions
@@ -58,6 +58,16 @@ v34.9.4 (unreleased)
   checkbox on paginated list.
   https://github.com/aboutcode-org/scancode.io/issues/1524

+- Update scancode-toolkit to v32.3.2. See CHANGELOG for updates:
+  https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.3.2
+  https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.3.1
+
+- Add a project setting ``scan_max_file_size`` and a scancode.io setting
+  ``SCANCODEIO_SCAN_MAX_FILE_SIZE`` to skip scanning files above a certain
+  file size (in bytes) as a temporary fix for large memory spikes while
+  scanning for licenses in certain large files.
+  https://github.com/aboutcode-org/scancode-toolkit/issues/3711
+
 v34.9.3 (2024-12-31)
 --------------------

docs/application-settings.rst

Lines changed: 12 additions & 0 deletions
@@ -165,6 +165,18 @@ The value unit is second and is defined as an integer::

 Default: ``120`` (2 minutes)

+SCANCODEIO_SCAN_MAX_FILE_SIZE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Maximum file size allowed for a file to be scanned when scanning a codebase.
+
+The value unit is bytes and is defined as an integer, see the following
+example of setting this at 5 MB::
+
+    SCANCODEIO_SCAN_MAX_FILE_SIZE=5242880
+
+Default: ``None`` (all files will be scanned)
+
 .. _scancodeio_settings_pipelines_dirs:

 SCANCODEIO_PIPELINES_DIRS
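
The 5 MB value used in the documentation above follows from 1 MB = 1024 * 1024 bytes. A minimal sketch of that arithmetic, where `megabytes_to_bytes` is a hypothetical helper written for illustration and not part of ScanCode.io:

```python
def megabytes_to_bytes(megabytes: int) -> int:
    """Convert binary megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    # Hypothetical helper for illustration only; ScanCode.io expects the
    # setting to be given directly as an integer number of bytes.
    return megabytes * 1024 * 1024


# Build the line that would go into the .env file for a 5 MB limit.
print(f"SCANCODEIO_SCAN_MAX_FILE_SIZE={megabytes_to_bytes(5)}")
```

The printed line matches the `SCANCODEIO_SCAN_MAX_FILE_SIZE=5242880` example in the added documentation.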

docs/project-configuration.rst

Lines changed: 19 additions & 0 deletions
@@ -54,6 +54,7 @@ Content of a ``scancode-config.yml`` file:
     ignored_patterns:
      - '*.tmp'
      - 'tests/*'
+    scan_max_file_size: 5242880
     ignored_dependency_scopes:
      - package_type: npm
        scope: devDependencies
@@ -86,6 +87,24 @@ product_version

 The product version of this project, as specified within the DejaCode application.

+scan_max_file_size
+^^^^^^^^^^^^^^^^^^
+
+Maximum file size allowed for a file to be scanned when scanning a codebase.
+
+The value unit is bytes and is defined as an integer, see the following
+example of setting this at 5 MB::
+
+    scan_max_file_size: 5242880
+
+Default is ``None``, in which case all files will be scanned.
+
+.. note::
+    This is the same as the scancodeio setting ``SCANCODEIO_SCAN_MAX_FILE_SIZE``
+    set using the .env file, and the project setting ``scan_max_file_size`` takes
+    precedence over the scancodeio setting ``SCANCODEIO_SCAN_MAX_FILE_SIZE``.
+
+
 ignored_patterns
 ^^^^^^^^^^^^^^^^

scancodeio/settings.py

Lines changed: 3 additions & 0 deletions
@@ -100,6 +100,9 @@
 # Default to 2 minutes.
 SCANCODEIO_SCAN_FILE_TIMEOUT = env.int("SCANCODEIO_SCAN_FILE_TIMEOUT", default=120)

+# Default to None which scans all files
+SCANCODEIO_SCAN_MAX_FILE_SIZE = env.int("SCANCODEIO_SCAN_MAX_FILE_SIZE", default=None)
+
 # List views pagination, controls the number of items displayed per page.
 # Syntax in .env: SCANCODEIO_PAGINATE_BY=project=10,project_error=10
 SCANCODEIO_PAGINATE_BY = env.dict(

scanpipe/forms.py

Lines changed: 10 additions & 0 deletions
@@ -437,6 +437,7 @@ class ProjectSettingsForm(forms.ModelForm):
             "ignored_vulnerabilities",
             "policies",
             "attribution_template",
+            "scan_max_file_size",
             "product_name",
             "product_version",
         ]
@@ -511,6 +512,15 @@ class ProjectSettingsForm(forms.ModelForm):
         ),
         widget=forms.Textarea(attrs={"class": "textarea is-dynamic", "rows": 3}),
     )
+    scan_max_file_size = forms.IntegerField(
+        label="Max file size to scan",
+        required=False,
+        help_text=(
+            "Maximum file size in bytes for a file to be scanned; larger files are "
+            "skipped. Example: 5 MB is 5242880 bytes."
+        ),
+        widget=forms.NumberInput(attrs={"class": "input"}),
+    )
     product_name = forms.CharField(
         label="Product name",
         required=False,

scanpipe/models.py

Lines changed: 10 additions & 0 deletions
@@ -920,6 +920,16 @@ def get_ignored_dependency_scopes_index(self):

         return dict(ignored_scope_index)

+    @cached_property
+    def get_scan_max_file_size(self):
+        """
+        Return the ``scan_max_file_size`` settings value defined in this
+        Project env.
+        """
+        scan_max_file_size = self.get_env(field_name="scan_max_file_size")
+        if scan_max_file_size:
+            return scan_max_file_size
+
     @cached_property
     def ignored_dependency_scopes_index(self):
         """

scanpipe/pipes/flag.py

Lines changed: 15 additions & 0 deletions
@@ -20,6 +20,7 @@
 # ScanCode.io is a free software code scanning tool from nexB Inc. and others.
 # Visit https://github.com/aboutcode-org/scancode.io for support and download.

+
 NO_STATUS = ""

 SCANNED = "scanned"
@@ -43,6 +44,7 @@
 IGNORED_DEFAULT_IGNORES = "ignored-default-ignores"
 IGNORED_DATA_FILE_NO_CLUES = "ignored-data-file-no-clues"
 IGNORED_DOC_FILE = "ignored-doc-file"
+IGNORED_BY_MAX_FILE_SIZE = "ignored-by-max-file-size"

 COMPLIANCE_LICENSES = "compliance-licenses"
 COMPLIANCE_SOURCEMIRROR = "compliance-sourcemirror"
@@ -102,6 +104,19 @@ def flag_ignored_patterns(project, patterns):
     return update_count


+def flag_and_ignore_files_over_max_size(resource_qs, file_size_limit):
+    """
+    Flag codebase resources at or over the max file size for scanning and
+    return the flagged count (or ``resource_qs`` unchanged when no limit is set).
+    """
+    if not file_size_limit:
+        return resource_qs
+
+    return resource_qs.filter(size__gte=file_size_limit).update(
+        status=IGNORED_BY_MAX_FILE_SIZE
+    )
+
+
 def analyze_scanned_files(project):
     """Set the status for CodebaseResource to unknown or no license."""
     scanned_files = project.codebaseresources.files().status(SCANNED)
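
The flag-then-filter behaviour added in this file can be illustrated without Django. The following is only a sketch under stated assumptions: `Resource` and `flag_and_select` are hypothetical stand-ins for the `CodebaseResource` queryset and the real function, and the `>=` comparison mirrors the `size__gte` lookup used in the diff.

```python
from dataclasses import dataclass

IGNORED_BY_MAX_FILE_SIZE = "ignored-by-max-file-size"


@dataclass
class Resource:
    """Hypothetical stand-in for a CodebaseResource row."""

    path: str
    size: int
    status: str = ""


def flag_and_select(resources, file_size_limit):
    """Flag resources at or over the limit, return (flagged_count, to_scan)."""
    resources = list(resources)
    if not file_size_limit:
        # No limit configured: nothing is flagged and everything is scanned.
        return 0, resources

    flagged_count = 0
    for resource in resources:
        # Same comparison as the ``size__gte`` lookup in the diff.
        if resource.size >= file_size_limit:
            resource.status = IGNORED_BY_MAX_FILE_SIZE
            flagged_count += 1

    to_scan = [r for r in resources if r.status != IGNORED_BY_MAX_FILE_SIZE]
    return flagged_count, to_scan
```

Unlike the real function, which returns either an update count or the untouched queryset, this sketch always returns both values so the caller does not need to check the limit again.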

scanpipe/pipes/scancode.py

Lines changed: 25 additions & 10 deletions
@@ -34,6 +34,7 @@
 from django.apps import apps
 from django.conf import settings
 from django.db.models import ObjectDoesNotExist
+from django.db.models import Q

 from commoncode import fileutils
 from commoncode.resource import VirtualCodebase
@@ -58,6 +59,7 @@
 Utilities to deal with ScanCode toolkit features and objects.
 """

+
 scanpipe_app = apps.get_app_config("scanpipe")


@@ -291,6 +293,7 @@ def scan_resources(
     save_func,
     scan_func_kwargs=None,
     progress_logger=None,
+    file_size_limit=None,
 ):
     """
     Run the `scan_func` on the codebase resources of the provided `resource_qs`.
@@ -310,9 +313,21 @@ def scan_resources(
     if not scan_func_kwargs:
         scan_func_kwargs = {}

-    resource_count = resource_qs.count()
+    # Skip scanning files larger than the specified max size
+    skipped_files_max_size = flag.flag_and_ignore_files_over_max_size(
+        resource_qs=resource_qs,
+        file_size_limit=file_size_limit,
+    )
+    if file_size_limit and skipped_files_max_size:
+        logger.info(
+            f"Skipped {skipped_files_max_size} files over the size of {file_size_limit}"
+        )
+
+    scan_resource_qs = resource_qs.filter(~Q(status=flag.IGNORED_BY_MAX_FILE_SIZE))
+
+    resource_count = scan_resource_qs.count()
     logger.info(f"Scan {resource_count} codebase resources with {scan_func.__name__}")
-    resource_iterator = resource_qs.iterator(chunk_size=2000)
+    resource_iterator = scan_resource_qs.iterator(chunk_size=2000)
     progress = LoopProgress(resource_count, logger=progress_logger)
     max_workers = get_max_workers(keep_available=1)

@@ -350,14 +365,7 @@ def scan_resources(
                     "Please ensure that there is at least 2 GB of available memory per "
                     "CPU core for successful execution."
                 )
-
-                resource.project.add_error(
-                    exception=broken_pool_error,
-                    model="scan_resources",
-                    description=message,
-                    object_instance=resource,
-                )
-                continue
+                raise broken_pool_error from InsufficientResourcesError(message)

             save_func(resource, scan_results, scan_errors)

@@ -374,11 +382,18 @@ def scan_for_files(project, resource_qs=None, progress_logger=None):
     if resource_qs is None:
         resource_qs = project.codebaseresources.no_status()

+    # Get max file size limit set in project settings, or alternatively
+    # get it from scancodeio settings
+    file_size_limit = project.get_scan_max_file_size
+    if not file_size_limit:
+        file_size_limit = settings.SCANCODEIO_SCAN_MAX_FILE_SIZE
+
     scan_resources(
         resource_qs=resource_qs,
         scan_func=scan_file,
         save_func=save_scan_file_results,
         progress_logger=progress_logger,
+        file_size_limit=file_size_limit,
     )


scanpipe/templates/scanpipe/project_settings.html

Lines changed: 12 additions & 0 deletions
@@ -114,6 +114,18 @@
           </div>
         </div>

+        <div class="field">
+          <label class="label" for="{{ form.scan_max_file_size.id_for_label }}">
+            {{ form.scan_max_file_size.label }}
+          </label>
+          <div class="control">
+            {{ form.scan_max_file_size }}
+          </div>
+          <div class="help">
+            {{ form.scan_max_file_size.help_text|safe|linebreaksbr }}
+          </div>
+        </div>
+
         <div class="field">
           <label class="label" for="{{ form.ignored_dependency_scopes.id_for_label }}">
             {{ form.ignored_dependency_scopes.label }}

scanpipe/tests/data/manifests/openpdf-parent-1.3.11_scan_package.json

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
       "errors": [],
       "warnings": [],
       "extra_data": {
-        "spdx_license_list_version": "3.25",
+        "spdx_license_list_version": "3.26",
         "files_count": 1
       }
     }
