Commit 1af8d99

Addon pipeline for source string collection (#1160)
* Add addon pipeline for string collection

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Add test for collect_source_strings pipeline

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update dockerfile to install xgettext

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update CI to install xgettext

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update docs

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Only supported on Linux

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>

* Only supported on Linux

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>

* Add CHANGELOG for CollectSourceStrings pipeline

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

---------

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
1 parent d6389b2 commit 1af8d99

10 files changed: +228 -5 lines changed

.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@ jobs:

       - name: Install universal ctags
         run: sudo apt-get install -y universal-ctags
+
+      - name: Install xgettext
+        run: sudo apt-get install -y gettext

       - name: Install dependencies
         run: make dev envfile

CHANGELOG.rst

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ v34.3.0 (unreleased)
 - Associate resolved packages with their source codebase resource.
   https://github.com/nexB/scancode.io/issues/1140

+- Add a new `CollectSourceStrings` pipeline (addon) for collecting source string using
+  xgettext.
+  https://github.com/nexB/scancode.io/pull/1160
+
 v34.2.0 (2024-03-28)
 --------------------

Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -40,7 +40,7 @@ ENV PYTHONPATH $PYTHONPATH:$APP_DIR

 # OS requirements as per
 # https://scancode-toolkit.readthedocs.io/en/latest/getting-started/install.html
-# Also install universal-ctags for symbol collection.
+# Also install universal-ctags and xgettext for symbol and string collection.
 RUN apt-get update \
  && apt-get install -y --no-install-recommends \
        bzip2 \
@@ -60,6 +60,7 @@ RUN apt-get update \
        git \
        wait-for-it \
        universal-ctags \
+       gettext \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

docs/built-in-pipelines.rst

Lines changed: 8 additions & 0 deletions
@@ -42,6 +42,14 @@ Analyse Docker Windows Image
     :members:
     :member-order: bysource

+.. _pipeline_collect_source_strings:
+
+Collect Source Strings (addon)
+--------------------------------
+.. autoclass:: scanpipe.pipelines.collect_source_strings.CollectSourceStrings()
+    :members:
+    :member-order: bysource
+
 .. _pipeline_collect_symbols:

 Collect Codebase Symbols (addon)

docs/installation.rst

Lines changed: 15 additions & 4 deletions
@@ -261,13 +261,24 @@ See also `ScanCode-toolkit Prerequisites <https://scancode-toolkit.readthedocs.i
 latest/getting-started/install.html#prerequisites>`_ for more details.

 For the :ref:`pipeline_collect_symbols` pipeline, `Universal Ctags <https://github.com/universal-ctags/ctags>`_ is needed.
-On **Linux** install it using::

-    sudo apt-get install universal-ctags
+* On **Linux** install it using::

-On **MacOS** install Universal Ctags using Homebrew::
+    sudo apt-get install universal-ctags

-    brew install universal-ctags
+* On **MacOS** install Universal Ctags using Homebrew::
+
+    brew install universal-ctags
+
+For the :ref:`pipeline_collect_source_strings` pipeline, `gettext <https://www.gnu.org/software/gettext/>`_ is needed.
+
+* On **Linux** install it using::
+
+    sudo apt-get install gettext
+
+* On **MacOS** install gettext using Homebrew::
+
+    brew install gettext

 Clone and Configure
 ^^^^^^^^^^^^^^^^^^^
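
The gettext prerequisite documented in this hunk can be checked from Python before a pipeline run. The following is a minimal sketch, assuming that finding an `xgettext` executable on `PATH` is a good enough availability test; the helper name `find_xgettext` is hypothetical and is not part of ScanCode.io or source-inspector.

```python
import shutil


def find_xgettext(executable="xgettext"):
    """Return the full path to the executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_xgettext()
    if location:
        print(f"xgettext found at {location}")
    else:
        print("xgettext not found; install the gettext package first")
```

A check like this only confirms that the binary can be located, not that it runs correctly.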
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/nexB/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/nexB/scancode.io for support and download.
+
+from scanpipe.pipelines import Pipeline
+from scanpipe.pipes import source_strings
+
+
+class CollectSourceStrings(Pipeline):
+    """Collect source strings from codebase files and keep them in extra data field."""
+
+    download_inputs = False
+    is_addon = True
+
+    @classmethod
+    def steps(cls):
+        return (cls.collect_and_store_resource_strings,)
+
+    def collect_and_store_resource_strings(self):
+        """
+        Collect source strings from codebase files using gettext and store
+        them in the extra data field.
+        """
+        source_strings.collect_and_store_resource_strings(self.project, self.log)
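
The pipeline above declares its work as a tuple of step methods returned by the `steps()` classmethod. The sketch below illustrates that pattern in isolation. `MiniPipeline`, its `execute` loop, and `CollectStrings` are hypothetical stand-ins written for this example; they are not ScanCode.io's `Pipeline` base class, whose actual behavior is not shown in this diff.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline base class (not ScanCode.io's)."""

    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append(message)

    @classmethod
    def steps(cls):
        raise NotImplementedError

    def execute(self):
        # steps() returns plain functions looked up on the class,
        # so each one is called with the instance as its only argument.
        for step in self.steps():
            self.log(f"Step [{step.__name__}] starting")
            step(self)
        return 0


class CollectStrings(MiniPipeline):
    @classmethod
    def steps(cls):
        return (cls.collect,)

    def collect(self):
        self.log("collecting strings")


pipeline = CollectStrings()
exitcode = pipeline.execute()
print(exitcode, pipeline.messages)
```

Returning the steps as a tuple of functions keeps the execution order explicit and lets a runner report each step by name.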

scanpipe/pipes/source_strings.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/nexB/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/nexB/scancode.io for support and download.
+
+from source_inspector import strings_xgettext
+
+from scanpipe.pipes import LoopProgress
+
+
+class XgettextNotFound(Exception):
+    pass
+
+
+def collect_and_store_resource_strings(project, logger=None):
+    """
+    Collect source strings from codebase files using xgettext and store
+    them in the extra data field.
+    """
+    if not strings_xgettext.is_xgettext_installed():
+        raise XgettextNotFound(
+            "``xgettext`` not found. Install ``gettext`` to use this pipeline."
+        )
+
+    project_files = project.codebaseresources.files()
+
+    resources = project_files.filter(
+        is_binary=False,
+        is_archive=False,
+        is_media=False,
+    )
+
+    resources_count = resources.count()
+
+    resource_iterator = resources.iterator(chunk_size=2000)
+    progress = LoopProgress(resources_count, logger)
+
+    for resource in progress.iter(resource_iterator):
+        _collect_and_store_resource_strings(resource)
+
+
+def _collect_and_store_resource_strings(resource):
+    """
+    Collect strings from a resource using xgettext and store
+    them in the extra data field.
+    """
+    result = strings_xgettext.collect_strings(resource.location)
+    strings = [item["string"] for item in result if "string" in item]
+    resource.update_extra_data({"source_strings": strings})
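
The last helper keeps only the `"string"` value of each collected item and skips items that lack that key. The sketch below reproduces that filtering step on made-up data. The record shape is an assumption: the diff only shows that items are mappings that may carry a `"string"` key, so the `line_numbers` field and the `extract_strings` name are hypothetical.

```python
def extract_strings(records):
    """Keep the "string" value of each record that has one."""
    return [item["string"] for item in records if "string" in item]


# Hypothetical records; only the optional "string" key is taken from the diff.
sample = [
    {"line_numbers": [3], "string": "Enter the desired length of your password:"},
    {"line_numbers": [7]},  # no "string" key, so this record is skipped
    {"string": ""},  # an empty string still has the key and is kept
]

strings = extract_strings(sample)
print(strings)
```

Because the test is key membership rather than truthiness, empty strings pass through; a caller that wants to drop them would need an extra filter.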
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/nexB/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/nexB/scancode.io for support and download.
+
+import sys
+from pathlib import Path
+from unittest import skipIf
+
+from django.test import TestCase
+
+from scanpipe import pipes
+from scanpipe.models import Project
+from scanpipe.pipes import source_strings
+from scanpipe.pipes.input import copy_input
+
+
+class ScanPipeSourceStringsPipesTest(TestCase):
+    data_location = Path(__file__).parent.parent / "data"
+
+    def setUp(self):
+        self.project1 = Project.objects.create(name="Analysis")
+
+    @skipIf(sys.platform != "linux", "Only supported on Linux")
+    def test_scanpipe_pipes_symbols_collect_and_store_resource_strings(self):
+        dir = self.project1.codebase_path / "codefile"
+        dir.mkdir(parents=True)
+
+        file_location = self.data_location / "d2d-javascript" / "from" / "main.js"
+        copy_input(file_location, dir)
+
+        pipes.collect_and_create_codebase_resources(self.project1)
+
+        source_strings.collect_and_store_resource_strings(self.project1)
+
+        main_file = self.project1.codebaseresources.files()[0]
+        result_extra_data_strings = main_file.extra_data.get("source_strings")
+
+        expected_extra_data_strings = [
+            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()_-+=",  # noqa
+            "Enter the desired length of your password:",
+        ]
+        self.assertCountEqual(expected_extra_data_strings, result_extra_data_strings)
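
The test compares expected and collected strings with `assertCountEqual`, which ignores element order but still counts duplicates. The standalone sketch below shows both properties using only the standard library; the `CountEqualExample` class is written for this illustration and is not part of the commit.

```python
import unittest


class CountEqualExample(unittest.TestCase):
    def test_order_is_ignored(self):
        # Same elements in a different order compare as equal.
        self.assertCountEqual(["a", "b"], ["b", "a"])

    def test_duplicates_are_counted(self):
        # An extra occurrence of "a" makes the comparison fail.
        with self.assertRaises(AssertionError):
            self.assertCountEqual(["a", "b"], ["a", "a", "b"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountEqualExample)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

This makes the assertion a reasonable fit when the order in which a tool reports strings is not guaranteed.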

scanpipe/tests/test_pipelines.py

Lines changed: 27 additions & 0 deletions
@@ -1240,3 +1240,30 @@ def test_scanpipe_collect_symbols_pipeline_integration(self):
         result_extra_data_symbols = main_file.extra_data.get("source_symbols")
         expected_extra_data_symbols = ["generatePassword", "passwordLength", "charSet"]
         self.assertCountEqual(expected_extra_data_symbols, result_extra_data_symbols)
+
+    @skipIf(sys.platform != "linux", "Only supported on Linux")
+    def test_scanpipe_collect_source_strings_pipeline_integration(self):
+        pipeline_name = "collect_source_strings"
+        project1 = Project.objects.create(name="Analysis")
+
+        dir = project1.codebase_path / "codefile"
+        dir.mkdir(parents=True)
+
+        file_location = self.data_location / "d2d-javascript" / "from" / "main.js"
+        copy_input(file_location, dir)
+
+        pipes.collect_and_create_codebase_resources(project1)
+
+        run = project1.add_pipeline(pipeline_name)
+        pipeline = run.make_pipeline_instance()
+
+        exitcode, out = pipeline.execute()
+        self.assertEqual(0, exitcode, msg=out)
+
+        main_file = project1.codebaseresources.files()[0]
+        result_extra_data_strings = main_file.extra_data.get("source_strings")
+        expected_extra_data_strings = [
+            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()_-+=",  # noqa
+            "Enter the desired length of your password:",
+        ]
+        self.assertCountEqual(expected_extra_data_strings, result_extra_data_strings)

setup.cfg

Lines changed: 1 addition & 0 deletions
@@ -131,6 +131,7 @@ scancodeio_pipelines =
     analyze_docker_image = scanpipe.pipelines.docker:Docker
     analyze_root_filesystem_or_vm_image = scanpipe.pipelines.root_filesystem:RootFS
     analyze_windows_docker_image = scanpipe.pipelines.docker_windows:DockerWindows
+    collect_source_strings = scanpipe.pipelines.collect_source_strings:CollectSourceStrings
     collect_symbols = scanpipe.pipelines.collect_symbols:CollectSymbols
     find_vulnerabilities = scanpipe.pipelines.find_vulnerabilities:FindVulnerabilities
     inspect_elf_binaries = scanpipe.pipelines.inspect_elf_binaries:InspectELFBinaries
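
Each entry in this section maps a pipeline name to a `module:attribute` reference. The sketch below shows one way such a reference can be resolved with `importlib`. The `resolve_reference` helper is hypothetical, and a standard-library target stands in for the ScanCode.io class so the example runs without the project installed; it does not claim to be how ScanCode.io itself loads pipelines.

```python
import importlib


def resolve_reference(reference):
    """Import the object named by a "package.module:attribute" reference."""
    module_name, _, attribute = reference.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# A standard-library target is used so the sketch runs without ScanCode.io installed.
line = "ordered = collections:OrderedDict"
name, _, reference = (part.strip() for part in line.partition("="))

loaded = resolve_reference(reference)
print(name, loaded)
```

Registering the new pipeline in this way is what lets the tests refer to it by the short name `collect_source_strings`.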
