Commit fabf1f2

Merge pull request #20 from nexB/16-vm-images
Extract vm images #16
2 parents 78994e6 + 4a8ef69 commit fabf1f2

43 files changed: 2089 additions and 814 deletions

.gitattributes

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,2 +1,3 @@
11
# Ignore all Git auto CR/LF line endings conversions
2-
* binary
2+
* -text
3+
pyproject.toml export-subst

.travis.yml

Lines changed: 3 additions & 2 deletions
@@ -13,9 +13,10 @@ python:
1313
- "3.6"
1414
- "3.7"
1515
- "3.8"
16+
- "3.9"
1617

1718
# Scripts to run at install stage
18-
install: ./configure
19+
install: ./configure --dev
1920

2021
# Scripts to run at script stage
21-
script: tmp/bin/pytest
22+
script: tmp/bin/pytest --ignore=tests/test_vmimage.py

AUTHORS.rst

Lines changed: 3 additions & 1 deletion
@@ -2,10 +2,12 @@ The following organizations or individuals have contributed to this repo:
22

33
- Abhishek Kumar @Abhishek-Dev09
44
- AlexB @a-tinsmith
5+
- Konrad Weihmann @priv-kweihmann
56
- Maximilian Huber @maxhbr
67
- Michael Rupprecht @michaelrup
78
- Philippe Ombredanne @pombredanne
9+
- Pierre Tardy @tardyp
810
- Qingmin Duanmu @qduanmu
911
- Rakesh Balusa @balusarakesh
1012
- Ravi Jain @JRavi2
11-
- Steven Esser @majurg
13+
- Steven Esser @majurg

CHANGELOG.rst

Lines changed: 31 additions & 17 deletions
@@ -1,36 +1,50 @@
1-
Release notes
2-
=============
1+
Changelog
2+
=========
33

4-
vNext
5-
-----
4+
v (next)
5+
--------
66

77

8-
Version 21.1.21
9-
---------------
8+
v21.6.1
9+
--------
1010

11-
- Bump dependencies and use latest typecode and binaries. This is to fix
12-
installation problems on multiple OSes.
11+
- Add support for VMDK, QCOW and VDI VM image filesystems extraction
12+
- Add new configuration mechanism to get third-party binary paths:
1313

14+
- Use an environment variable
15+
- Or use a plugin-provided path
16+
- Or use well-known system installation locations
17+
- Or use the system PATH
18+
- Or fail with an informative error message
1419

15-
Version 21.1.21
16-
---------------
20+
- Update to use latest skeleton
1721

18-
- Add new [full] extra requires that install all the dependencies
19-
- Fix bug related to commoncode libraries loading
22+
23+
v2021-2-24
24+
----------
25+
26+
- Fix incorrect documentation link
27+
28+
29+
v2021-1-21
30+
----------
31+
32+
- Fix bug related to CommonCode libraries loading
2033
- Improve the extra requirements
2134
- Set minimum version for dependencies
2235
- Improve documentation
36+
- Reorganize test files
2337

2438

25-
Version 21.1.15
26-
---------------
39+
v2021-1-15
40+
----------
2741

2842
- Drop support for Python 2
2943
- Use the latest CommonCode and TypeCode libraries
3044
- Add azure-pipelines CI support
3145

3246

33-
Version 20.10
34-
-------------
47+
v20.10
48+
------
3549

36-
- Initial release.
50+
- Initial release as a split from ScanCode toolkit

NOTICE

Lines changed: 5 additions & 16 deletions
@@ -1,19 +1,8 @@
11
#
2-
# Copyright (c) nexB Inc. and others.
3-
# SPDX-License-Identifier: Apache-2.0
4-
#
5-
# Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
2+
# Copyright (c) nexB Inc. and others. All rights reserved.
63
# ScanCode is a trademark of nexB Inc.
7-
#
8-
# Licensed under the Apache License, Version 2.0 (the "License");
9-
# you may not use this file except in compliance with the License.
10-
# You may obtain a copy of the License at
11-
#
12-
# http://www.apache.org/licenses/LICENSE-2.0
13-
#
14-
# Unless required by applicable law or agreed to in writing, software
15-
# distributed under the License is distributed on an "AS IS" BASIS,
16-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17-
# See the License for the specific language governing permissions and
18-
# limitations under the License.
4+
# SPDX-License-Identifier: Apache-2.0
5+
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
6+
# See https://github.com/nexB/extractcode for support or download.
7+
# See https://aboutcode.org for more information about nexB OSS projects.
198
#

README.rst

Lines changed: 231 additions & 13 deletions
@@ -4,35 +4,253 @@ ExtractCode
44
- license: Apache-2.0
55
- copyright: copyright (c) nexB. Inc. and others
66
- homepage_url: https://github.com/nexB/extractcode
7-
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit
7+
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode
88

9+
Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.
910

10-
ExtractCode is a universal archive extractor. It uses behind the scenes
11-
the Python standard library, a custom ctypes binding to libarchive and
12-
the 7zip command line to extract a large number of common and
13-
less common archives and compressed files. It tries to extract things
14-
in the same way on all OSes, including auto-renaming files that would
15-
not have valid names on certain filesystems or when there are multiple
16-
copies of the same path in a given archive.
17-
The extraction is driven from a "voting" system that considers the
18-
file extension(s) and name, the file type and mime type (using a ctypes
19-
binding to libmagic) to select the most appropriate extractor or
20-
uncompressor function. It can handle multi-level archives such as tar.gz.
2111

12+
**ExtractCode is a (mostly) universal archive extractor.**
2213

14+
Install with::
15+
16+
pip install extractcode[full]
17+
18+
19+
Why another extractor?
20+
----------------------
21+
22+
**it will extract!**
23+
24+
ExtractCode will extract things where other extractors may fail.
25+
26+
- Say you want to extract the tarball of the Linux kernel source code on Windows.
27+
It contains paths that are the same when ignoring the case and therefore will
28+
not extract OK on Windows: some files may be munged or the extraction may fail.
29+
30+
- Or a tarball (on any OS) may contain the exact same path multiple times. In
31+
these cases the paths showing up earlier in the archive may be "hidden" and
32+
overwritten by the same path showing up later in the archive giving the
33+
impression that there is only one file.
34+
35+
- Or an archive may be damaged a little but most files can still be extracted.
36+
37+
- Or the extracted files have permissions such that you cannot read them and are
38+
not owned by you.
39+
40+
- Or the archive may contain weird paths including relative paths that may be
41+
problematic to extract.
42+
43+
- Or the archive may contain special file types (character/device files) that
44+
may be problematic to extract.
45+
46+
- Or an archive may be a virtual disk or some file system(s) images that would
47+
typically need to be mounted to be accessed, and may require root access
48+
and guesswork to find out which partition and filesystem are at play and
49+
which driver to use.
50+
51+
In all these cases, ExtractCode will extract and try hard to do the right thing to
52+
obtain the actual archived content when other tools may fail.
53+
54+
It can also recursively extract any type of (nested) archives-in-archives.
55+
56+
As a downside, the extracted content may not be exactly what would be expected
57+
to use the contained files: for instance ... but this is perfectly OK for
58+
file content analysis for software composition or forensic analysis.
59+
60+
Behind the scenes, ExtractCode uses multiple tools such as:
61+
62+
- the Python standard library,
63+
- a custom ctypes binding to libarchive,
64+
- the 7zip command line tool, and
65+
- optionally libguestfs on Linux.
66+
67+
With these, it is possible to extract a large number of common and less common
68+
archives and compressed file types. ExtractCode tries to extract things in the
69+
same way on all supported OSes, including auto-renaming files that would have
70+
invalid, non-extractible names on certain filesystems or when there are multiple
71+
copies of the same path in a given archive (which is possible in a tar).
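The duplicate-path case can be illustrated with the Python standard library alone. This is a minimal sketch and not ExtractCode's own implementation: the helper names and the ``_1`` suffix renaming scheme are assumptions made for the example. It builds an in-memory tar holding the same path twice, then reads every member back under a unique name so that the earlier copy is not hidden by the later one.

```python
import io
import tarfile


def build_tar_with_duplicates():
    """Return an in-memory tar that contains the path 'a.txt' twice."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for content in (b"first", b"second"):
            info = tarfile.TarInfo(name="a.txt")
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    buf.seek(0)
    return buf


def read_all_members(fileobj):
    """Map a unique name to the content of each regular file in the tar.

    Repeated paths get a hypothetical numeric suffix instead of
    overwriting the copy seen earlier in the archive.
    """
    seen = {}
    results = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            count = seen.get(member.name, 0)
            seen[member.name] = count + 1
            name = member.name if count == 0 else f"{member.name}_{count}"
            results[name] = tar.extractfile(member).read()
    return results


contents = read_all_members(build_tar_with_duplicates())
```

A plain ``extractall()`` on the same archive would leave a single ``a.txt`` on disk. A real renaming scheme would also need to check that the suffixed name does not collide with another path in the archive.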
72+
73+
The extraction is driven from a "voting" system that considers the file
74+
extension(s) and name, the filetype and mimetype (using a ctypes binding to
75+
libmagic) to select the most appropriate extractor or decompressor function.
76+
It can handle multi-level archives such as tar.gz and can extract recursively
77+
any nested archives.
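The idea of such a "voting" selection can be sketched as follows. This is an illustration only, under stated assumptions: the handler table, the scoring weights and the ``pick_handler`` name are hypothetical, and magic bytes are checked directly instead of going through libmagic. Each piece of evidence (file extension, leading bytes) adds one vote, and the handler with the most votes wins.

```python
# Hypothetical handler table: the names, extensions and signatures are
# chosen for the example and are not ExtractCode's actual registry.
HANDLERS = [
    {"name": "zip", "extensions": (".zip", ".jar"), "magic": b"PK\x03\x04"},
    {"name": "gzip", "extensions": (".gz", ".tgz"), "magic": b"\x1f\x8b"},
]


def pick_handler(filename, head):
    """Return the name of the handler with the most votes, or None.

    `head` holds the first bytes of the file. On a tie, the handler
    listed first in HANDLERS is kept.
    """
    best_name, best_score = None, 0
    lowered = filename.lower()
    for handler in HANDLERS:
        score = 0
        if lowered.endswith(handler["extensions"]):
            score += 1  # one vote from the file name
        if head.startswith(handler["magic"]):
            score += 1  # one vote from the file content
        if score > best_score:
            best_name, best_score = handler["name"], score
    return best_name
```

For example, ``pick_handler("linux.tar.gz", b"\x1f\x8b\x08")`` selects the gzip handler; a second pass over the decompressed output would then pick a tar handler, which is how multi-level archives can be unwrapped one layer at a time.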
2378

2479
Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
2580

81+
82+
We run CI tests on:
83+
84+
- Azure pipelines https://dev.azure.com/nexB/extractcode/_build
85+
86+
87+
Installation
88+
------------
89+
90+
To install this package with its full capability (where the binaries for
91+
7zip and libarchive are installed), use the `full` extra option::
92+
93+
pip install extractcode[full]
94+
95+
If you want to use the version of binaries (possibly) provided by your operating
96+
system, use the `minimal` option::
97+
98+
pip install extractcode
99+
100+
In this case, you will need to provide a working and compatible libarchive and
101+
7zip installed and configured in one of these ways such that ExtractCode can
102+
find them:
103+
104+
- **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
105+
https://github.com/nexB/scancode-plugins/tree/main/builtins
106+
These can either bundle a libarchive library or a 7z executable, or expose
107+
system-installed ones.
108+
Each plugin does so by providing a ``scancode_location_provider`` entry point
109+
for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
110+
subclass with a ``get_locations()`` method that must return a mapping with
111+
this key:
112+
113+
- 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
114+
115+
See for example:
116+
117+
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
118+
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
119+
120+
And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
121+
should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
122+
method that must return a mapping with this key:
123+
124+
- 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
125+
126+
See for example:
127+
128+
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
129+
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
130+
131+
- use **environment variables** to point to installed binaries:
132+
133+
- EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
134+
- EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
135+
136+
137+
- **a system-installed libarchive and 7zip executable** available in the system **PATH**.
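The plugin option above can be sketched as follows. This is a hedged illustration and not the code of the standard plugins: a stand-in base class is defined locally so that the example runs on its own, and the ``SystemLibarchivePaths`` class name and the library path are assumptions. Only the ``get_locations()`` method name and the ``extractcode.libarchive.dll`` key come from the description above.

```python
# Stand-in for the real LocationProviderPlugin base class, which is not
# imported here so that the sketch stays self-contained.
class LocationProviderPlugin:
    def get_locations(self):
        raise NotImplementedError


# Example path only: adjust it to where libarchive lives on your system.
LIBARCHIVE_PATH = "/usr/lib/x86_64-linux-gnu/libarchive.so.13"


class SystemLibarchivePaths(LocationProviderPlugin):
    """Hypothetical provider that exposes a system-installed libarchive."""

    def get_locations(self):
        # The mapping key is the one documented above; the value must be
        # an absolute path to the shared object/DLL.
        return {"extractcode.libarchive.dll": LIBARCHIVE_PATH}


locations = SystemLibarchivePaths().get_locations()
```

A real plugin would additionally be registered in its packaging metadata under the ``scancode_location_provider`` entry point group; the linked ``setup.py`` files show how the standard plugins do this.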
138+
139+
140+
The supported binary tools versions are:
141+
142+
- libarchive 3.5.x
143+
- 7zip 16.5.x
144+
145+
146+
Development
147+
-----------
148+
26149
To set up the development environment::
27150

28-
source configure
151+
source configure --dev
152+
29153

30154
To run unit tests::
31155

32156
pytest -vvs -n 2
33157

158+
34159
To clean up development environment::
35160

36161
./configure --clean
37162

38163

164+
To run the command line tool in the activated environment::
165+
166+
./extractcode -h
167+
168+
169+
Configuration with environment variables
170+
----------------------------------------
171+
172+
ExtractCode will use these environment variables if set:
173+
174+
- EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
175+
shared library used to support some of the archive formats. If not provided,
176+
ExtractCode will look for a plugin-provided libarchive library path. See
177+
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
178+
If no plugin contributes libarchive, then a final attempt is made to look for
179+
it in the PATH using standard DLL loading techniques.
180+
181+
- EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
182+
some of the archive formats. If not provided, ExtractCode will look for a
183+
plugin-provided 7z executable path. See
184+
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
185+
If no plugin contributes 7z, then a final attempt is made to look for
186+
it in the PATH.
187+
188+
- EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
189+
libguestfs to use to extract VM images. If not provided, ExtractCode will look
190+
in the PATH for an installed ``guestfish`` executable instead.
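The lookup order described for ``guestfish`` (environment variable first, then the PATH) can be sketched with the standard library. This is a minimal sketch under assumptions: the ``find_guestfish`` function name and its ``environ`` parameter are hypothetical and do not describe ExtractCode's internal API. The libarchive and 7z lookups have an extra plugin step in between that is not shown here.

```python
import os
import shutil


def find_guestfish(environ=None):
    """Return a path to guestfish, or None if it cannot be found.

    `environ` defaults to os.environ; passing a plain dict makes the
    lookup easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit path set through the environment variable wins.
    path = environ.get("EXTRACTCODE_GUESTFISH_PATH")
    if path:
        return path

    # 2. Otherwise fall back to an executable found in the system PATH.
    return shutil.which("guestfish")
```

A caller would typically turn a ``None`` result into an informative error message that names the environment variable to set.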
191+
192+
193+
194+
Adding support for VM images extraction
195+
---------------------------------------
196+
197+
Adding support for VM images requires the manual installation of the
198+
libguestfs-tools system package. This is supported only on Linux.
199+
On Debian and Ubuntu you can use this command::
200+
201+
sudo apt-get install libguestfs-tools
202+
203+
204+
On Ubuntu only, an additional manual step is required as the kernel executable
205+
file cannot be read by users as required by libguestfs.
206+
207+
Run this command as a temporary and immediate fix::
208+
209+
sudo chmod 0644 /boot/vmlinuz-*
210+
for k in /boot/vmlinuz-*
211+
do sudo dpkg-statoverride --add --update root root 0644 "$k"
212+
done
213+
214+
You likely want both this temporary fix and a more permanent fix; otherwise each
215+
kernel update will revert to the default permissions and ExtractCode will stop
216+
working for VM images extraction.
217+
218+
Therefore follow these instructions:
219+
220+
1. Using sudo, create the file /etc/kernel/postinst.d/statoverride with this
221+
content, devised by Kees Cook (@kees) in
222+
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::
223+
224+
#!/bin/sh
225+
version="$1"
226+
# passing the kernel version is required
227+
[ -z "${version}" ] && exit 0
228+
dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}
229+
230+
2. Set executable permissions::
231+
232+
sudo chmod +x /etc/kernel/postinst.d/statoverride
233+
234+
See also these links for a complete discussion:
235+
236+
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725
237+
- https://bugzilla.redhat.com/show_bug.cgi?id=1670790
238+
- https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
239+
240+
241+
Alternatives
242+
------------
243+
244+
These other tools are related and were considered before creating ExtractCode:
245+
246+
These tools provide built-in, original extraction capabilities:
247+
248+
- https://libarchive.org/ (integrated in ExtractCode) (BSD license)
249+
- https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
250+
- https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
251+
252+
These tools are command line tools wrapping other extraction tools and are
253+
similar to ExtractCode but with different goals:
254+
255+
- https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
256+
- https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)
