Commit bbbffbc

Update documentation
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
1 parent 6e765bf commit bbbffbc

2 files changed

+126
-59
lines changed

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ v (next)
55
--------
66

77

8-
v21.5.31
8+
v21.6.1
99
--------
1010

1111
- Add support for VMDK, QCOW and VDI VM image filesystems extraction

README.rst

Lines changed: 125 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -6,40 +6,89 @@ ExtractCode
66
- homepage_url: https://github.com/nexB/extractcode
77
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode
88

9+
Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.
910

10-
ExtractCode is a universal archive extractor. It uses behind the scenes
11-
multiple tools such as:
11+
12+
**ExtractCode is a (mostly) universal archive extractor.**
13+
14+
Install with::
15+
16+
pip install extractcode[full]
17+
18+
19+
Why another extractor?
20+
----------------------
21+
22+
**it will extract!**
23+
24+
ExtractCode will extract things where other extractors may fail.
25+
26+
- Say you want to extract the tarball of the Linux kernel source code on Windows.
27+
It contains paths that are the same when ignoring the case and therefore will
28+
not extract OK on Windows: some files may be munged or the extraction may fail.
29+
30+
- Or a tarball (on any OS) may contain multiple times the exact same path. In
31+
these cases the paths showing up earlier in the archive may be "hidden" and
32+
overwritten by the same path showing up later in the archive giving the
33+
impression that there is only one file.
34+
35+
- Or an archive may be damaged a little but most files can still be extracted.
36+
37+
- Or the extracted files have such permissions that you cannot read them and are
38+
not owned by you.
39+
40+
- Or the archive may contain weird paths including relative paths that may be
41+
problematic to extract.
42+
43+
- Or the archive may contain special file types (character/device files) that
44+
may be problematic to extract.
45+
46+
- Or an archive may be a virtual disk or some file system(s) images that would
47+
typically need to be mounted to be accessed, and may require root access
48+
and guesswork to find out which partition and filesystem are at play and
49+
which driver to use.
50+
51+
In all these cases, ExtractCode will extract and try hard to do the right thing to
52+
obtain the actual archived content when other tools may fail.
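
The first two cases above (paths that differ only by case, and the same path stored several times) can be checked for with the Python standard library alone. The following is a minimal sketch, not ExtractCode's own code: the helper names ``build_sample_tar`` and ``find_problem_paths`` and the sample paths are made up for illustration.

```python
import io
import tarfile
from collections import Counter, defaultdict


def build_sample_tar():
    """Build an in-memory tar with a repeated path and a case-only collision."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in [
            ("src/Makefile", b"first"),
            ("src/Makefile", b"second"),  # exact same path, stored twice
            ("src/makefile", b"third"),   # differs only by case
            ("src/main.c", b"int main;"),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    buffer.seek(0)
    return buffer


def find_problem_paths(fileobj):
    """Return (exact duplicate paths, groups of paths equal when ignoring case)."""
    with tarfile.open(fileobj=fileobj, mode="r") as archive:
        names = archive.getnames()

    # A tar may list the same path more than once.
    counts = Counter(names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)

    # Distinct paths that fold to the same name collide on case-insensitive
    # filesystems such as the Windows defaults.
    by_folded = defaultdict(set)
    for name in names:
        by_folded[name.casefold()].add(name)
    case_collisions = sorted(
        sorted(group) for group in by_folded.values() if len(group) > 1
    )
    return duplicates, case_collisions


duplicates, case_collisions = find_problem_paths(build_sample_tar())
print(duplicates)
print(case_collisions)
```

A plain extraction of such an archive would leave a single ``src/Makefile`` on disk, which is the situation the auto-renaming described in this README is meant to avoid.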
53+
54+
It can also extract recursively any type of (nested) archives-in-archives.
55+
56+
As a downside, the extracted content may not be exactly what would be expected
57+
to use the contained files: for instance ... but this is perfectly OK for
58+
file content analysis for software composition or forensic analysis.
59+
60+
Behind the scenes, ExtractCode uses multiple tools such as:
1261

1362
- the Python standard library,
1463
- a custom ctypes binding to libarchive,
15-
- the 7zip command line, and
64+
- the 7zip command line tool, and
1665
- optionally libguestfs on Linux.
1766

18-
With these it is possible to extract a large number of common and
19-
20-
less common archives and compressed files. ExtractCode tries to extract things
21-
in the same way on all OSes, including auto-renaming files that would not have
22-
valid names on certain filesystems or when there are multiple copies of the same
23-
path in a given archive (which is possible in a tar).
67+
With these, it is possible to extract a large number of common and less common
68+
archives and compressed file types. ExtractCode tries to extract things in the
69+
same way on all supported OSes, including auto-renaming files that would have
70+
invalid, non-extractable names on certain filesystems or when there are multiple
71+
copies of the same path in a given archive (which is possible in a tar).
2472

25-
The extraction is driven from a "voting" system that considers the
26-
file extension(s) and name, the filetype and mimetype (using a ctypes
27-
binding to libmagic) to select the most appropriate extractor or
28-
decompressor function. It can handle multi-level archives such as tar.gz and
29-
can extract recursively nested archives.
73+
The extraction is driven from a "voting" system that considers the file
74+
extension(s) and name, the filetype and mimetype (using a ctypes binding to
75+
libmagic) to select the most appropriate extractor or decompressor function.
76+
It can handle multi-level archives such as tar.gz and can extract recursively
77+
any nested archives.
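
To give a rough idea of what such a "voting" selection could look like, here is a small sketch. It is an illustration only and not ExtractCode's implementation: the tables, the weights and the ``select_handler`` name are all assumptions, and leading magic bytes stand in for the libmagic-based filetype and mimetype detection.

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only.
EXTENSION_HANDLERS = {".zip": "zip", ".gz": "gzip", ".tar": "tar"}
MAGIC_HANDLERS = {b"PK\x03\x04": "zip", b"\x1f\x8b": "gzip"}


def select_handler(filename, first_bytes):
    """Return the handler name with the most votes, or None if nothing matches."""
    votes = Counter()

    # One vote when the file name ends with a known extension.
    lowered = filename.lower()
    for extension, handler in EXTENSION_HANDLERS.items():
        if lowered.endswith(extension):
            votes[handler] += 1

    # Two votes when the content starts with a known signature: in this
    # sketch, content evidence is assumed to weigh more than the name.
    for magic, handler in MAGIC_HANDLERS.items():
        if first_bytes.startswith(magic):
            votes[handler] += 2

    if not votes:
        return None
    return votes.most_common(1)[0][0]


# A gzip stream misnamed with a .zip extension is still routed to "gzip".
print(select_handler("data.zip", b"\x1f\x8b\x08\x00"))
```

Combining several weak signals this way means that a wrong or missing extension alone does not decide which extractor is tried.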
3078

3179
Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
3280

81+
3382
We run CI tests on:
3483

3584
- Azure pipelines https://dev.azure.com/nexB/extractcode/_build
3685

37-
We run CI tests on:
3886

39-
- Azure pipelines https://dev.azure.com/nexB/extractcode/_build
87+
Installation
88+
------------
4089

4190
To install this package with its full capability (where the binaries for
42-
7zip and libarchive are installed), use the `full` option::
91+
7zip and libarchive are installed), use the `full` extra option::
4392

4493
pip install extractcode[full]
4594

@@ -48,45 +97,47 @@ system, use the `minimal` option::
4897

4998
pip install extractcode
5099

51-
In this case, you will need to provide a working libarchive and 7zip
52-
available in one of these ways:
100+
In this case, you will need to provide a working and compatible libarchive and
101+
7zip installed and configured in one of these ways such that ExtractCode can
102+
find them:
53103

54-
- **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
104+
- **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
55105
https://github.com/nexB/scancode-plugins/tree/main/builtins
56106
These can either bundle a libarchive library, a 7z executable or expose a
57107
system-installed library.
58108
It does so by providing plugin entry points as ``scancode_location_provider``
59109
for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
60-
subclass with a ``get_locations()`` method that must return a mapping with this key:
110+
subclass with a ``get_locations()`` method that must return a mapping with
111+
this key:
61112

62-
- 'extractcode.libarchive.dll': the absolute path to a libarchive DLL
113+
- 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
63114

64115
See for example:
65116

66117
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
67118
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
68119

69-
And the ``scancode_location_provider`` for ``extractcode_7zip`` should point
70-
to a ``LocationProviderPlugin`` subclass with a ``get_locations()`` method that must
71-
return a mapping with this key:
120+
And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
121+
should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
122+
method that must return a mapping with this key:
72123

73-
- 'extractcode.sevenzip.exe': the absolute path to a 7zip executable
124+
- 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
74125

75126
See for example:
76127

77128
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
78129
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
79130

80-
- **environment variables**:
131+
- use **environment variables** to point to installed binaries:
81132

82133
- EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
83134
- EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
84135

85136

86-
- **a system-installed libarchive and 7zip executable in the system PATH**:
137+
- **a system-installed libarchive and 7zip executable** available in the system **PATH**.
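
For the plugin option, a location provider could be sketched as below. This is a hedged outline and not a copy of the linked plugins: a stand-in ``LocationProviderPlugin`` base class is defined so that the snippet runs on its own, and the ``SevenzipPaths`` class name, the ``BIN_DIR`` directory and the ``mypackage`` name in the comment are hypothetical.

```python
from pathlib import Path


class LocationProviderPlugin:
    """Stand-in for the real plugin base class so this sketch runs alone."""

    def get_locations(self):
        raise NotImplementedError


# Hypothetical directory where a 7zip executable would be bundled.
BIN_DIR = Path.cwd() / "bin"


class SevenzipPaths(LocationProviderPlugin):
    """Hypothetical provider exposing the location of a 7zip executable."""

    def get_locations(self):
        # The key is the one documented above; the value must be an
        # absolute path to the executable.
        return {"extractcode.sevenzip.exe": str(BIN_DIR / "7z")}


# In the plugin's setup.py, the class would then be registered under the
# "scancode_location_provider" entry point group, along the lines of:
#   entry_points={
#       "scancode_location_provider": [
#           "extractcode_7zip = mypackage:SevenzipPaths",  # hypothetical package
#       ],
#   }

print(SevenzipPaths().get_locations())
```

A libarchive provider would have the same shape, returning the ``extractcode.libarchive.dll`` key instead.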
87138

88139

89-
The supported versions are:
140+
The supported versions of the binary tools are:
90141

91142
- libarchive 3.5.x
92143
- 7zip 16.5.x
@@ -95,10 +146,9 @@ The supported versions are:
95146
Development
96147
-----------
97148

98-
99149
To set up the development environment::
100150

101-
source configure
151+
source configure --dev
102152

103153

104154
To run unit tests::
@@ -116,18 +166,43 @@ To run the command line tool in the activated environment::
116166
./extractcode -h
117167

118168

169+
Configuration with environment variables
170+
----------------------------------------
171+
172+
ExtractCode will use these environment variables if set:
173+
174+
- EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
175+
shared library used to support some of the archive formats. If not provided,
176+
ExtractCode will look for a plugin-provided libarchive library path. See
177+
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
178+
If no plugin contributes libarchive, then a final attempt is made to look for
179+
it in the PATH using standard DLL loading techniques.
180+
181+
- EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
182+
some of the archive formats. If not provided, ExtractCode will look for a
183+
plugin-provided 7z executable path. See
184+
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
185+
If no plugin contributes 7z, then a final attempt is made to look for
186+
it in the PATH.
187+
188+
- EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
189+
libguestfs to use to extract VM images. If not provided, ExtractCode will look
190+
in the PATH for an installed ``guestfish`` executable instead.
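
The lookup order described above for 7zip (environment variable first, then a plugin-provided location, then the PATH) can be summarized with this minimal sketch. It is not ExtractCode's actual lookup code: the ``find_7z`` function and its ``plugin_locations`` and ``environ`` parameters are assumptions made for the example.

```python
import os
import shutil


def find_7z(plugin_locations=None, environ=None):
    """Return a path to a 7z executable, or None if none can be found.

    ``plugin_locations`` is a mapping such as the one returned by a location
    provider plugin; ``environ`` defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    plugin_locations = plugin_locations or {}

    # 1. An explicit environment variable wins.
    from_env = environ.get("EXTRACTCODE_7Z_PATH")
    if from_env:
        return from_env

    # 2. Otherwise use a plugin-provided location, if any.
    from_plugin = plugin_locations.get("extractcode.sevenzip.exe")
    if from_plugin:
        return from_plugin

    # 3. Final attempt: search the system PATH.
    return shutil.which("7z")


plugin = {"extractcode.sevenzip.exe": "/plugins/7z"}
print(find_7z(plugin, {"EXTRACTCODE_7Z_PATH": "/custom/7z"}))
print(find_7z(plugin, {}))
```

The same three-step order applies to libarchive with ``EXTRACTCODE_LIBARCHIVE_PATH``, while ``guestfish`` is described as having only the environment variable and PATH steps.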
191+
192+
193+
119194
Adding support for VM images extraction
120195
---------------------------------------
121196

122-
Adding support for VM images requires the manual installation of libguestfs
123-
tools system package. This is suport on Linux only. On Debian and Ubuntu you can
124-
use this::
197+
Adding support for VM images requires the manual installation of the
198+
libguestfs-tools system package. This is supported only on Linux.
199+
On Debian and Ubuntu you can use this command::
125200

126201
sudo apt-get install libguestfs-tools
127202

128203

129204
On Ubuntu only, an additional manual step is required as the kernel executable
130-
file cannot be read as required by libguestfish.
205+
file cannot be read by users as required by libguestfish.
131206

132207
Run this command as a temporary and immediate fix::
133208

@@ -136,10 +211,9 @@ Run this command as a temporary and immediate fix::
136211
do sudo dpkg-statoverride --add --update root root 0644 /boot/vmlinuz-$k
137212
done
138213

139-
140-
But you likely want both this temporary fix and a permanent fix; otherwise each
141-
kernel update will revert to the default permissions and extractcode will stop
142-
working for VM images extraction.
214+
You likely want both this temporary fix and a more permanent fix; otherwise each
215+
kernel update will revert to the default permissions and ExtractCode will stop
216+
working for VM images extraction.
143217

144218
Therefore follow these instructions:
145219

@@ -164,26 +238,19 @@ See also these links for a complete discussion:
164238
- https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
165239

166240

167-
Configuration with environment variables
168-
----------------------------------------
241+
Alternative
242+
-----------
169243

170-
ExtractCode will use these environment variables if set:
244+
These other tools are related and were considered before creating ExtractCode:
171245

172-
- EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
173-
libguestfs to use to extract VM images. If not provided, ExtractCode will look
174-
in the PATH for an installed ``guestfish`` executable instead.
246+
These tools provide built-in, original extraction capabilities:
175247

176-
- EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
177-
shared library used to support some of the archive formats. If not provided,
178-
ExtractCode will look for a plugin-provided libarchive library path. See
179-
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
180-
If no plugin contributes libarchive, then a final attempt is made to look for
181-
it in the PATH using standard DLL loading techniques.
248+
- https://libarchive.org/ (integrated in ExtractCode) (BSD license)
249+
- https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
250+
- https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
182251

183-
- EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
184-
some of the archive formats. If not provided, ExtractCode will look for a
185-
plugin-provided 7z executable path. See
186-
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
187-
If no plugin contributes 7z, then a final attempt is made to look for
188-
it in the PATH.
189-
252+
These tools are command line tools wrapping other extraction tools and are
253+
similar to ExtractCode but with different goals:
254+
255+
- https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
256+
- https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)
