@@ -6,40 +6,89 @@ ExtractCode
6
6
- homepage_url: https://github.com/nexB/extractcode
7
7
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode
8
8
9
+ Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.
9
10
10
- ExtractCode is a universal archive extractor. It uses behind the scenes
11
- multiple tools such as:
11
+
12
+ **ExtractCode is a (mostly) universal archive extractor.**
13
+
14
+ Install with::
15
+
16
+ pip install extractcode[full]
17
+
18
+
19
+ Why another extractor?
20
+ ----------------------
21
+
22
+ **It will extract!**
23
+
24
+ ExtractCode will extract things where other extractors may fail.
25
+
26
+ - Say you want to extract the tarball of the Linux kernel source code on Windows.
27
+ It contains paths that are the same when ignoring the case and therefore will
28
+ not extract OK on Windows: some files may be munged or the extraction may fail.
29
+
30
+ - Or a tarball (on any OS) may contain the exact same path multiple times. In
31
+ these cases the paths showing up earlier in the archive may be "hidden" and
32
+ overwritten by the same path showing up later in the archive giving the
33
+ impression that there is only one file.
34
+
35
+ - Or an archive may be damaged a little but most files can still be extracted.
36
+
37
+ - Or the extracted files have such permissions that you cannot read them and are
38
+ not owned by you.
39
+
40
+ - Or the archive may contain weird paths including relative paths that may be
41
+ problematic to extract.
42
+
43
+ - Or the archive may contain special file types (character/device files) that
44
+ may be problematic to extract.
45
+
46
+ - Or an archive may be a virtual disk or some file system(s) images that would
47
+ typically need to be mounted to be accessed, and may require root access
48
+ and guesswork to find out which partition and filesystem are at play and
49
+ which driver to use.
50
+
51
+ In all these cases, ExtractCode will extract and try hard to do the right thing to
52
+ obtain the actual archived content when other tools may fail.
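
The duplicate-path case from the list above can be reproduced with the Python standard library alone. The following is a minimal sketch, not ExtractCode code: it builds an in-memory tar in which ``docs/readme.txt`` (a made-up path) appears twice, then lists every member. Both helper names are hypothetical and exist only for this illustration.

```python
import io
import tarfile


def build_tar_with_duplicates():
    """Return the bytes of an in-memory tar that holds the same path twice."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for payload in (b"first version", b"second version"):
            # Hypothetical path, reused on purpose for both entries.
            info = tarfile.TarInfo(name="docs/readme.txt")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members(tar_bytes):
    """Return (name, content) pairs for every member, repeated paths included."""
    members = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        for info in archive:
            # Passing the TarInfo object reads that exact entry,
            # not merely the last entry sharing its name.
            members.append((info.name, archive.extractfile(info).read()))
    return members


members = list_members(build_tar_with_duplicates())
names = [name for name, _ in members]
print(names)
```

The archive holds two distinct entries under one path. A plain extraction to disk would typically leave a single file with the later content, which is the "hidden" file effect described above.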
53
+
54
+ It can also recursively extract any type of (nested) archives-in-archives.
55
+
56
+ As a downside, the extracted content may not be exactly what would be expected
57
+ to use the contained files: for instance ... but this is perfectly OK for
58
+ file content analysis for software composition or forensic analysis.
59
+
60
+ Behind the scenes, ExtractCode uses multiple tools such as:
12
61
13
62
- the Python standard library,
14
63
- a custom ctypes binding to libarchive,
15
- - the 7zip command line, and
64
+ - the 7zip command line tool, and
16
65
- optionally libguestfs on Linux.
17
66
18
- With these it is possible to extract a large number of common and
19
-
20
- less common archives and compressed files. ExtractCode tries to extract things
21
- in the same way on all OSes, including auto-renaming files that would not have
22
- valid names on certain filesystems or when there are multiple copies of the same
23
- path in a given archive (which is possible in a tar).
67
+ With these, it is possible to extract a large number of common and less common
68
+ archives and compressed file types. ExtractCode tries to extract things in the
69
+ same way on all supported OSes, including auto-renaming files that would have
70
+ invalid, non-extractible names on certain filesystems or when there are multiple
71
+ copies of the same path in a given archive (which is possible in a tar).
24
72
25
- The extraction is driven from a "voting" system that considers the
26
- file extension(s) and name, the filetype and mimetype (using a ctypes
27
- binding to libmagic) to select the most appropriate extractor or
28
- decompressor function. It can handle multi-level archives such as tar.gz and
29
- can extract recursively nested archives.
73
+ The extraction is driven from a "voting" system that considers the file
74
+ extension(s) and name, the filetype and mimetype (using a ctypes binding to
75
+ libmagic) to select the most appropriate extractor or decompressor function.
76
+ It can handle multi-level archives such as tar.gz and can extract recursively
77
+ any nested archives.
30
78
31
79
Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
32
80
81
+
33
82
We run CI tests on:
34
83
35
84
- Azure pipelines https://dev.azure.com/nexB/extractcode/_build
36
85
37
- We run CI tests on:
38
86
39
- - Azure pipelines https://dev.azure.com/nexB/extractcode/_build
87
+ Installation
88
+ ------------
40
89
41
90
To install this package with its full capability (where the binaries for
42
- 7zip and libarchive are installed), use the `full` option::
91
+ 7zip and libarchive are installed), use the `full` extra option::
43
92
44
93
pip install extractcode[full]
45
94
@@ -48,45 +97,47 @@ system, use the `minimal` option::
48
97
49
98
pip install extractcode
50
99
51
- In this case, you will need to provide a working libarchive and 7zip
52
- available in one of these ways:
100
+ In this case, you will need to provide a working and compatible libarchive and
101
+ 7zip installed and configured in one of these ways such that ExtractCode can
102
+ find them:
53
103
54
- - **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
104
+ - **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
55
105
https://github.com/nexB/scancode-plugins/tree/main/builtins
56
106
These can either bundle a libarchive library, a 7z executable or expose
57
107
system-installed libraries.
58
108
It does so by providing plugin entry points as ``scancode_location_provider``
59
109
for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
60
- subclass with a ``get_locations()`` method that must return a mapping with this key:
110
+ subclass with a ``get_locations()`` method that must return a mapping with
111
+ this key:
61
112
62
- - 'extractcode.libarchive.dll': the absolute path to a libarchive DLL
113
+ - 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
63
114
64
115
See for example:
65
116
66
117
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
67
118
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
68
119
69
- And the ``scancode_location_provider`` for ``extractcode_7zip`` should point
70
- to a ``LocationProviderPlugin`` subclass with a ``get_locations()`` method that must
71
- return a mapping with this key:
120
+ And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
121
+ should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
122
+ method that must return a mapping with this key:
72
123
73
- - 'extractcode.sevenzip.exe': the absolute path to a 7zip executable
124
+ - 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
74
125
75
126
See for example:
76
127
77
128
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
78
129
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
79
130
80
- - **environment variables**:
131
+ - use **environment variables** to point to installed binaries:
81
132
82
133
- EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
83
134
- EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
84
135
85
136
86
- - **a system-installed libarchive and 7zip executable in the system PATH**:
137
+ - **a system-installed libarchive and 7zip executable** available in the system **PATH**.
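
The plugin option above can be pictured with a short sketch of a location provider. This is an illustration under stated assumptions rather than the actual plugin code: the ``LocationProviderPlugin`` class defined here is only a stand-in so the example runs on its own (a real plugin would subclass the real base class and register it under the ``scancode_location_provider`` entry point), and ``SystemSevenzipPaths`` is a hypothetical name.

```python
import shutil


class LocationProviderPlugin:
    """Stand-in for the real base class, so this sketch is self-contained."""

    def get_locations(self):
        raise NotImplementedError


class SystemSevenzipPaths(LocationProviderPlugin):
    """Hypothetical provider that exposes a system-installed 7z executable."""

    def get_locations(self):
        # The key is the one documented above. shutil.which returns None
        # when no "7z" executable is found on the PATH.
        return {"extractcode.sevenzip.exe": shutil.which("7z")}


locations = SystemSevenzipPaths().get_locations()
print(locations)
```

A provider for libarchive would follow the same shape and return the ``extractcode.libarchive.dll`` key instead.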
87
138
88
139
89
- The supported versions are:
140
+ The supported binary tool versions are:
90
141
91
142
- libarchive 3.5.x
92
143
- 7zip 16.5.x
@@ -95,10 +146,9 @@ The supported versions are:
95
146
Development
96
147
-----------
97
148
98
-
99
149
To set up the development environment::
100
150
101
- source configure
151
+ source configure --dev
102
152
103
153
104
154
To run unit tests::
@@ -116,18 +166,43 @@ To run the command line tool in the activated environment::
116
166
./extractcode -h
117
167
118
168
169
+ Configuration with environment variables
170
+ ----------------------------------------
171
+
172
+ ExtractCode will use these environment variables if set:
173
+
174
+ - EXTRACTCODE_LIBARCHIVE_PATH: the path to the ``libarchive.so`` libarchive
175
+ shared library used to support some of the archive formats. If not provided,
176
+ ExtractCode will look for a plugin-provided libarchive library path. See
177
+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
178
+ If no plugin contributes libarchive, then a final attempt is made to look for
179
+ it in the PATH using standard DLL loading techniques.
180
+
181
+ - EXTRACTCODE_7Z_PATH: the path to the ``7z`` 7zip executable used to support
182
+ some of the archive formats. If not provided, ExtractCode will look for a
183
+ plugin-provided 7z executable path. See
184
+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
185
+ If no plugin contributes 7z, then a final attempt is made to look for
186
+ it in the PATH.
187
+
188
+ - EXTRACTCODE_GUESTFISH_PATH: the path to the ``guestfish`` tool from
189
+ libguestfs to use to extract VM images. If not provided, ExtractCode will look
190
+ in the PATH for an installed ``guestfish`` executable instead.
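
The lookup order described above (environment variable first, then a plugin-provided location, then the PATH) can be sketched in a few lines. This is a simplified illustration and not ExtractCode's own implementation: ``resolve_tool_path`` and its parameters are hypothetical, and the PATH step uses ``shutil.which``, which fits an executable such as ``7z`` but not the DLL loading used for libarchive.

```python
import os
import shutil


def resolve_tool_path(env_var, plugin_locations, location_key, executable_name):
    """
    Return a path to a binary tool or None, trying in order:
    the environment variable, a plugin-provided mapping, then the PATH.
    """
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env

    # plugin_locations stands for the mapping a plugin's get_locations()
    # would return; an empty dict means no plugin contributes a path.
    from_plugin = plugin_locations.get(location_key)
    if from_plugin:
        return from_plugin

    # Final attempt: search the PATH for the executable.
    return shutil.which(executable_name)


seven_zip = resolve_tool_path(
    env_var="EXTRACTCODE_7Z_PATH",
    plugin_locations={},
    location_key="extractcode.sevenzip.exe",
    executable_name="7z",
)
print(seven_zip)
```

The printed value depends on the machine: it is the value of ``EXTRACTCODE_7Z_PATH`` if set, otherwise whatever ``7z`` is found on the PATH, otherwise ``None``.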
191
+
192
+
193
+
119
194
Adding support for VM images extraction
120
195
---------------------------------------
121
196
122
- Adding support for VM images requires the manual installation of libguestfs
123
- tools system package. This is suport on Linux only. On Debian and Ubuntu you can
124
- use this::
197
+ Adding support for VM images requires the manual installation of the
198
+ libguestfs-tools system package. This is supported only on Linux.
199
+ On Debian and Ubuntu you can use this command::
125
200
126
201
sudo apt-get install libguestfs-tools
127
202
128
203
129
204
On Ubuntu only, an additional manual step is required as the kernel executable
130
- file cannot be read as required by libguestfish.
205
+ file cannot be read by users as required by libguestfish.
131
206
132
207
Run this command as a temporary and immediate fix::
133
208
@@ -136,10 +211,9 @@ Run this command as a temporary and immediate fix::
136
211
do sudo dpkg-statoverride --add --update root root 0644 /boot/vmlinuz-$k
137
212
done
138
213
139
-
140
- But you likely want both this temporary fix and a permanent fix; otherwise each
141
- kernel update will revert to the default permissions and extractcode will stop
142
- working for VM images extraction.
214
+ You likely want both this temporary fix and a more permanent fix; otherwise each
215
+ kernel update will revert to the default permissions and ExtractCode will stop
216
+ working for VM images extraction.
143
217
144
218
Therefore follow these instructions:
145
219
@@ -164,26 +238,19 @@ See also these links for a complete discussion:
164
238
- https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
165
239
166
240
167
- Configuration with environment variables
168
- ----------------------------------------
241
+ Alternative
242
+ -----------
169
243
170
- ExtractCode will use these environment variables if set:
244
+ These other tools are related and were considered before creating ExtractCode:
171
245
172
- - EXTRACTCODE_GUESTFISH_PATH: the path to the ``guestfish`` tool from
173
- libguestfs to use to extract VM images. If not provided, ExtractCode will look
174
- in the PATH for an installed ``guestfish`` executable instead.
246
+ These tools provide built-in, original extraction capabilities:
175
247
176
- - EXTRACTCODE_LIBARCHIVE_PATH: the path to the ``libarchive.so`` libarchive
177
- shared library used to support some of the archive formats. If not provided,
178
- ExtractCode will look for a plugin-provided libarchive library path. See
179
- https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
180
- If no plugin contributes libarchive, then a final attempt is made to look for
181
- it in the PATH using standard DLL loading techniques.
248
+ - https://libarchive.org/ (integrated in ExtractCode) (BSD license)
249
+ - https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
250
+ - https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
182
251
183
- - EXTRACTCODE_7Z_PATH: the path to the ``7z`` 7zip executable used to support
184
- some of the archive formats. If not provided, ExtractCode will look for a
185
- plugin-provided 7z executable path. See
186
- https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
187
- If no plugin contributes 7z, then a final attempt is made to look for
188
- it in the PATH.
189
-
252
+ These tools are command line tools wrapping other extraction tools and are
253
+ similar to ExtractCode but with different goals:
254
+
255
+ - https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
256
+ - https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)