@@ -4,35 +4,253 @@ ExtractCode
- license: Apache-2.0
- copyright: copyright (c) nexB. Inc. and others
- homepage_url: https://github.com/nexB/extractcode
- - keywords: archive, extraction, libarchive, 7zip, scancode-toolkit
+ - keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode

+ Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.

- ExtractCode is a universal archive extractor. It uses behind the scenes
- the Python standard library, a custom ctypes binding to libarchive and
- the 7zip command line to extract a large number of common and
- less common archives and compressed files. It tries to extract things
- in the same way on all OSes, including auto-renaming files that would
- not have valid names on certain filesystems or when there are multiple
- copies of the same path in a given archive.
- The extraction is driven from a "voting" system that considers the
- file extension(s) and name, the file type and mime type (using a ctypes
- binding to libmagic) to select the most appropriate extractor or
- uncompressor function. It can handle multi-level archives such as tar.gz.
+ **ExtractCode is a (mostly) universal archive extractor.**

+ Install with::
+
+     pip install extractcode[full]
+
+
+ Why another extractor?
+ ----------------------
+
+ **It will extract!**
+
+ ExtractCode will extract things where other extractors may fail.
+
+ - Say you want to extract the tarball of the Linux kernel source code on Windows.
+   It contains paths that are the same when ignoring case and therefore will
+   not extract correctly on Windows: some files may be overwritten or the
+   extraction may fail.
+
+ - Or a tarball (on any OS) may contain the exact same path multiple times. In
+   these cases the paths showing up earlier in the archive may be "hidden" and
+   overwritten by the same path showing up later in the archive, giving the
+   impression that there is only one file.
+
+ - Or an archive may be slightly damaged but most files can still be extracted.
+
+ - Or the extracted files have permissions such that you cannot read them, or
+   they are not owned by you.
+
+ - Or the archive may contain weird paths, including relative paths, that may be
+   problematic to extract.
+
+ - Or the archive may contain special file types (character/device files) that
+   may be problematic to extract.
+
+ - Or an archive may be a virtual disk or some filesystem image(s) that would
+   typically need to be mounted to be accessed, and may require root access
+   and guesswork to find out which partition and filesystem are at play and
+   which driver to use.
+
+ In all these cases, ExtractCode will extract and try hard to do the right thing to
+ obtain the actual archived content when other tools may fail.
+
+ It can also recursively extract any type of nested archives-in-archives.
+
+ As a downside, the extracted content may not be exactly what would be expected
+ to use the contained files: for instance ... but this is perfectly OK for
+ file content analysis for software composition or forensic analysis.
+
+ Behind the scenes, ExtractCode uses multiple tools such as:
+
+ - the Python standard library,
+ - a custom ctypes binding to libarchive,
+ - the 7zip command line tool, and
+ - optionally libguestfs on Linux.
+
+ With these, it is possible to extract a large number of common and less common
+ archives and compressed file types. ExtractCode tries to extract things in the
+ same way on all supported OSes, including auto-renaming files that would have
+ invalid, non-extractible names on certain filesystems or when there are multiple
+ copies of the same path in a given archive (which is possible in a tar).
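
The auto-renaming idea can be illustrated with a minimal sketch. This is not ExtractCode's actual renaming code: the ``unique_paths`` helper and its ``_N`` suffix scheme are assumptions made for illustration only. Paths are compared case-insensitively so that the result would also be safe on a case-insensitive filesystem.

```python
import posixpath


def unique_paths(paths):
    """Return paths where later duplicates receive a numeric suffix.

    Hypothetical helper: names are compared in lowercase so that two paths
    differing only by case are treated as the same path.
    """
    seen = set()
    renamed = []
    for path in paths:
        base, ext = posixpath.splitext(path)
        candidate = path
        counter = 0
        # Try new suffixes until the lowercased name has not been used yet.
        while candidate.lower() in seen:
            counter += 1
            candidate = f"{base}_{counter}{ext}"
        seen.add(candidate.lower())
        renamed.append(candidate)
    return renamed


if __name__ == "__main__":
    print(unique_paths(["src/Makefile", "src/makefile", "src/Makefile"]))
```

With this scheme the three colliding entries above become ``src/Makefile``, ``src/makefile_1`` and ``src/Makefile_2``, so no archived file is silently hidden by a later one.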
+
+ The extraction is driven by a "voting" system that considers the file
+ extension(s) and name, the filetype and mimetype (using a ctypes binding to
+ libmagic) to select the most appropriate extractor or decompressor function.
+ It can handle multi-level archives such as tar.gz and can recursively extract
+ any nested archives.
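
A "voting" selection of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the ExtractCode implementation: the hint tables, the extractor names and the ``pick_extractor`` function are all hypothetical, and the filetype and mimetype strings are passed in directly instead of being detected with libmagic.

```python
from collections import Counter

# Hypothetical hint tables mapping evidence to an extractor name.
EXTENSION_HINTS = {".tar": "tar", ".zip": "zip", ".gz": "gzip"}
FILETYPE_HINTS = {"tar archive": "tar", "zip archive": "zip", "gzip compressed": "gzip"}
MIME_HINTS = {
    "application/x-tar": "tar",
    "application/zip": "zip",
    "application/gzip": "gzip",
}


def pick_extractor(filename, filetype, mimetype):
    """Return the extractor name with the most votes, or None if nothing votes."""
    votes = Counter()
    lowered_name = filename.lower()
    # One vote if the file name ends with a known extension.
    for extension, name in EXTENSION_HINTS.items():
        if lowered_name.endswith(extension):
            votes[name] += 1
    # One vote if the detected file type mentions a known format.
    for marker, name in FILETYPE_HINTS.items():
        if marker in filetype.lower():
            votes[name] += 1
    # One vote if the mime type is a known one.
    name = MIME_HINTS.get(mimetype.lower())
    if name:
        votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

In this sketch a file named ``data.zip`` whose content is detected as gzip gets one vote for ``zip`` and two for ``gzip``, so the content-based evidence outvotes the misleading extension.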

Visit https://aboutcode.org and https://github.com/nexB/ for support and download.

+
+ We run CI tests on:
+
+ - Azure pipelines https://dev.azure.com/nexB/extractcode/_build
+
+
+ Installation
+ ------------
+
+ To install this package with its full capabilities (where the binaries for
+ 7zip and libarchive are installed), use the `full` extra option::
+
+     pip install extractcode[full]
+
+ If you want to use the version of binaries (possibly) provided by your operating
+ system, use the `minimal` option::
+
+     pip install extractcode
+
+ In this case, you will need a working and compatible libarchive and 7zip,
+ installed and configured in one of these ways so that ExtractCode can
+ find them:
+
+ - **an extractcode-libarchive and extractcode-7z plugin**: See the standard ones at
+   https://github.com/nexB/scancode-plugins/tree/main/builtins
+   These can either bundle a libarchive library and a 7z executable or expose
+   system-installed ones.
+   A plugin does so by providing plugin entry points as ``scancode_location_provider``
+   for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
+   subclass with a ``get_locations()`` method that must return a mapping with
+   this key:
+
+   - 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
+
+   See for example:
+
+   - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
+   - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
+
+   And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
+   should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
+   method that must return a mapping with this key:
+
+   - 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
+
+   See for example:
+
+   - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
+   - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
+
+ - use **environment variables** to point to installed binaries:
+
+   - EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
+   - EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
+
+
+ - **a system-installed libarchive and 7zip executable** available in the system **PATH**.
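
The plugin option in the list above can be sketched as follows. This is a hedged illustration only: the ``LocationProviderPlugin`` class defined here is a stand-in so that the example runs on its own (a real plugin would subclass the base class provided by the plugin framework), and the ``SystemLibarchivePaths`` class, the ``my_plugin`` module name and the library path are all hypothetical.

```python
import os


class LocationProviderPlugin:
    """Stand-in for the real base class, defined here only for the sketch."""

    def get_locations(self):
        raise NotImplementedError


class SystemLibarchivePaths(LocationProviderPlugin):
    """Hypothetical plugin that exposes a system-installed libarchive."""

    # Assumed location of the shared library; adjust for the actual system.
    lib_path = "/usr/lib/x86_64-linux-gnu/libarchive.so.13"

    def get_locations(self):
        # The mapping key is the one named in the text above.
        return {"extractcode.libarchive.dll": os.path.abspath(self.lib_path)}


# Entry point declaration as it might appear in the plugin's setup.py,
# shown here as plain data. "my_plugin" is a hypothetical module name.
ENTRY_POINTS = {
    "scancode_location_provider": [
        "extractcode_libarchive = my_plugin:SystemLibarchivePaths",
    ],
}
```

The linked ``setup.py`` and ``__init__.py`` files of the standard plugins remain the reference for the real declarations.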
+
+
+ The supported versions of the binary tools are:
+
+ - libarchive 3.5.x
+ - 7zip 16.5.x
+
+
+ Development
+ -----------
+
To set up the development environment::

-     source configure
+     source configure --dev
+

To run unit tests::

    pytest -vvs -n 2

+
To clean up the development environment::

    ./configure --clean


+ To run the command line tool in the activated environment::
+
+     ./extractcode -h
+
+
+ Configuration with environment variables
+ ----------------------------------------
+
+ ExtractCode will use these environment variables if set:
+
+ - EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
+   shared library used to support some of the archive formats. If not provided,
+   ExtractCode will look for a plugin-provided libarchive library path. See
+   https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
+   If no plugin contributes libarchive, then a final attempt is made to look for
+   it in the PATH using standard DLL loading techniques.
+
+ - EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
+   some of the archive formats. If not provided, ExtractCode will look for a
+   plugin-provided 7z executable path. See
+   https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
+   If no plugin contributes 7z, then a final attempt is made to look for
+   it in the PATH.
+
+ - EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
+   libguestfs to use to extract VM images. If not provided, ExtractCode will look
+   in the PATH for an installed ``guestfish`` executable instead.
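
The lookup order described above for the 7z executable (environment variable first, then a plugin-provided path, then the PATH) can be sketched as below. This is only an illustration of that order under stated assumptions: the ``find_7z`` function and its parameters are hypothetical, and plugin discovery is replaced by a plain mapping passed in by the caller.

```python
import os
import shutil


def find_7z(plugin_locations=None, environ=None):
    """Return a path to 7z following the documented order, or None.

    Hypothetical helper: ``plugin_locations`` stands in for the mapping
    that a location provider plugin would return.
    """
    environ = os.environ if environ is None else environ
    plugin_locations = plugin_locations or {}

    # 1. An explicit environment variable wins.
    path = environ.get("EXTRACTCODE_7Z_PATH")
    if path:
        return path

    # 2. Otherwise use a plugin-provided location, if any.
    path = plugin_locations.get("extractcode.sevenzip.exe")
    if path:
        return path

    # 3. Finally, search the system PATH.
    return shutil.which("7z")
```

Passing ``environ`` explicitly keeps the function easy to test without modifying the real process environment.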
+
+
+
+ Adding support for VM images extraction
+ ---------------------------------------
+
+ Adding support for VM images requires the manual installation of the
+ libguestfs-tools system package. This is supported only on Linux.
+ On Debian and Ubuntu you can use this command::
+
+     sudo apt-get install libguestfs-tools
+
+
+ On Ubuntu only, an additional manual step is required because regular users
+ cannot read the kernel executable file, which libguestfs needs to read.
+
+ Run these commands as a temporary and immediate fix::
+
+     sudo chmod 0644 /boot/vmlinuz-*
+     for k in /boot/vmlinuz-*
+         do sudo dpkg-statoverride --add --update root root 0644 "$k"
+     done
+
+ You likely want both this temporary fix and a more permanent fix; otherwise each
+ kernel update will revert to the default permissions and ExtractCode will stop
+ working for VM images extraction.
+
+ Therefore follow these instructions:
+
+ 1. As root (using sudo), create the file /etc/kernel/postinst.d/statoverride with
+    this content, devised by Kees Cook (@kees) in
+    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::
+
+        #!/bin/sh
+        version="$1"
+        # passing the kernel version is required
+        [ -z "${version}" ] && exit 0
+        dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}
+
+ 2. Set executable permissions::
+
+        sudo chmod +x /etc/kernel/postinst.d/statoverride
+
+ See also these links for a complete discussion:
+
+ - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725
+ - https://bugzilla.redhat.com/show_bug.cgi?id=1670790
+ - https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
+
+
+ Alternatives
+ ------------
+
+ These other tools are related and were considered before creating ExtractCode.
+
+ These tools provide built-in, original extraction capabilities:
+
+ - https://libarchive.org/ (integrated in ExtractCode) (BSD license)
+ - https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
+ - https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
+
+ These tools are command line tools wrapping other extraction tools and are
+ similar to ExtractCode but with different goals:
+
+ - https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
+ - https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)