Commit 3519dd7

add magika
1 parent 266d856 commit 3519dd7

File tree

9 files changed

+341
-4
lines changed

Chapter5/machine_learning.ipynb

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -2547,7 +2547,7 @@
25472547
"id": "d50ebd2f",
25482548
"metadata": {},
25492549
"source": [
2550-
"In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models with minimal code."
2550+
"In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models in 3 lines of code."
25512551
]
25522552
},
25532553
{

Chapter6/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
examples

Chapter6/workflow_automation.ipynb

Lines changed: 127 additions & 0 deletions
@@ -1779,6 +1779,133 @@
17791779
"source": [
17801780
"main()"
17811781
]
1782+
},
1783+
{
1784+
"cell_type": "markdown",
1785+
"id": "cadcd0c5",
1786+
"metadata": {},
1787+
"source": [
1788+
"### Magika: Detect File Content Types with Deep Learning"
1789+
]
1790+
},
1791+
{
1792+
"cell_type": "code",
1793+
"execution_count": null,
1794+
"id": "fd791a58",
1795+
"metadata": {
1796+
"tags": [
1797+
"hide-cell"
1798+
]
1799+
},
1800+
"outputs": [],
1801+
"source": [
1802+
"!pip install magika"
1803+
]
1804+
},
1805+
{
1806+
"cell_type": "markdown",
1807+
"id": "a4382593",
1808+
"metadata": {},
1809+
"source": [
1810+
"Detecting file types helps identify malicious files disguised with false extensions, such as a .jpg that is actually malware.\n",
1811+
"\n",
1812+
"Magika, Google's AI-powered file type detection tool, uses deep learning for precise detection. In the following code, files have misleading extensions, but Magika still accurately detects their correct types."
1813+
]
1814+
},
1815+
{
1816+
"cell_type": "code",
1817+
"execution_count": 36,
1818+
"id": "a9fed27a",
1819+
"metadata": {},
1820+
"outputs": [
1821+
{
1822+
"name": "stdout",
1823+
"output_type": "stream",
1824+
"text": [
1825+
"Created 7 files in the 'examples' directory.\n"
1826+
]
1827+
}
1828+
],
1829+
"source": [
1830+
"from pathlib import Path\n",
1831+
"import shutil\n",
1832+
"\n",
1833+
"# Define the directory where files will be created\n",
1834+
"directory = Path(\"examples\")\n",
1835+
"\n",
1836+
"# Ensure the directory exists\n",
1837+
"directory.mkdir(exist_ok=True)\n",
1838+
"\n",
1839+
"# Empty the directory if it is not empty\n",
1840+
"for item in directory.iterdir():\n",
1841+
"    if item.is_dir():\n",
1842+
"        shutil.rmtree(item)\n",
1843+
"    else:\n",
1844+
"        item.unlink()\n",
1845+
"\n",
1846+
"# Define the filenames and their respective content\n",
1847+
"files = [\n",
1848+
"    (\"plain_text.csv\", \"This is a plain text file.\"),\n",
1849+
"    (\"csv.json\", \"id,name,age\\n1,John Doe,30\"),\n",
1850+
"    (\"json.xml\", '{\"name\": \"John\", \"age\": 30}'),\n",
1851+
"    (\"markdown.js\", \"# Heading 1\\nSome text.\"),\n",
1852+
"    (\"python.ini\", 'print(\"Hello, World!\")'),\n",
1853+
"    (\"js.yml\", 'console.log(\"Hello, World!\");'),\n",
1854+
"    (\"yml.js\", \"name: John\\nage: 30\"),\n",
1855+
"]\n",
1856+
"\n",
1857+
"# Create each file with the specified content\n",
1858+
"for filename, content in files:\n",
1859+
"    (directory / filename).write_text(content)\n",
1860+
"\n",
1861+
"print(f\"Created {len(files)} files in the '{directory}' directory.\")"
1862+
]
1863+
},
1864+
{
1865+
"cell_type": "markdown",
1866+
"id": "cacd7497",
1867+
"metadata": {},
1868+
"source": [
1869+
"```bash\n",
1870+
"$ magika -r examples\n",
1871+
"```"
1872+
]
1873+
},
1874+
{
1875+
"cell_type": "code",
1876+
"execution_count": 37,
1877+
"id": "de0d10ab",
1878+
"metadata": {
1879+
"tags": [
1880+
"remove-input"
1881+
]
1882+
},
1883+
"outputs": [
1884+
{
1885+
"name": "stdout",
1886+
"output_type": "stream",
1887+
"text": [
1888+
"\u001b[1;34mexamples/csv.json: CSV document (code)\u001b[0;39m\n",
1889+
"\u001b[1;34mexamples/js.yml: JavaScript source (code)\u001b[0;39m\n",
1890+
"\u001b[1;34mexamples/json.xml: JSON document (code)\u001b[0;39m\n",
1891+
"\u001b[1;37mexamples/markdown.js: Markdown document (text)\u001b[0;39m\n",
1892+
"\u001b[1;37mexamples/plain_text.csv: Generic text document (text)\u001b[0;39m\n",
1893+
"\u001b[1;34mexamples/python.ini: Python source (code)\u001b[0;39m\n",
1894+
"\u001b[1;34mexamples/yml.js: YAML source (code)\u001b[0;39m\n"
1895+
]
1896+
}
1897+
],
1898+
"source": [
1899+
"!magika -r examples"
1900+
]
1901+
},
1902+
{
1903+
"cell_type": "markdown",
1904+
"id": "badeabd1",
1905+
"metadata": {},
1906+
"source": [
1907+
"[Link to Magika](https://bit.ly/45tdw5O)."
1908+
]
17821909
}
17831910
],
17841911
"metadata": {

docs/Chapter5/machine_learning.html

Lines changed: 1 addition & 1 deletion
@@ -1981,7 +1981,7 @@ <h2><span class="section-number">6.5.18. </span>AutoGluon: Fast and Accurate ML
19811981
<span class="n">grid_search</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
19821982
</pre></div>
19831983
</div>
1984-
<p>In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models with minimal code.</p>
1984+
<p>In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models in 3 lines of code.</p>
19851985
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">autogluon.tabular</span> <span class="kn">import</span> <span class="n">TabularPredictor</span>
19861986

19871987
<span class="n">predictor</span> <span class="o">=</span> <span class="n">TabularPredictor</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s2">&quot;class&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_data</span><span class="p">)</span>

docs/Chapter6/Chapter6.html

Lines changed: 1 addition & 0 deletions
@@ -234,6 +234,7 @@
234234
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/dataclasses.html">3.7. Data Classes</a></li>
235235
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/typing.html">3.8. Typing</a></li>
236236
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pathlib.html">3.9. pathlib</a></li>
237+
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pydantic.html">3.10. Pydantic</a></li>
237238
</ul>
238239
</li>
239240
<li class="toctree-l1 has-children"><a class="reference internal" href="../Chapter3/Chapter3.html">4. Pandas</a><input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-4"><i class="fa-solid fa-chevron-down"></i></label><ul>

docs/Chapter6/workflow_automation.html

Lines changed: 81 additions & 0 deletions
@@ -234,6 +234,7 @@
234234
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/dataclasses.html">3.7. Data Classes</a></li>
235235
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/typing.html">3.8. Typing</a></li>
236236
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pathlib.html">3.9. pathlib</a></li>
237+
<li class="toctree-l2"><a class="reference internal" href="../Chapter2/pydantic.html">3.10. Pydantic</a></li>
237238
</ul>
238239
</li>
239240
<li class="toctree-l1 has-children"><a class="reference internal" href="../Chapter3/Chapter3.html">4. Pandas</a><input class="toctree-checkbox" id="toctree-checkbox-4" name="toctree-checkbox-4" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-4"><i class="fa-solid fa-chevron-down"></i></label><ul>
@@ -527,6 +528,7 @@ <h2> Contents </h2>
527528
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#pytube-a-lightweight-python-library-for-downloading-youtube-videos">7.2.13. PyTube: A Lightweight Python Library for Downloading YouTube Videos</a></li>
528529
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#limit-the-execution-time-of-a-function-call-with-prefect">7.2.14. Limit the Execution Time of a Function Call with Prefect</a></li>
529530
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#retry-on-failure-with-prefect">7.2.15. Retry on Failure with Prefect</a></li>
531+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#magika-detect-file-content-types-with-deep-learning">7.2.16. Magika: Detect File Content Types with Deep Learning</a></li>
530532
</ul>
531533
</nav>
532534
</div>
@@ -1575,6 +1577,84 @@ <h2><span class="section-number">7.2.15. </span>Retry on Failure with Prefect<a
15751577
</details>
15761578
</div>
15771579
</section>
1580+
<section id="magika-detect-file-content-types-with-deep-learning">
1581+
<h2><span class="section-number">7.2.16. </span>Magika: Detect File Content Types with Deep Learning<a class="headerlink" href="#magika-detect-file-content-types-with-deep-learning" title="Permalink to this heading">#</a></h2>
1582+
<div class="cell tag_hide-cell docutils container">
1583+
<details class="hide above-input">
1584+
<summary aria-label="Toggle hidden content">
1585+
<span class="collapsed">Show code cell content</span>
1586+
<span class="expanded">Hide code cell content</span>
1587+
</summary>
1588+
<div class="cell_input docutils container">
1589+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>magika
1590+
</pre></div>
1591+
</div>
1592+
</div>
1593+
</details>
1594+
</div>
1595+
<p>Detecting file types helps identify malicious files disguised with false extensions, such as a .jpg that is actually malware.</p>
1596+
<p>Magika, Google’s AI-powered file type detection tool, uses deep learning for precise detection. In the following code, files have misleading extensions, but Magika still accurately detects their correct types.</p>
1597+
<div class="cell docutils container">
1598+
<div class="cell_input docutils container">
1599+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
1600+
<span class="kn">import</span> <span class="nn">shutil</span>
1601+
1602+
<span class="c1"># Define the directory where files will be created</span>
1603+
<span class="n">directory</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&quot;examples&quot;</span><span class="p">)</span>
1604+
1605+
<span class="c1"># Ensure the directory exists</span>
1606+
<span class="n">directory</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
1607+
1608+
<span class="c1"># Empty the directory if it is not empty</span>
1609+
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">directory</span><span class="o">.</span><span class="n">iterdir</span><span class="p">():</span>
1610+
    <span class="k">if</span> <span class="n">item</span><span class="o">.</span><span class="n">is_dir</span><span class="p">():</span>
1611+
        <span class="n">shutil</span><span class="o">.</span><span class="n">rmtree</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
1612+
    <span class="k">else</span><span class="p">:</span>
1613+
        <span class="n">item</span><span class="o">.</span><span class="n">unlink</span><span class="p">()</span>
1614+
1615+
<span class="c1"># Define the filenames and their respective content</span>
1616+
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span>
1617+
    <span class="p">(</span><span class="s2">&quot;plain_text.csv&quot;</span><span class="p">,</span> <span class="s2">&quot;This is a plain text file.&quot;</span><span class="p">),</span>
1618+
    <span class="p">(</span><span class="s2">&quot;csv.json&quot;</span><span class="p">,</span> <span class="s2">&quot;id,name,age</span><span class="se">\n</span><span class="s2">1,John Doe,30&quot;</span><span class="p">),</span>
1619+
    <span class="p">(</span><span class="s2">&quot;json.xml&quot;</span><span class="p">,</span> <span class="s1">&#39;{&quot;name&quot;: &quot;John&quot;, &quot;age&quot;: 30}&#39;</span><span class="p">),</span>
1620+
    <span class="p">(</span><span class="s2">&quot;markdown.js&quot;</span><span class="p">,</span> <span class="s2">&quot;# Heading 1</span><span class="se">\n</span><span class="s2">Some text.&quot;</span><span class="p">),</span>
1621+
    <span class="p">(</span><span class="s2">&quot;python.ini&quot;</span><span class="p">,</span> <span class="s1">&#39;print(&quot;Hello, World!&quot;)&#39;</span><span class="p">),</span>
1622+
    <span class="p">(</span><span class="s2">&quot;js.yml&quot;</span><span class="p">,</span> <span class="s1">&#39;console.log(&quot;Hello, World!&quot;);&#39;</span><span class="p">),</span>
1623+
    <span class="p">(</span><span class="s2">&quot;yml.js&quot;</span><span class="p">,</span> <span class="s2">&quot;name: John</span><span class="se">\n</span><span class="s2">age: 30&quot;</span><span class="p">),</span>
1624+
<span class="p">]</span>
1625+
1626+
<span class="c1"># Create each file with the specified content</span>
1627+
<span class="k">for</span> <span class="n">filename</span><span class="p">,</span> <span class="n">content</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
1628+
    <span class="p">(</span><span class="n">directory</span> <span class="o">/</span> <span class="n">filename</span><span class="p">)</span><span class="o">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
1629+
1630+
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Created </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">files</span><span class="p">)</span><span class="si">}</span><span class="s2"> files in the &#39;</span><span class="si">{</span><span class="n">directory</span><span class="si">}</span><span class="s2">&#39; directory.&quot;</span><span class="p">)</span>
1631+
</pre></div>
1632+
</div>
1633+
</div>
1634+
<div class="cell_output docutils container">
1635+
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Created 7 files in the &#39;examples&#39; directory.
1636+
</pre></div>
1637+
</div>
1638+
</div>
1639+
</div>
1640+
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span>magika<span class="w"> </span>-r<span class="w"> </span>examples
1641+
</pre></div>
1642+
</div>
1643+
<div class="cell tag_remove-input docutils container">
1644+
<div class="cell_output docutils container">
1645+
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span><span class=" -Color -Color-Bold -Color-Bold-Blue">examples/csv.json: CSV document (code)</span>
1646+
<span class=" -Color -Color-Bold -Color-Bold-Blue">examples/js.yml: JavaScript source (code)</span>
1647+
<span class=" -Color -Color-Bold -Color-Bold-Blue">examples/json.xml: JSON document (code)</span>
1648+
<span class=" -Color -Color-Bold -Color-Bold-White">examples/markdown.js: Markdown document (text)</span>
1649+
<span class=" -Color -Color-Bold -Color-Bold-White">examples/plain_text.csv: Generic text document (text)</span>
1650+
<span class=" -Color -Color-Bold -Color-Bold-Blue">examples/python.ini: Python source (code)</span>
1651+
<span class=" -Color -Color-Bold -Color-Bold-Blue">examples/yml.js: YAML source (code)</span>
1652+
</pre></div>
1653+
</div>
1654+
</div>
1655+
</div>
1656+
<p><a class="reference external" href="https://bit.ly/45tdw5O">Link to Magika</a>.</p>
1657+
</section>
15781658
</section>
15791659

15801660
<script type="text/x-thebe-config">
@@ -1655,6 +1735,7 @@ <h2><span class="section-number">7.2.15. </span>Retry on Failure with Prefect<a
16551735
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#pytube-a-lightweight-python-library-for-downloading-youtube-videos">7.2.13. PyTube: A Lightweight Python Library for Downloading YouTube Videos</a></li>
16561736
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#limit-the-execution-time-of-a-function-call-with-prefect">7.2.14. Limit the Execution Time of a Function Call with Prefect</a></li>
16571737
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#retry-on-failure-with-prefect">7.2.15. Retry on Failure with Prefect</a></li>
1738+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#magika-detect-file-content-types-with-deep-learning">7.2.16. Magika: Detect File Content Types with Deep Learning</a></li>
16581739
</ul>
16591740
</nav></div>
16601741

docs/_sources/Chapter5/machine_learning.ipynb

Lines changed: 1 addition & 1 deletion
@@ -2547,7 +2547,7 @@
25472547
"id": "d50ebd2f",
25482548
"metadata": {},
25492549
"source": [
2550-
"In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models with minimal code."
2550+
"In contrast, AutoGluon automates these tasks, allowing you to train and deploy accurate models in 3 lines of code."
25512551
]
25522552
},
25532553
{
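
Note on the Magika cell added in Chapter6/workflow_automation.ipynb: the reason a misleading extension matters can be sketched with the standard library alone. The snippet below is not how Magika works (Magika uses a deep learning model). It only contrasts an extension-based guess with a very naive content check for the `json.xml` example from the notebook. The helper names `guess_from_extension` and `looks_like_json` are hypothetical and introduced here for illustration.

```python
import json
import mimetypes


def guess_from_extension(filename: str) -> str:
    """Return the MIME type implied by the file extension alone."""
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type or "unknown"


def looks_like_json(content: str) -> bool:
    """Return True if the content parses as a JSON object or array.

    This is a deliberately naive content check, not Magika's method.
    """
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))


# Same filename and content as the "json.xml" entry in the notebook cell
filename = "json.xml"
content = '{"name": "John", "age": 30}'

print(f"Extension suggests: {guess_from_extension(filename)}")
print(f"Content parses as JSON: {looks_like_json(content)}")
```

The extension-based guess only ever sees the name, so it reports an XML type for a file whose content is JSON. A content-based detector such as Magika inspects the bytes instead, which is why the notebook output labels `examples/json.xml` as a JSON document.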
