
Project Ideas Improve File Classification

Michael Herzog edited this page Feb 25, 2020 · 2 revisions

Improve file classification in ScanCode

ScanCode currently detects the programming language, file type, and MIME type of files, but this detection is not as accurate as it could be. We also need a better way to classify files for further automation, particularly to identify the likely "purpose" of a file: for example, to focus on source and binary files that represent code, as opposed to files that are documentation, scripts, etc. This is similar to the concept of "facets" in the ClearlyDefined project.

The first goal of this project is to improve the quality of detection of file characteristics, including programming language (which currently uses only Pygments) and Linux "magic" file type. The second goal is to create and implement a flexible framework of rules to automate assigning a "purpose" to files, possibly with machine learning.
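To make the second goal more concrete, here is a minimal sketch of what a rule-based "purpose" classifier could look like. It is not ScanCode code: the `PurposeRule` class, the `classify_purpose` function, the glob patterns, and the purpose labels are all hypothetical and chosen only for illustration. It assumes that rules are ordered glob patterns matched against the file path, and that the first matching rule wins.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PurposeRule:
    """Hypothetical rule: a glob pattern mapped to a purpose label."""
    pattern: str
    purpose: str


# Illustrative rules only (assumption, not ScanCode's actual rules).
# Order matters: the first matching rule wins, so directory-based
# rules come before extension-based ones.
RULES = [
    PurposeRule("docs/*", "documentation"),
    PurposeRule("*/docs/*", "documentation"),
    PurposeRule("*.md", "documentation"),
    PurposeRule("*.rst", "documentation"),
    PurposeRule("*.sh", "script"),
    PurposeRule("*.py", "code"),
    PurposeRule("*.c", "code"),
    PurposeRule("*.java", "code"),
    PurposeRule("*.so", "code"),
    PurposeRule("*.dll", "code"),
]


def classify_purpose(path, rules=RULES, default="unknown"):
    """Return the purpose of the first rule whose pattern matches the path."""
    # Normalize to a lowercase POSIX-style path so matching is predictable.
    normalized = PurePosixPath(path).as_posix().lower()
    for rule in rules:
        if fnmatchcase(normalized, rule.pattern):
            return rule.purpose
    return default


if __name__ == "__main__":
    for sample in ("README.md", "src/main.c", "build.sh", "docs/conf.py"):
        print(sample, "->", classify_purpose(sample))
```

A real framework would probably also match on the detected programming language, file type, and MIME type rather than on paths alone, and a machine learning model could replace or complement the hand-written rules.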
