A benchmark for parsing, profiling, and analyzing semi-structured textual data. This repo collects 100+ open-source semi-structured datasets (TXT, LOG, CSV, JSON, XML, PHP, YAML, HMM, FASTQ, etc.), mostly from GitHub. To get more datasets (hundreds in total), visit the corresponding repositories or the links in the README.md of each directory.
This repo contains excerpted and modified versions of some of the original datasets. These excerpted and modified versions are freely available for research or academic work. For the original datasets collected by this repo, however, please comply with the corresponding source's license before use.
For any use or distribution of the datasets, please cite this repository's URL and our paper, StructVizor: Interactive Profiling of Semi-Structured Textual Data.