Semi-structured Dataset Collection

A benchmark for parsing, profiling, and analyzing semi-structured textual data. This repo collects 100+ open-source semi-structured datasets (TXT, LOG, CSV, JSON, XML, PHP, YAML, HMM, FASTQ, etc.), mostly from GitHub. To get more datasets (hundreds in total), visit the corresponding source repositories or the links in the README.md of each directory.
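To get a quick overview of which formats a local clone actually contains, the files can be tallied by extension. The following is a minimal sketch using only the Python standard library; the function name `count_by_extension` is hypothetical and not part of this repo, and it assumes the repository has been cloned to a local directory.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under `root` by lowercase extension.

    Hypothetical helper for illustration; not shipped with this repo.
    Files inside a `.git` directory are skipped, and files without an
    extension are counted under the key "(none)".
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        # Look only at the part of the path below `root`.
        if ".git" in path.relative_to(root).parts:
            continue
        if not path.is_file():
            continue
        counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_by_extension(".")` from the root of a clone returns a `Counter`, so `.most_common()` lists the most frequent extensions first.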

This repo contains excerpted and modified versions of some original datasets. These excerpted and modified versions are freely available for research or academic work. For the original datasets collected by this repo, however, please comply with the license of the corresponding source before use.

For any use or distribution of the datasets, please reference this repository's URL and our paper, StructVizor: Interactive Profiling of Semi-Structured Textual Data.

Folders and files

2017QUT_S7comm
50BusinessAssignmentsLog
IMDB_data_analysis
ISO-3166-Countries-with-Regional-Codes
LifeLog-DiaLog
Logo-2k-plus-Dataset
MagicLamp
PADS
RICE-for-CK3
Scottish-Parliament
UserClustering
WikiTableQuestions
WineReviewAnalysis
bd_districts_statistics_dataset
car-logos-dataset
container-escape-dataset
currency-data
fifa18-all-player-statistics
global-hr-update-po
govtrack
gws_dataset
jVectorMap-BD-districts
loghub-2.0
loghub
o365_dataset
testdata
zenodo_Traffic-and-Log-Data-Captured-During-a-Cyber-Defense-Exercise
LICENSE
README.md

Amur-N/Semi-structured-Dataset-Collection
