| 1 | +<h1 align="center"> |
| 2 | + <b>datasets-knowledge-embedding</b> |
| 3 | +</h1> |
| 4 | +<p align="center"> |
| 5 | + <!-- License --> |
| 6 | + <a href="https://github.com/simonepri/datasets-knowledge-embedding/tree/master/license"> |
| 7 | + <img src="https://img.shields.io/github/license/simonepri/datasets-knowledge-embedding.svg" alt="Project license" /> |
| 8 | + </a> |
| 9 | +</p> |
| 10 | +<p align="center"> |
| 11 | + 📝 A collection of common datasets used in knowledge embedding |
| 12 | +</p> |
| 13 | + |
| 14 | + |
| 15 | +## Datasets |
| 16 | + |
| 17 | +This project collects datasets used in papers on knowledge embedding. |
| 18 | +It also standardizes the format of these datasets, making it easier to use them in the evaluation of new works. |
| 19 | + |
| 20 | +The datasets can be downloaded from the [release page][release]. |
| 21 | +For licensing information, please refer to the original dataset license file. |
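A minimal sketch of fetching one archive from the command line, assuming `curl` and `tar` are installed. The base URL is the one used by the download links in this README; the helper names `dataset_url` and `fetch_dataset` are hypothetical, and the layout of the files inside each archive is not described here, so inspect the extracted directory before relying on it.

```bash
# Base URL taken from the download links in this README.
RELEASE_BASE="https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download"

# Print the archive URL for a dataset name (e.g. WN18RR).
# Pass "id" as the second argument to get the "-ID" archive instead.
dataset_url() {
  name="$1"
  variant="${2:-}"
  if [ "$variant" = "id" ]; then
    name="${name}-ID"
  fi
  printf '%s/%s.tgz\n' "$RELEASE_BASE" "$name"
}

# Download an archive and unpack it into a target directory (default: current directory).
fetch_dataset() {
  dest="${3:-.}"
  mkdir -p "$dest"
  curl -fsSL "$(dataset_url "$1" "${2:-}")" | tar -xz -C "$dest"
}

# Example (needs network access):
# fetch_dataset WN18RR "" ./data
```

Keeping the URL construction in its own function makes it easy to print and check a link before downloading anything.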
| 22 | + |
| 23 | + |
| 24 | +### COUNTRIES-S1 |
| 25 | +This dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257). |
| 26 | +The link to the original dataset as released by the authors is unknown, but a copy was taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries). |
| 27 | + |
| 28 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 29 | +|----------|----------------|-------|-------------|------------------|------------| |
| 30 | +| 271 | 2 | 1159 | 1111 | 24 | 24 | |
| 31 | + |
| 32 | +[COUNTRIES-S1.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S1.tgz) [COUNTRIES-S1-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S1-ID.tgz) |
| 35 | + |
| 36 | + |
| 37 | +### COUNTRIES-S2 |
| 38 | +This dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257). |
| 39 | +The link to the original dataset as released by the authors is unknown, but a copy was taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries). |
| 40 | + |
| 41 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 42 | +|----------|----------------|-------|-------------|------------------|------------| |
| 43 | +| 271 | 2 | 1111 | 1063 | 24 | 24 | |
| 44 | + |
| 45 | +[COUNTRIES-S2.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S2.tgz) [COUNTRIES-S2-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S2-ID.tgz) |
| 48 | + |
| 49 | +### COUNTRIES-S3 |
| 50 | +This dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257). |
| 51 | +The link to the original dataset as released by the authors is unknown, but a copy was taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries). |
| 52 | + |
| 53 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 54 | +|----------|----------------|-------|-------------|------------------|------------| |
| 55 | +| 271 | 2 | 1033 | 985 | 24 | 24 | |
| 56 | + |
| 57 | +[COUNTRIES-S3.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S3.tgz) [COUNTRIES-S3-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S3-ID.tgz) |
| 60 | + |
| 61 | +### FB15K |
| 62 | +This dataset was introduced in [Translating Embeddings for Modeling Multi-relational Data](https://dl.acm.org/doi/10.5555/2999792.2999923). |
| 63 | +The original dataset as released by the authors is available [here](https://everest.hds.utc.fr/doku.php?id=en:transe). |
| 64 | + |
| 65 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 66 | +|----------|----------------|-------|-------------|------------------|------------| |
| 67 | +| 14951 | 1345 | 592213 | 483142 | 50000 | 59071 | |
| 68 | + |
| 69 | +[FB15K.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K.tgz) [FB15K-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-ID.tgz) |
| 72 | + |
| 73 | +### FB15K-237 |
| 74 | +This dataset was introduced in [Observed versus latent features for knowledge base and text inference](https://www.aclweb.org/anthology/W15-4007/). |
| 75 | +The original dataset as released by the authors is available [here](https://www.microsoft.com/en-us/download/details.aspx?id=52312). |
| 76 | + |
| 77 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 78 | +|----------|----------------|-------|-------------|------------------|------------| |
| 79 | +| 14541 | 237 | 310116 | 272115 | 17535 | 20466 | |
| 80 | + |
| 81 | +[FB15K-237.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-237.tgz) [FB15K-237-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-237-ID.tgz) |
| 84 | + |
| 85 | +### KINSHIP |
| 86 | +This dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600). |
| 87 | +The original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html). |
| 88 | + |
| 89 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 90 | +|----------|----------------|-------|-------------|------------------|------------| |
| 91 | +| 104 | 25 | 10686 | 8544 | 1068 | 1074 | |
| 92 | + |
| 93 | +[KINSHIP.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/KINSHIP.tgz) [KINSHIP-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/KINSHIP-ID.tgz) |
| 96 | + |
| 97 | +### NATIONS |
| 98 | +This dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600). |
| 99 | +The original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html). |
| 100 | + |
| 101 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 102 | +|----------|----------------|-------|-------------|------------------|------------| |
| 103 | +| 14 | 55 | 1992 | 1592 | 199 | 201 | |
| 104 | + |
| 105 | +[NATIONS.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/NATIONS.tgz) [NATIONS-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/NATIONS-ID.tgz) |
| 108 | + |
| 109 | +### UMLS |
| 110 | +This dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600). |
| 111 | +The original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html). |
| 112 | + |
| 113 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 114 | +|----------|----------------|-------|-------------|------------------|------------| |
| 115 | +| 135 | 46 | 6529 | 5216 | 652 | 661 | |
| 116 | + |
| 117 | +[UMLS.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/UMLS.tgz) [UMLS-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/UMLS-ID.tgz) |
| 120 | + |
| 121 | +### WN18 |
| 122 | +This dataset was introduced in [Translating Embeddings for Modeling Multi-relational Data](https://dl.acm.org/doi/10.5555/2999792.2999923). |
| 123 | +The original dataset as released by the authors is available [here](https://everest.hds.utc.fr/doku.php?id=en:transe). |
| 124 | + |
| 125 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 126 | +|----------|----------------|-------|-------------|------------------|------------| |
| 127 | +| 41105 | 18 | 151442 | 141442 | 5000 | 5000 | |
| 128 | + |
| 129 | +[WN18.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18.tgz) [WN18-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18-ID.tgz) |
| 132 | + |
| 133 | +### WN18RR |
| 134 | +This dataset was introduced in [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476). |
| 135 | +The original dataset as released by the authors is available [here](https://github.com/TimDettmers/ConvE). |
| 136 | + |
| 137 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 138 | +|----------|----------------|-------|-------------|------------------|------------| |
| 139 | +| 41105 | 11 | 93003 | 86835 | 3034 | 3134 | |
| 140 | + |
| 141 | +[WN18RR.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18RR.tgz) [WN18RR-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18RR-ID.tgz) |
| 144 | + |
| 145 | +### YAGO3-10 |
| 146 | +This dataset was introduced in [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476). |
| 147 | +The original dataset as released by the authors is available [here](https://github.com/TimDettmers/ConvE). |
| 148 | + |
| 149 | +| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges | |
| 150 | +|----------|----------------|-------|-------------|------------------|------------| |
| 151 | +| 123182 | 37 | 1089040 | 1079040 | 5000 | 5000 | |
| 152 | + |
| 153 | +[YAGO3-10.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/YAGO3-10.tgz) [YAGO3-10-ID.tgz](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/YAGO3-10-ID.tgz) |
| 156 | + |
| 157 | + |
| 158 | +## Add a new dataset |
| 159 | + |
| 160 | +To add a new dataset to this collection, first create three files named `train.tsv`, `valid.tsv`, and `test.tsv`, containing the edges of the train, validation, and test splits, respectively. |
| 161 | +The files must contain tab-separated triples of the form `(head entity, relation, tail entity)`. |
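As an illustration of the expected layout, the sketch below writes three toy split files into a temporary directory and checks that every line has exactly three tab-separated fields. The triples and the `check_triples` helper are hypothetical and are not part of this repository; the one-triple-per-line layout is also an assumption, since the format description does not spell it out.

```bash
# Hypothetical toy triples, only to illustrate the expected layout.
workdir="$(mktemp -d)"
printf 'alice\tparent_of\tbob\nbob\tsibling_of\tcarol\n' > "$workdir/train.tsv"
printf 'alice\tparent_of\tcarol\n' > "$workdir/valid.tsv"
printf 'carol\tsibling_of\tbob\n' > "$workdir/test.tsv"

# Succeed only if every line of the given file has exactly three
# tab-separated fields; report offending lines on stderr.
check_triples() {
  awk -F '\t' '
    NF != 3 { printf "%s:%d: expected 3 fields, found %d\n", FILENAME, FNR, NF; bad = 1 }
    END { exit bad }
  ' "$1" >&2
}

for split in train valid test; do
  check_triples "$workdir/$split.tsv" && echo "$split.tsv: ok"
done
```

Running a check like this before the build step catches stray spaces used in place of tabs, which are easy to miss by eye.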
| 162 | + |
| 163 | +Once you have done this, process the three files with the following bash script. |
| 164 | + |
| 165 | +```bash |
| 166 | +bash build.sh train.tsv valid.tsv test.tsv . |
| 167 | +``` |
| 168 | + |
| 169 | +The script uses the [datasets-knowledge-embedding][github:simonepri/datasets-knowledge-embedding] tool under the hood. |
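To fill in a statistics table like the ones in the Datasets section, the counts can be derived directly from the three split files. The sketch below is one way to do that and is not necessarily how the numbers in this README were produced: it assumes entities are the distinct head and tail values across all files passed in, relation types are the distinct values of the middle column, and edges are counted per line. The `dataset_stats` helper is hypothetical.

```bash
# Print entity, relation-type, and edge counts for the given TSV files.
# Assumes one tab-separated (head, relation, tail) triple per line.
dataset_stats() {
  awk -F '\t' '
    { entities[$1] = 1; entities[$3] = 1; relations[$2] = 1; edges++ }
    END {
      ne = 0; for (e in entities) ne++
      nr = 0; for (r in relations) nr++
      printf "entities=%d relation_types=%d edges=%d\n", ne, nr, edges
    }
  ' "$@"
}

# Example:
# dataset_stats train.tsv valid.tsv test.tsv   # totals over all splits
# dataset_stats train.tsv                      # train split only
```

Calling the function once per split gives the per-split edge counts, and calling it on all three files gives the totals.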
| 170 | + |
| 171 | + |
| 172 | +## Authors |
| 173 | + |
| 174 | +- **Simone Primarosa** - [simonepri][github:simonepri] |
| 175 | + |
| 176 | +See also the list of [contributors][contributors] who participated in this project. |
| 177 | + |
| 178 | + |
| 179 | +## License |
| 180 | + |
| 181 | +This project is licensed under the MIT License - see the [license][license] file for details. |
| 182 | + |
| 183 | +<!-- Links --> |
| 184 | +[license]: https://github.com/simonepri/datasets-knowledge-embedding/tree/master/license |
| 185 | +[contributors]: https://github.com/simonepri/datasets-knowledge-embedding/contributors |
| 186 | +[release]: https://github.com/simonepri/datasets-knowledge-embedding/releases/latest |
| 187 | + |
| 188 | +[github:simonepri]: https://github.com/simonepri |
| 189 | + |
| 190 | +[github:simonepri/datasets-knowledge-embedding]: https://github.com/simonepri/datasets-knowledge-embedding |