Compilation of Datasets #109

afiaka87 · 2021-03-19T20:18:25Z

We'll need lots of data to train dalle-pytorch to the level OpenAI has with DALLE. If you find any new or interesting datasets that are either captioned or could have captions generated for them using class-name, etc. then please post here and I'll update the list:

We'll download these with aria2c, gdown and wget (for the WIT links). Make sure to:
apt install wget aria2
python3 -m pip install gdown

COCO 2014 Resized to 256x256

edit: If someone could rehost this for me I'd appreciate it. That drive account isn't exactly production ready.
gdown "https://drive.google.com/uc?id=1d7_N0Uxf4xYSSS-VcIVt4lBXFsUIbjvP"

Visual Genome

aria2c https://academictorrents.com/download/1bfe6871046860a2ff8c0cc1414318beb35dc916.torrent;

imagenet

aria2c https://academictorrents.com/download/96816a530ee002254d29bf7a61c0c158d3dedc3b.torrent;

STL-10

aria2c https://academictorrents.com/download/a799a2845ac29a66c07cf74e2a2838b6c5698a6a.torrent;

food-101

aria2c https://academictorrents.com/download/470791483f8441764d3b01dbc4d22b3aa58ef46f.torrent;

indoor CVPR

aria2c https://academictorrents.com/download/59aa0ad684e5d849f68bad9a6d43a9000a927164.torrent;

SVHN

aria2c https://academictorrents.com/download/6f4caf3c24803d114c3cae3ab9cb946cd23c7213.torrent;

OpenImagesV6 (only downloads the 256 px versions)

aria2c --bt-metadata-only=true --bt-save-metadata=true https://academictorrents.com/download/9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;
aria2c --show-files /workspace/downsampled-open-images-v4-9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;
aria2c --select-file=9,11,15 /workspace/downsampled-open-images-v4-9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;

WIT

Here are the links to download the 10 files.

wit_v1.train.all-00000-of-00010.tsv.gz

wit_v1.train.all-00001-of-00010.tsv.gz

wit_v1.train.all-00002-of-00010.tsv.gz

wit_v1.train.all-00003-of-00010.tsv.gz

wit_v1.train.all-00004-of-00010.tsv.gz

wit_v1.train.all-00005-of-00010.tsv.gz

wit_v1.train.all-00006-of-00010.tsv.gz

wit_v1.train.all-00007-of-00010.tsv.gz

wit_v1.train.all-00008-of-00010.tsv.gz

wit_v1.train.all-00009-of-00010.tsv.gz
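
The ten shard names follow a fixed pattern, so they can be generated with a loop instead of being typed out. This is only a sketch: `WIT_BASE_URL` is a placeholder variable I made up, since the actual link targets are not shown in this post. Set it to wherever the WIT shards are hosted before downloading; without it the script just prints the names.

```shell
#!/bin/sh
# Print the ten WIT training shard names listed above.
wit_shard_names() {
    i=0
    while [ "$i" -lt 10 ]; do
        printf 'wit_v1.train.all-%05d-of-00010.tsv.gz\n' "$i"
        i=$((i + 1))
    done
}

# Download only when WIT_BASE_URL is set (hypothetical variable, not from the post);
# otherwise just list the shard names.
if [ -n "${WIT_BASE_URL:-}" ]; then
    wit_shard_names | while read -r name; do
        wget -c "$WIT_BASE_URL/$name"   # -c resumes partially downloaded files
    done
else
    wit_shard_names
fi
```

The `-c` flag is there because the shards are large and an interrupted download can then be resumed rather than restarted.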

sorrge · 2021-03-19T23:06:19Z

Do you think that object detection datasets with a single label can be useful for DALL-E? It seems that the information content is quite low; but it could still learn the object names and various views and their typical backgrounds.
Along this line of thought, perhaps even images without any labels can still be useful, just to train the image completion abilities.
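
To make the single-label idea concrete, here is a rough sketch of turning class names into captions. It assumes a hypothetical layout with one directory per class (`<root>/<class_name>/<image>.jpg`) and writes a `.txt` caption next to each image. The "a photo of a ..." template and the function name are just examples, not something from this thread.

```shell
#!/bin/sh
# Write "<image>.txt" captions derived from the parent directory (class) name.
# Assumed layout (hypothetical): <root>/<class_name>/<image>.jpg
make_class_captions() {
    root=$1
    for class_dir in "$root"/*/; do
        [ -d "$class_dir" ] || continue
        # Turn a directory name such as "red_fox" into "red fox".
        class_name=$(basename "$class_dir" | tr '_' ' ')
        for img in "$class_dir"*.jpg; do
            [ -f "$img" ] || continue
            # Caption file shares the image's name, with a .txt extension.
            printf 'a photo of a %s\n' "$class_name" > "${img%.jpg}.txt"
        done
    done
}

# Usage: make_class_captions /path/to/dataset
```

Captions produced this way carry no more information than the label itself, so this mostly helps with getting a single-label dataset into an image/caption format.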

0 replies

robvanvolt · 2021-03-20T11:22:03Z

YFCC100m could also be added to the list: #110 (comment), as well as the Conceptual Captions dataset, consisting of 3,318,333 image/caption pairs: https://ai.google.com/research/ConceptualCaptions/

11 replies

robvanvolt Mar 20, 2021

It would be misleading for Google to only relate its license agreement to the URLs and not the images themselves:

We make available Conceptual Captions, a new dataset consisting of ~3.3M images annotated with captions.

They clearly mention the images in the beginning, and not just the URLs! :) But you are right, it's good to always double check! :)

sorrge Mar 20, 2021

The license is for what you download. Of course, Google doesn't give you the copyright for the random collection of images from the Internet. Even the Google Open Images dataset has a disclaimer that they don't guarantee the copyright, although there they at least tried to select the images with permissive licenses.

afiaka87 Mar 20, 2021
Author

Sooo... You're saying I don't have a right to distribute the images? I think I'm misunderstanding because it sounds like you've contradicted yourself.

sorrge Mar 20, 2021

In my understanding, no one has the right to redistribute these images bundled together. Downloading and training on them is fair use.
There are special datasets, e.g. Google Open Images, where they only select images with permissive licenses. Those you can redistribute.

I don't think it's going to be a problem, though. For example, you can find full ImageNet on Academic Torrents, and nobody takes it down.

afiaka87 Mar 22, 2021
Author

@sorrge this is very helpful to me. Someone else gave me a completely different answer and this stuff is important to me. I think I would prefer the protection that a larger institution like academictorrents can provide. There's also just the nature of torrents in general (list of IP addresses to where the data might be, not necessarily a dedicated host) that makes me feel better about this in general.

Unfortunately, I'm not an academic anymore. I haven't been in Uni for like ten years and they require an @edu or similar email address in order to host a torrent there.

I'm still very much in the process of creating a torrent, btw. But if I get it figured out, I would love to host there. I'm assuming at least one of you has an edu address to contact them with ha.

robvanvolt · 2021-03-20T16:36:54Z

@afiaka87 good point with the copyright issue - I will always also add the license agreements for new datasets I find:

2.5 million images from 205 scene categories under the Creative Commons license
http://places.csail.mit.edu/downloadData.html

One million labeled images for each of 10 scene categories and 20 object categories (maybe a little bit too reductionist). I didn't find any information on license agreements apart from "If you find LSUN dataset useful in your research, please consider citing"...
https://github.com/fyu/lsun

2,686,419 AI-generated faces, e.g. "A white male with long black hair facing left" or "A black woman facing right with short white hair."

All images can be used for any purpose without worrying about copyrights, distribution rights, infringement claims, or royalties.

Bulk download request under: work.with@generated.photos
https://generated.photos/

8,456,240 (or 6,464,018 cleaned) images of 94,682 celebrities under GNU General Public License v3.0
https://github.com/EB-Dodo/C-MS-Celeb
https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech&hit=1&filelist=1

1 reply

afiaka87 Mar 20, 2021
Author

Thanks, it's best to at least check ourselves on this once in a while. I'm not an expert but this is all research and I'm pretty sure that makes it fair use. If someone knows more, please chime in.

ieee8023 · 2021-04-05T14:49:20Z

Here is the resized coco dataset as a torrent: https://academictorrents.com/details/eea5a532dd69de7ff93d5d9c579eac55a41cb700

0 replies

rom1504 · 2021-06-05T14:53:39Z

https://github.com/rom1504/kaggle-fashion-dalle/releases/tag/1.0.0 contains a version of https://www.kaggle.com/paramaggarwal/fashion-product-images-dataset preprocessed for dalle:
40k images with captions
fashion products

1 reply

afiaka87 Jun 5, 2021
Author

Thanks @rom1504 - going to add these to my training session now.
