gbifxdl

GBIF eXtreme downloader (gbifxdl) is a Python tool for downloading large image datasets from the GBIF website.

gbifxdl works in two steps:

  • The first step takes a GBIF predicate as input (see the GBIF documentation and the example here) and outputs a download link to an occurrence dataset, generated by GBIF and stored in Darwin Core Archive format. This step requires a GBIF account. Generating the occurrence file on the GBIF side can take anywhere from a few minutes to a few hours. If properly configured, GBIF should notify the user once the file is ready for download.
  • Once the download is completed, which can be checked here, the second step takes the previous download link as input, downloads the occurrence file, downloads the images, and postprocesses the resulting metadata.

Warning: this package is still under active development and may undergo major updates in the future.

Installation

Source installation only for now.

TL;DR:

git clone git@github.com:GuillaumeMougeot/gbifxdl.git
cd gbifxdl
pip install -e .

Step-by-step:

  • Install Python >= 3.10 or use Anaconda.
  • Download the gbifxdl repository or clone it with git clone git@github.com:GuillaumeMougeot/gbifxdl.git. Unzip the downloaded .zip if needed.
  • Open a terminal/command prompt where Python can be executed (check by running the python command in that terminal) and navigate to the downloaded folder with cd gbifxdl.
  • (Optional) Create a virtual environment with either python -m venv venv-name or conda create -n venv-name.
  • Run pip install -e . (don't forget the . at the end of the command).
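
Once installed, you can check that the package is importable with a minimal sketch like the one below (it assumes nothing beyond a successful installation):

# Minimal installation check: import the package and print its location.
import gbifxdl
print(gbifxdl.__file__)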

Usage

For now, the package only provides a Python API (no command-line interface).

For the first step, post your GBIF predicate with the following command:

from gbifxdl import post

payload_path = "payload_traits.json"
pwd = "your_gbif_password"
download_key = post(payload_path, pwd=pwd, wait=False)

# Optionally save the download key to a local file.
download_key_path = "download_key.txt"
with open(download_key_path, "w") as file:
    file.write(download_key)

Note: if you don't have a predicate yet but you do have a list of species, genera, families, or any other taxa, you can use this online lookup tool to get their GBIF IDs. Then look at the payload/predicate template and edit the list of values under the "TAXON_KEY" field.

Warning: this first step requires a GBIF account. Create an account, put your user ID in the .json payload file that you will send to GBIF, and pass your GBIF password to the post function (be careful not to share your password publicly).
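
If you prefer to build the payload from Python rather than by hand, the sketch below shows what such a payload could look like. It is only an illustration based on the public GBIF occurrence download API: the user name, notification address, and taxon keys are placeholders, and the exact fields should follow the payload/predicate template mentioned above.

import json

# Illustrative payload sketch based on the GBIF occurrence download API.
# The creator, notification address, and taxon keys below are placeholders.
payload = {
    "creator": "your_gbif_username",
    "notificationAddresses": ["you@example.org"],
    "sendNotification": True,
    "format": "DWCA",
    "predicate": {
        "type": "and",
        "predicates": [
            # Keep only occurrences that have images attached
            {"type": "equals", "key": "MEDIA_TYPE", "value": "StillImage"},
            # GBIF taxon keys, e.g. obtained from the online lookup tool
            {"type": "in", "key": "TAXON_KEY", "values": ["1234567", "7654321"]},
        ],
    },
}

with open("payload_traits.json", "w") as file:
    json.dump(payload, file, indent=2)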

For the second step, download the occurrences file with:

from gbifxdl import poll_status, download_occurrences

download_key_path = "download_key.txt"
dataset_dir = "path/to/dataset/folder"

# Read the download key saved during the first step
with open(download_key_path) as file:
    download_key = file.read().strip()

# Poll the status of the download request and wait if it is not ready yet
status = poll_status(download_key)

# Download the GBIF file
if status == 'succeeded':
    download_path = download_occurrences(
        download_key=download_key,
        dataset_dir=dataset_dir,
        file_format='dwca',
    )
else:
    print(f"Download failed because status is {status}.")
    download_path = None
    exit()

Preprocess the occurrence file and download the images with:

from gbifxdl import preprocess_occurrences_stream, AsyncImagePipeline

download_path = "path/to/occurrence_file"
images_dir = "path/to/output/folder"

# Preprocess the occurrence file
preprocessed_path = preprocess_occurrences_stream(
    dwca_path=download_path,
    max_img_spc=500,  # Maximum number of images per species
)

# Define an asynchronous routine to download images in parallel
downloader = AsyncImagePipeline(
    parquet_path=preprocessed_path,
    output_dir=images_dir,
)
downloader.run()
images_metadata_path = downloader.metadata_file
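
Before postprocessing, it can be useful to take a quick look at the metadata the downloader produced. The sketch below assumes the metadata file is a Parquet file, as suggested by the postprocessing step that follows; the exact column names depend on gbifxdl.

import pandas as pd

# Quick sanity check of the image metadata produced by the downloader.
# Assumes a Parquet metadata file; actual column names depend on gbifxdl.
metadata = pd.read_parquet(images_metadata_path)
print(metadata.head())
print(f"{len(metadata)} image records")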

Postprocess the images and their metadata with:

from gbifxdl import postprocess

images_metadata_path = "path/to/metadata.parquet"
images_dir = "path/to/output/folder"

postprocess(
    parquet_path=images_metadata_path,
    img_dir=images_dir,
)

The scripts above are used in practice in the usecases folder.

For more detailed examples, look at the examples folder.

Contributing

This repo welcomes external contributions!

If you find an issue, feel free to open it here.

If you would like to contribute to the code, feel free to send a pull request. Currently, most of this package's code lives in a single script, gbifxdl/src/gbifxdl.py.

For any other request, don't hesitate to reach out by sending me an email.

Many thanks to anyone interested in this work.

TODO

  • Integrate resizing in the pipeline.
  • Deep learning processing during postprocessing.

Acknowledgement

This work was inspired by the amazing work done in gbif-dl and in ami-ml.
