Tools for TreeOfLife dataset

This repository contains the tools used to create the TreeOfLife-200M dataset. They are built on distributed-downloader, which was used to download all the images. Step-by-step instructions for downloading all of the images are provided in docs/README.

Installation Instructions

Currently, only the portion of this package that is required for downloading all the images in the TreeOfLife-200M dataset is installable. Our other processing code, which is not required to download a copy of the dataset, is provided and described further in the processing/ directory.¹

Pip installation

Install Python 3.10 or 3.11
Install MPI. Any implementation should work; the toolbox has been tested with OpenMPI and IntelMPI. Installation instructions can be found on the official websites:
- OpenMPI
- IntelMPI
Install the required package:
- For development:
```
pip install -e .[dev]
```

Scripts creation

After installation, you need to create scripts for the tools. Follow the instructions here.

Currently, the following tools are available

column_name_change - changes the names of columns in the dataset
column_name_change_lila_fix - changes the names of columns in the dataset (was created to fix a bug in the Lila BC dataset)
data_merging - used to filter out duplicated data in freshly downloaded datasets from existing ones (deduplication based on hashsum)
data_transfer - transfers data from one downloaded dataset to TreeOfLife dataset
eol_rename - was used to change source_id from EOL content ID to "EOL content ID_ EOL page ID" (the change was discarded later)
fathom_net_crop - used to crop FathomNet images to the bounding box sizes
fathom_net_crop_fix - used to crop FathomNet images to the bounding box sizes (was created to fix a bug in the FathomNet dataset)
filter_out_by_uuid - tool to filter out data using a table of TreeOfLife UUIDs
lila_bc_filtering - tool for filtering Lila BC dataset (based on some processed csv table)
lila_extra_noaa_processing - tool for processing Lila Extra NOAA dataset in TreeOfLife format
lila_separation_multilable_filtering - tool to extract multi-label data from the Lila BC dataset and duplicate images for each label
lila_separation_single_label_filtering - tool to extract single label data from Lila BC dataset
mam_ansp_fix - tool to fix a bug in the mam ansp server (GBIF source)
research_filtering - tool to filter out data from TreeOfLife datasets
transfer_and_type_change - tool to transfer data (and change types) from one place to another on research storage (it transfers only 10 TB per day, so as not to overload the backup system)
tol200m_bioscan_data_tranfer - tool to transfer data from Bioscan dataset to TreeOfLife dataset
tol200m_fathom_net_crop - tool to crop FathomNet images to the bounding box sizes for the TreeOfLife dataset
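
The hashsum-based deduplication mentioned for data_merging can be illustrated with a minimal sketch. This is not the toolbox's implementation: the function names, the use of SHA-256, and the in-memory dictionaries are assumptions made purely for illustration.

```python
import hashlib


def hashsum(image_bytes: bytes) -> str:
    """Return a hex digest of the image bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_known_images(new_images: dict[str, bytes], existing_hashes: set[str]) -> dict[str, bytes]:
    """Keep only new images whose hashsum is absent from the existing dataset."""
    seen = set(existing_hashes)  # copy so the caller's set is not modified
    kept = {}
    for uuid, data in new_images.items():
        digest = hashsum(data)
        if digest in seen:
            continue  # duplicate of an existing (or earlier new) image
        seen.add(digest)
        kept[uuid] = data
    return kept


# Hypothetical data: "a" duplicates an existing image, "c" duplicates "b".
existing = {hashsum(b"cat")}
fresh = {"a": b"cat", "b": b"dog", "c": b"dog"}
print(sorted(drop_known_images(fresh, existing)))
```

Only the entry "b" survives: "a" matches an image already in the existing dataset, and "c" repeats "b" within the new download.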

How to use the tools

To use the tools, you will need to create a config.yaml file; its schema can be found in example.yaml.

To run the tool, use the following command:

tree_of_life_toolbox <config_path> <tool_name> [--OPTIONS]

<config_path> - path to the config.yaml file (either absolute or relative)
<tool_name> - name of the tool to run
[--OPTIONS] - optional arguments for the tool:
- --reset_filtering - a full reset. It resets the first step of the tool (filtering); because all subsequent steps depend on the filtering step, they are reset as well
- --reset_scheduling - resets the scheduling step (useful when you want to change the number of runners/nodes per runner)
- --reset_runners - resets the runners, meaning they will start from scratch
- --tool_name_override - used to disable the tool name check
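
When the toolbox is driven from another Python script, the command line above can be assembled programmatically. The sketch below only builds the argument list from the options documented in this README. The helper name build_command and the example config path and tool choice are illustrative, and it assumes the tree_of_life_toolbox entry point is available on PATH.

```python
import shlex

# Options listed in this README; this helper rejects anything else.
KNOWN_FLAGS = {"reset_filtering", "reset_scheduling", "reset_runners", "tool_name_override"}


def build_command(config_path: str, tool_name: str, *flags: str) -> list[str]:
    """Build argv for: tree_of_life_toolbox <config_path> <tool_name> [--OPTIONS]."""
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return ["tree_of_life_toolbox", config_path, tool_name, *(f"--{flag}" for flag in flags)]


cmd = build_command("config.yaml", "data_merging", "reset_scheduling")
print(shlex.join(cmd))
# To actually launch the tool (requires the package to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when the config path contains spaces.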

How to create a new tool

To create a new tool, you will need to create a new folder in src/TreeOfLife_toolbox/ and add the following files:

__init__.py - empty file
classes.py - file with the classes for the tool

In classes.py you will need to:

Create a class for each step of the tool (filtering, scheduling, runner). Make sure that the class inherits from the base class for the step and that class names are unique.
Register the classes with their respective registry (FilterRegistery, SchedulerRegistry, RunnerRegistry) using the register decorator.
Add the tool folder to the __init__.py file in the src/TreeOfLife_toolbox/ folder.
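
The register-decorator pattern described in these steps can be sketched without the toolbox itself. Everything in this block is a stand-in: ToolRegistry, the signature of its register method, and the two example classes are hypothetical and only mimic the idea of mapping a tool name to a step class. The real registries and base classes live in src/TreeOfLife_toolbox/ and may differ.

```python
class ToolRegistry:
    """Hypothetical stand-in for a step registry: maps tool names to classes."""

    def __init__(self) -> None:
        self._tools: dict[str, type] = {}

    def register(self, name: str):
        """Return a class decorator that stores the class under `name`."""
        def decorator(cls: type) -> type:
            if name in self._tools:
                raise KeyError(f"tool {name!r} is already registered")
            self._tools[name] = cls
            return cls
        return decorator

    def get(self, name: str) -> type:
        """Look up the class registered for `name`."""
        return self._tools[name]


# One registry per step, mirroring the filter/scheduler/runner split.
filter_registry = ToolRegistry()
runner_registry = ToolRegistry()


@filter_registry.register("my_new_tool")
class MyNewToolFilter:  # in the toolbox this would inherit from a filtering base class
    def run(self) -> str:
        return "filtering"


@runner_registry.register("my_new_tool")
class MyNewToolRunner:  # in the toolbox this would inherit from a runner base class
    def run(self) -> str:
        return "running"


print(filter_registry.get("my_new_tool")().run())
```

Registering every step under the same tool name is what lets a single <tool_name> argument select the matching filter, scheduler, and runner classes in a design like this.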

The following base classes are available:

filtering step:
- FilterToolBase - bare minimum class for the filtering step
- SparkFilterToolBase - base class for the filtering step using Spark. It automatically creates a Spark session and has some additional methods for working with Spark
- PythonFilterToolBase - base class for the filtering step using Python. It can automatically traverse the *downloaded* dataset
scheduling step:
- SchedulerToolBase - bare minimum class for the scheduling step
- DefaultScheduler - base class for the scheduling step. It can perform "standard" scheduling for the runners; you will need to specify the schema for it.
runner step:
- RunnerToolBase - bare minimum class for the runner step
- MPIRunnerTool - base class for the MPI-based runner step. It can automatically initialize the MPI environment, read the schedule, and call the apply_filter method sequentially on each chunk from the schedule. You will need to implement the apply_filter method in your class.
- FilterRunnerTool - inherits from MPIRunnerTool and can perform "standard" filtering based on UUIDs. Works only with downloaded dataset schema.
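
The split of responsibilities described for MPIRunnerTool, where the base class walks the schedule and the subclass supplies apply_filter, is a template-method pattern. The sketch below imitates it in plain Python with no MPI involved. SequentialRunnerSketch, the schedule layout (a list of chunk identifiers), and the apply_filter signature are assumptions for illustration, not the toolbox's actual interface.

```python
from abc import ABC, abstractmethod


class SequentialRunnerSketch(ABC):
    """Stand-in base class: iterates over scheduled chunks and delegates the work."""

    def __init__(self, schedule: list[str]) -> None:
        self.schedule = schedule
        self.completed: list[str] = []

    def run(self) -> None:
        for chunk in self.schedule:
            # Process chunks one after another; record those that succeeded.
            if self.apply_filter(chunk):
                self.completed.append(chunk)

    @abstractmethod
    def apply_filter(self, chunk: str) -> bool:
        """Process one chunk; return True on success."""


class SkipEmptyChunks(SequentialRunnerSketch):
    """Example subclass: only non-blank chunk identifiers count as processed."""

    def apply_filter(self, chunk: str) -> bool:
        return bool(chunk.strip())


runner = SkipEmptyChunks(["chunk-0", "  ", "chunk-1"])
runner.run()
print(runner.completed)
```

The subclass contains only the per-chunk logic, which is the same reason the README asks you to implement just apply_filter when inheriting from MPIRunnerTool.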

Recommended Citation

If using the TreeOfLife-200M dataset, please cite this repo, the dataset, and our paper.

@software{Kopanev_TreeOfLife-toolbox_2025,
  author = {Kopanev, Andrei and Zhang, Net and Gu, Jianyang and Stevens, Samuel and Thompson, Matthew J and Campolongo, Elizabeth G},
  license = {MIT},
  month = may,
  title = {{TreeOfLife-toolbox}},
  url = {https://github.com/Imageomics/TreeOfLife-toolbox},
  version = {0.2.0-beta},
  year = {2025}
}

@dataset{treeoflife_200m,
  title = {{T}ree{O}f{L}ife-200{M}}, 
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  url = {https://huggingface.co/datasets/imageomics/TreeOfLife-200M},
  doi = {},
  publisher = {Hugging Face}
}

@article{gu2025bioclip,
  title = {{B}io{CLIP} 2: Emergent Properties from Scaling Hierarchical Contrastive Learning}, 
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  eprint={2505.23883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23883},
}

Also consider citing GBIF, BIOSCAN-5M, EOL, and FathomNet:

@misc{GBIF,
  title = {{GBIF} Occurrence Download},
  author = {GBIF.org},
  doi = {10.15468/DL.BFV433},
  url = {https://doi.org/10.15468/dl.bfv433},
  keywords = {GBIF, biodiversity, species occurrences},
  publisher = {The Global Biodiversity Information Facility},
  month = {May},
  year = {2024},
  copyright = {Creative Commons Attribution Non Commercial 4.0 International}
}

@inproceedings{gharaee2024bioscan5m,
    title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
    author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias
        and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum
        and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor
        and Paul Fieguth and Angel X. Chang
    },
    booktitle={NeurIPS},
    editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
    pages={36285--36313},
    publisher={Curran Associates, Inc.},
    year={2024},
    volume={37},
}

@misc{eol,
  author = {{Encyclopedia of Life (EOL)}},
  url = {https://eol.org},
  note = {Accessed August 2024}
}

@article{katija_fathomnet_2022,
	title = {{FathomNet}: {A} global image database for enabling artificial intelligence in the ocean},
  author = {Katija, Kakani and Orenstein, Eric and Schlining, Brian and Lundsten, Lonny and Barnard, Kevin and Sainz, Giovanna and Boulais, Oceane and Cromwell, Megan and Butler, Erin and Woodward, Benjamin and Bell, Katherine L. C.},
	journal = {Scientific Reports},
	volume = {12},
	number = {1},
	pages = {15914},
	issn = {2045-2322},
	shorttitle = {{FathomNet}},
	url = {https://www.nature.com/articles/s41598-022-19939-2},
	doi = {10.1038/s41598-022-19939-2},
	month = sep,
	year = {2022},
}

¹ This processing code will be reworked into installable modules as appropriate over the coming months.
