
GutBrainIE @ BioASQ2025

GutBrainIE: Official Page


0. Place the raw datasets provided by the organisers in data/raw

NER

1. Set Configs

Edit your configuration files in conf/conf_gutbrain
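The train and inference commands below use Hydra-style "+exp=..." overrides, so the configuration files are assumed to be Hydra/OmegaConf YAML files. A minimal sketch for inspecting a config before launching a run (the file name exp/train.yaml is an assumption):

# Sketch: print the resolved settings of a config before launching a run.
# Assumes Hydra/OmegaConf YAML configs; the exact file name is hypothetical.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/conf_gutbrain/exp/train.yaml")  # hypothetical config path
print(OmegaConf.to_yaml(cfg))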

2. Preprocess Data

Run the following scripts to prepare the dataset:


# For train/dev set
python src/ner/utils/get_bio2.py  
python src/ner/utils/preprocess.py

# For test set
python src/ner/utils/create_dummy_annot_test.py
python src/ner/utils/get_bio2.py  
python src/ner/utils/preprocess.py

Processed data is saved to: data/preprocessed
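The name get_bio2.py suggests the annotations are converted into token-level BIO2 tags. A minimal sketch of that idea, assuming whitespace tokenization, (start, end, label) character spans, and illustrative label names (the actual script may differ):

# Sketch: convert character-level entity spans to BIO2 tags.
# Assumes whitespace tokenization and spans given as (start, end, label);
# the real get_bio2.py may tokenize and name labels differently.
def to_bio2(text, spans):
    tags, offset = [], 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append((token, tag))
    return tags

print(to_bio2("Gut microbiota affects anxiety", [(0, 14, "microbiome"), (23, 30, "DDF")]))
# [('Gut', 'B-microbiome'), ('microbiota', 'I-microbiome'), ('affects', 'O'), ('anxiety', 'B-DDF')]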

3. Run train/inference

# Train
python src/ner/train/main.py +exp=train  

# Run prediction on dev set
python src/ner/train/main.py +exp=predict

# Run inference on test set
python src/ner/train/test.py +exp=ner/test model.model_name_or_path={trained_model_path}

Trained models are saved in output/ner/train_res
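If the trained checkpoint is a Hugging Face token-classification model (an assumption about the repo's setup), it can be sanity-checked before postprocessing:

# Sketch: quick sanity check of a trained NER checkpoint with the transformers pipeline.
# Assumes a Hugging Face token-classification model; the directory name is a placeholder.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="output/ner/train_res/your_run",  # placeholder checkpoint path
    aggregation_strategy="simple",
)
print(ner("Lactobacillus supplementation reduced anxiety-like behaviour in mice."))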

4. Post-process for eval

python src/utils/postprocess.py

# For submission format, see:
eval/NER_get_devset_submission_file_and_evaluate.ipynb
eval/NER_get_testset_submission_file.ipynb
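Postprocessing maps token-level predictions back to span-based entities for the submission files. A minimal sketch of merging BIO2 tags into character spans, assuming each prediction carries its character offsets (the actual postprocess.py and submission schema may differ):

# Sketch: merge BIO2-tagged tokens back into character-level entity spans.
# Assumes predictions of the form (token, tag, start, end); the real
# postprocess.py and the submission schema may differ.
def bio2_to_spans(predictions):
    spans, current = [], None
    for token, tag, start, end in predictions:
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"label": tag[2:], "start": start, "end": end}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["end"] = end
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

preds = [("Gut", "B-microbiome", 0, 3), ("microbiota", "I-microbiome", 4, 14), ("affects", "O", 15, 22)]
print(bio2_to_spans(preds))  # [{'label': 'microbiome', 'start': 0, 'end': 14}]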

RE

1. Set Configs

Edit your configuration files in conf/conf_gutbrain

2. Preprocess Data

Run the following scripts to prepare the dataset:

# For train/dev set
python src/re/utils/preprocess_w_negatives.py

# For test set
python src/re/utils/preprocess_w_negatives_testset.py

Processed data is saved to: data/preprocessed
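The "_w_negatives" scripts suggest that co-occurring entity pairs without a gold relation are kept as negative examples. A minimal sketch of that idea (field names and the "no_relation" label are assumptions; the actual scripts may sample or filter pairs differently):

# Sketch: build positive and negative relation candidates for one document.
# Assumes entities as (id, label) and gold relations as (subject_id, object_id, predicate);
# the real preprocess_w_negatives.py may filter pairs by entity type or distance.
from itertools import permutations

def build_candidates(entities, gold_relations):
    gold = {(s, o): p for s, o, p in gold_relations}
    examples = []
    for (subj_id, subj_label), (obj_id, obj_label) in permutations(entities, 2):
        examples.append({
            "subj": subj_id, "subj_label": subj_label,
            "obj": obj_id, "obj_label": obj_label,
            "label": gold.get((subj_id, obj_id), "no_relation"),  # negatives get "no_relation"
        })
    return examples

entities = [("e1", "bacteria"), ("e2", "DDF"), ("e3", "chemical")]
gold_relations = [("e1", "e2", "influence")]
print(build_candidates(entities, gold_relations))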

3. Run train/inference

# Train
python src/re/train/main.py +exp=train  

# Run prediction on dev set
python src/re/train/test.py +exp=predict

# Run inference on test set
python src/re/train/test.py +exp=test

Trained models are saved in output/re/train_res
Prediction results are saved in submission format in output/re/test_res
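If the RE model is a Hugging Face sequence classifier over entity-marked sentences (an assumption about the repo's setup), scoring one candidate pair might look like this:

# Sketch: score one candidate entity pair with a fine-tuned sequence classifier.
# The entity markers and the checkpoint path are assumptions about the actual setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "output/re/train_res/your_run"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

text = "[SUBJ] Lactobacillus [/SUBJ] supplementation reduced [OBJ] anxiety [/OBJ] in mice."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])  # e.g. "influence" or "no_relation"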


RE Baselines

1. Simple baseline based on train corpus stats

The train corpus stats are generated in ./00_Data_Overview.ipynb.

We extract relation and co-occurrence statistics to support relation prediction:

  • frequency: Number of times a (subject_label, object_label) pair appears in labeled relations.
  • cooccurrence_frequency: Number of times the same pair co-occurs in entity annotations (regardless of relation).
  • relation_likelihood: Computed as frequency / cooccurrence_frequency, estimating the probability of a relation when the pair co-occurs.
  • avg_char_distance, median_distance, min/max_distance_percentile: Character-based distance metrics between subject and object spans.
  • predicate_counts: Counts of different predicates assigned to the pair across annotations.
  • annotators: List of annotators who labeled the relation, used to filter weak or distant-only annotations.

These statistics are saved in ./data/ds_stats/train_binary_rel_stats.csv.

The prediction code based on them is in ./src/baseline_RE.py; a short sketch of the idea follows below.
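A minimal sketch of how these statistics can drive the baseline: predict the most frequent predicate for a (subject_label, object_label) pair whenever its relation_likelihood clears a threshold (the 0.5 threshold, the column access, and the tie-breaking are assumptions about what baseline_RE.py actually does):

# Sketch: co-occurrence baseline driven by train_binary_rel_stats.csv.
# The threshold and the "most frequent predicate" rule are assumptions
# about baseline_RE.py; column names follow the list above.
import ast
import pandas as pd

stats = pd.read_csv("data/ds_stats/train_binary_rel_stats.csv")

def predict_relation(subject_label, object_label, threshold=0.5):
    row = stats[(stats["subject_label"] == subject_label) &
                (stats["object_label"] == object_label)]
    if row.empty or row.iloc[0]["relation_likelihood"] < threshold:
        return None  # pair is rarely (or never) related in the train corpus
    predicate_counts = ast.literal_eval(row.iloc[0]["predicate_counts"])  # assumes a dict stored as text
    return max(predicate_counts, key=predicate_counts.get)  # most frequent predicate

print(predict_relation("bacteria", "DDF"))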

2. REBEL

Experiments with https://github.com/Babelscape/rebel.

2.1. Model fine-tuning

New files were added and several existing REBEL files were adapted to include the new relation and entity types (see the rebel_RE directory).

Run on server: ./rebel_RE/src/train_job.sh.

2.2. Model predictions and eval

Run on server: ./rebel_RE/src/test_job.sh.

This saves the predictions to a file such as ./rebel_RE/predictions/preds_gutbrainie.jsonl.

This output can then be converted into the challenge format for evaluation via ./rebel_RE/predictions/convert_to_format.py.
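REBEL generates linearized triplets marked with <triplet>, <subj>, and <obj> special tokens. A minimal sketch of parsing one generated string into (head, relation, tail) triples, roughly the first step such a conversion needs (simplified to one relation per <triplet> marker; the relation name shown is a placeholder for the fine-tuned labels):

# Sketch: parse REBEL's linearized output into (head, relation, tail) triples.
# Simplified to one relation per <triplet> marker; the real convert_to_format.py
# must also map the triples onto the challenge's submission schema.
def parse_rebel(generated):
    triples = []
    for chunk in generated.split("<triplet>"):
        if "<subj>" not in chunk or "<obj>" not in chunk:
            continue
        head, rest = chunk.split("<subj>", 1)
        tail, relation = rest.split("<obj>", 1)
        triples.append((head.strip(), relation.strip(), tail.strip()))
    return triples

print(parse_rebel("<triplet> Lactobacillus <subj> anxiety <obj> influence"))
# [('Lactobacillus', 'influence', 'anxiety')]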
