GutBrainIE: Official Page
Edit your configuration files in conf/conf_gutbrain
Run the following scripts to prepare the dataset:
# For train/dev set
python src/ner/utils/get_bio2.py
python src/ner/utils/preprocess.py
# For test set
python src/ner/utils/create_dummy_annot_test.py
python src/ner/utils/get_bio2.py
python src/ner/utils/preprocess.py
Processed data is saved to: data/preprocessed
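The exact behavior of `get_bio2.py` is repo-specific, but the core idea is converting character-span entity annotations into token-level BIO2 tags. A minimal sketch, assuming whitespace tokenization and `(start, end, label)` span tuples (the real script may use a different tokenizer and annotation schema):

```python
def to_bio2(text, entities):
    """Convert character-span entity annotations to BIO2 tags.

    `entities` is a list of (start, end, label) tuples; tokenization is
    simple whitespace splitting (an assumption -- the repo's script may
    tokenize differently).
    """
    pairs = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for e_start, e_end, label in entities:
            if start >= e_start and end <= e_end:
                # B- on the first token of the span, I- on the rest
                tag = ("B-" if start == e_start else "I-") + label
                break
        pairs.append((token, tag))
    return pairs
```

Entity labels used here (e.g. `microbiome`) are illustrative placeholders, not necessarily the challenge's label set.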
# Train
python src/ner/train/main.py +exp=train
# Run prediction on dev set
python src/ner/train/main.py +exp=predict
# Run inference on test set
python src/ner/train/test.py +exp=ner/test model.model_name_or_path={trained_model_path}
Trained models are saved in output/ner/train_res
python src/utils/postprocess.py
# For submission format, see:
eval/NER_get_devset_submission_file_and_evaluate.ipynb
eval/NER_get_testset_submission_file.ipynb
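Building a submission presumably requires mapping token-level BIO predictions back to character-offset entity spans, which is the usual job of a postprocessing step. A minimal sketch, assuming per-token `(char_start, char_end)` offsets are available (a hypothetical interface; the repo's `postprocess.py` may differ):

```python
def bio_to_spans(tags, offsets):
    """Group BIO2 tags back into (start, end, label) entity spans.

    `offsets` holds one (char_start, char_end) pair per token, aligned
    with `tags`.
    """
    spans = []
    current = None  # open span as (start, end, label)
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (start, end, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            # extend the open span to cover this token
            current = (current[0], end, current[2])
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```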
Edit your configuration files in conf/conf_gutbrain
Run the following scripts to prepare the dataset:
# For train/dev set
python src/re/utils/preprocess_w_negatives.py
# For test set
python src/re/utils/preprocess_w_negatives_testset.py
Processed data is saved to: data/preprocessed
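As the script names suggest, RE preprocessing adds negative examples alongside the gold relations. One common recipe, sketched below, treats every ordered pair of co-occurring entities without a labeled relation as a negative (a simplification; the real script may subsample negatives or restrict candidate pairs):

```python
from itertools import permutations

def build_re_examples(entities, relations):
    """Build binary RE examples with negatives for one document.

    `entities` is a list of entity ids; `relations` is a set of
    (subj, obj) pairs carrying a gold relation.  Every other ordered
    pair of distinct entities becomes a negative example.
    """
    examples = []
    for subj, obj in permutations(entities, 2):
        label = 1 if (subj, obj) in relations else 0
        examples.append((subj, obj, label))
    return examples
```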
# Train
python src/re/train/main.py +exp=train
# Prediction on dev/test set
python src/re/train/test.py +exp=predict
python src/re/train/test.py +exp=test
Trained models are saved in output/re/train_res
Prediction results are saved in submission format in output/re/test_res
Training-corpus statistics are generated in ./00_Data_Overview.ipynb.
We extract relation and co-occurrence statistics to support relation prediction:
frequency
: Number of times a (subject_label, object_label) pair appears in labeled relations.

cooccurrence_frequency
: Number of times the same pair co-occurs in entity annotations (regardless of relation).

relation_likelihood
: Computed as frequency / cooccurrence_frequency, estimating the probability of a relation when the pair co-occurs.

avg_char_distance, median_distance, min/max_distance_percentile
: Character-based distance metrics between subject and object spans.

predicate_counts
: Counts of different predicates assigned to the pair across annotations.

annotators
: List of annotators who labeled the relation, used to filter weak or distant-only annotations.
These statistics are saved in ./data/ds_stats/train_binary_rel_stats.csv.
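The frequency-based statistics above can be computed roughly as follows. A minimal sketch over label pairs (the label names used in the test data are illustrative, and the real notebook also computes the distance and annotator fields):

```python
from collections import Counter

def pair_stats(relation_pairs, cooccurring_pairs):
    """Compute frequency, cooccurrence_frequency and relation_likelihood
    per (subject_label, object_label) pair.

    `relation_pairs` lists the label pair of every gold relation;
    `cooccurring_pairs` lists every pair that co-occurs in entity
    annotations, regardless of whether a relation was labeled.
    """
    frequency = Counter(relation_pairs)
    cooccurrence = Counter(cooccurring_pairs)
    stats = {}
    for pair, cofreq in cooccurrence.items():
        freq = frequency.get(pair, 0)
        stats[pair] = {
            "frequency": freq,
            "cooccurrence_frequency": cofreq,
            # estimated probability of a relation given co-occurrence
            "relation_likelihood": freq / cofreq,
        }
    return stats
```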
The baseline prediction code built on these statistics is in ./src/baseline_RE.py.
Experiments with REBEL (https://github.com/Babelscape/rebel).
Created new files:
- ./rebel_RE/conf/data/gutbrainie_data.yaml
- ./rebel_RE/conf/train/gutbrainie_train.yaml
- ./rebel_RE/datasets/gutbrainie_typed.py
Adapted existing REBEL files to include the new relation and entity types.
Train on the server with ./rebel_RE/src/train_job.sh.
Run inference on the server with ./rebel_RE/src/test_job.sh.
This saves predictions to a file such as ./rebel_RE/predictions/preds_gutbrainie.jsonl.
The output can then be converted into the challenge format for evaluation via ./rebel_RE/predictions/convert_to_format.py.
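REBEL generates relations as a linearized sequence with `<triplet>`, `<subj>` and `<obj>` markers, encoding each triplet as `<triplet> head <subj> tail <obj> relation`. A minimal parser sketch for that scheme (the repo's `convert_to_format.py` presumably does something similar, plus mapping the triplets onto the challenge schema):

```python
def extract_triplets(text):
    """Parse REBEL-style linearized output into (head, relation, tail)
    triplets, skipping malformed fragments."""
    triplets = []
    for chunk in text.split("<triplet>"):
        chunk = chunk.strip()
        if "<subj>" not in chunk or "<obj>" not in chunk:
            continue  # empty or malformed fragment
        head, rest = chunk.split("<subj>", 1)
        tail, relation = rest.split("<obj>", 1)
        triplets.append((head.strip(), relation.strip(), tail.strip()))
    return triplets
```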