This repository fine-tunes open LLMs (e.g., the Gemma family) using mlx-lm for Coreference Resolution.
Coreference Resolution restores the entities in a text that was written with pronouns and other references to avoid repetition and to add linguistic variety.
For instance, the text
Paul went to his local Costco. He made sure to buy six packs.
becomes
Paul went to Paul's local Costco. Paul made sure to buy six packs.
Coreference Resolution helps computers process text, avoid ambiguity, and support downstream text analysis such as Sentiment Analysis.
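As a toy illustration (not part of this repo's code), resolved text makes entity mentions directly countable, with no pronoun handling needed:

```python
# Toy illustration: after coreference resolution, every mention of an
# entity is explicit, so a naive count finds them all.
original = "Paul went to his local Costco. He made sure to buy six packs."
resolved = "Paul went to Paul's local Costco. Paul made sure to buy six packs."

def mention_count(text: str, entity: str) -> int:
    # Naive substring count; a real pipeline would tokenize first.
    return text.count(entity)

print(mention_count(original, "Paul"))  # 1 -- pronouns hide two mentions
print(mention_count(resolved, "Paul"))  # 3 -- every mention is explicit
```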
Note that since mlx-lm is developed and optimized for Apple Silicon devices (e.g., MacBook Air M1), the code in this repo runs only on Apple Silicon.
Additionally, it is assumed that uv has been installed.
The following files are used as prerequisite data / configurations:
training_config.yml: the training configuration is defined in this YAML file.
data/coreference_dataset.csv: the original dataset to be used in the training (original: original text, coref: desired text).
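For orientation, training_config.yml might look roughly like the sketch below. The key names follow mlx-lm's LoRA example configuration, and the model id, data path, and hyperparameter values are placeholders, not this repo's actual settings:

```yaml
# Illustrative sketch only; check your training_config.yml and the
# mlx-lm documentation for the keys your version supports.
model: "google/gemma-2-2b-it"    # assumed model id
train: true
data: "data"                     # directory holding training.jsonl / valid.jsonl
batch_size: 4
iters: 600
learning_rate: 1e-5
adapter_path: "adapters"
```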
To prepare the training data, run the coref_prep.py file to generate the data in the JSON Lines format accepted by mlx-lm.
Usage:
uv run python coref_prep.py
This will create the training.jsonl and valid.jsonl datasets within the data/ directory.
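The exact logic of coref_prep.py is not shown here, but the conversion can be sketched as follows. This is an assumption-laden example: the prompt template and the prompt/completion field layout are one of the JSONL formats mlx-lm accepts, not necessarily the one this repo uses.

```python
import csv
import io
import json

# Hypothetical sketch: turn (original, coref) CSV rows into
# {"prompt": ..., "completion": ...} JSONL records for mlx-lm.
# The column names and prompt template are assumptions.
csv_text = """original,coref
"Paul went to his local Costco. He made sure to buy six packs.","Paul went to Paul's local Costco. Paul made sure to buy six packs."
"""

def rows_to_jsonl(csv_file) -> list[str]:
    lines = []
    for row in csv.DictReader(csv_file):
        record = {
            "prompt": f"Resolve all coreferences: {row['original']}",
            "completion": row["coref"],
        }
        lines.append(json.dumps(record))
    return lines

jsonl_lines = rows_to_jsonl(io.StringIO(csv_text))
print(jsonl_lines[0])
```

Each output line is one standalone JSON object, which is what the .jsonl extension implies.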
To start the training (fine-tuning), run the coref_training.py script with the datasets and the configuration file (training_config.yml) in place.
Usage:
uv run python coref_training.py
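After training, a quick sanity check is to compare model outputs against the desired coref text. The sketch below scores exact-match accuracy; the prediction strings are placeholders, not real fine-tuned model output:

```python
# Hypothetical evaluation sketch: exact-match accuracy between model
# outputs and reference coref text. Predictions here are placeholders.
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

references = ["Paul went to Paul's local Costco. Paul made sure to buy six packs."]
predictions = ["Paul went to Paul's local Costco. Paul made sure to buy six packs."]
print(exact_match_accuracy(predictions, references))  # 1.0
```

Exact match is strict; for longer texts a per-mention or token-level score would be more forgiving.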