@mcioffi
Contributor

@mcioffi mcioffi commented Aug 11, 2025

The Updates ✏️

TL;DR: The future web app (PR #25) uses websockets to route training data to the front-end, and tqdm complicates this because of how its stdout output behaves over sockets, so this PR proposes removing it. Also included are updates to tabulate printing and a new decorator function.

  • Trainer: Removes the tqdm dependency
  • Trainer: Reverts to logger.info for multi-processing tasks such as train (multiple), gridsearch, and featuresearch
  • Trainer: Normalizes use of tabulate across all tasks; train (single/multiple) now gets this treatment (examples below)
  • Trainer: Adds a new decorator function, set_redirect_log_stream, to be used in the web app
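
For readers unfamiliar with the idea behind set_redirect_log_stream, here is a minimal sketch of a decorator that attaches a logging stream handler for the duration of a call. It is not the implementation from this PR: the signature, the logger name "train", and the log format (modelled on the console output in this description) are all assumptions made for illustration, and an io.StringIO buffer stands in for whatever stream forwards text to a websocket.

```python
import functools
import io
import logging


def set_redirect_log_stream(stream, logger_name="train"):
    """Hypothetical sketch: copy log records to `stream` while the wrapped call runs."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger = logging.getLogger(logger_name)
            handler = logging.StreamHandler(stream)
            # Assumed format, loosely matching "[INFO] (module) message".
            handler.setFormatter(
                logging.Formatter("[%(levelname)s] (%(module)s) %(message)s")
            )
            previous_level = logger.level
            logger.setLevel(logging.INFO)
            logger.addHandler(handler)
            try:
                return func(*args, **kwargs)
            finally:
                # Always detach the handler so repeated runs do not stack handlers.
                logger.removeHandler(handler)
                logger.setLevel(previous_level)

        return wrapper

    return decorator


buffer = io.StringIO()  # stand-in for a stream that forwards to a websocket


@set_redirect_log_stream(buffer)
def train_once():
    logging.getLogger("train").info("Training model with training data.")
    return "done"


result = train_once()
captured = buffer.getvalue()
```

Because plain logger.info calls write whole lines, the captured text can be forwarded as-is, whereas a progress bar that rewrites the current terminal line would need extra handling.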
$ python3 train.py train --database train/data/training.sqlite3

[INFO] (training_utils) Loading and transforming training data.
[INFO] (training_utils) 81,272 usable vectors.
[INFO] (training_utils) 45 discarded due to OTHER labels.
[INFO] (train_model) 905609147 is the random seed used for the train/test split.
[INFO] (train_model) 65,017 training vectors.
[INFO] (train_model) 16,255 testing vectors.
[INFO] (train_model) Training model with training data.
[INFO] (train_model) Evaluating model with test data.

+------------------------+---------------------------+
| Sentence-level results | Word-level results        |
+------------------------+---------------------------+
| Accuracy: 94.42%       | Accuracy: 97.84%          |
|                        | Precision (micro) 97.82%  |
|                        | Recall (micro) 97.84%     |
|                        | F1 score (micro) 97.83%   |
+------------------------+---------------------------+
$ python3.12 train.py multiple --database train/data/training.sqlite3 --runs 2

[INFO] (training_utils) Loading and transforming training data.
[INFO] (training_utils) 81,272 usable vectors.
[INFO] (training_utils) 45 discarded due to OTHER labels.
[INFO] (train_model) Queued for 2 separate runs. This may take some time.
[INFO] (train_model) 1st run completed
[INFO] (train_model) 2nd run completed

+---------+---------------------+-------------------+-----------+
| Run     | Word/Token accuracy | Sentence accuracy | Seed      |
+---------+---------------------+-------------------+-----------+
| 1st     | 97.75%              | 94.36%            | 94545036  |
| 2nd     | 98.01%              | 94.83%            | 125339459 |
| -       | -                   | -                 | -         |
| Average | 97.88% ± 0.54%      | 94.60% ± 0.98%    | 125339459 |
| -       | -                   | -                 | -         |
| Best    | 98.01%              | 94.83%            | 125339459 |
| Worst   | 97.75%              | 94.36%            | 94545036  |
+---------+---------------------+-------------------+-----------+

mcioffi added a commit to mcioffi/ingredient-parser that referenced this pull request Aug 12, 2025
…le output from websockets

1. Work-in-progress new feature for gridsearch
2. Revert to logger.info for websocket output, to align with the develop branch
3. Refine the console output; should have originally used <pre> tags to preserve reserved characters such as \n and \t

Note: This commit refers to a function, set_redirect_log_stream, that will be introduced in strangetom#35, since it is required to set a logging stream handler correctly to pipe results to websockets
@strangetom
Owner

Thanks @mcioffi

A few minor issues:

  1. Can you run the pre-commit hooks to fix the formatting issues?
  2. Using the fancy_grid style for the tables makes for better readability when the table columns are wide (e.g. for grid search results)

@strangetom
Owner

Perfect, thanks!

@strangetom strangetom merged commit 50c12f3 into strangetom:develop Aug 12, 2025
4 checks passed
strangetom added a commit that referenced this pull request Aug 26, 2025
* Drop flask front-end

* First batch of vite/react/ts front-end with flask server

* Updates for docs and readability

* Cleanup

* Update assets + readme

* Refactoring webtools + integrating flask-socketio

* Bump readme

* Bump

* dev reqs & concurrent pids bash in npm

* Change handling of edit mode from modal to inline

* cleanup

* Labeller help hover

* Missed doc notes

* Train model tab updates

- Cleaned up flask code
- Removed train tab interrupt button (limitations)
- Added package requirements precheck on app startup
- Added time elapsed on training screen

* Accidental none check against thread, preventing re-runs

* Asset screenshot

* Move time tracker on trainer to zustand state

"Time elapsed" tracking needs to live in managed state rather than UI state, because training is asynchronous and the user can keep using the app while the model trains in the background

* Bump

* Add seed input to training

* Add seed input to training (server)

* Sets up cache reset on model loader

Addresses one of the comments by @strangetom on #25 (comment) regarding resetting the cache created by the @lru_cache decorator on the model loader. This ensures that a training cycle completed via the web app (which affects the parser) optimistically updates the parser by resetting the cache

* Bump up third-party package versions against vite and mantine

* WIP - add gridsearch options, revert to logger.info, and refine console output from websockets

1. Work-in-progress new feature for gridsearch
2. Revert to logger.info for websocket output, to align with the develop branch
3. Refine the console output; should have originally used <pre> tags to preserve reserved characters such as \n and \t

Note: This commit refers to a function, set_redirect_log_stream, that will be introduced in #35, since it is required to set a logging stream handler correctly to pipe results to websockets

* Screenshots

* Remove remnants

* Cleanup, readability, etc

* Remove interrupt and keepModels based on maintainer feedback

* Pre-commit format fixes, part I

* Pre-commit fixes, part II (adds biomejs for webtools linter/formatter/etc)

Tacks on to the existing pre-commit hooks in the repo with BiomeJS. BiomeJS is specifically for the web side (typescript/js/css). Local configuration modeled after preconfigs at https://github.com/biomejs/pre-commit. Commit includes all file format fixes necessary. Anticipated that config will be modified as necessary if new web contributors participate in future commits.

* Restore training.sqlite3 to previous commit

Accidentally included new training.sqlite3 in e01430b

* Address bug feedback, round I

- [x] Parser: The amount flags are not shown in the results table
- [x] Parser: order labels in token tooltip as per old webapp
- [x] Parser: Reorder rows in result table as per old webapp
- [x] General: Can we avoid use of google fonts? Everything else (after running npm install) is entirely local.
- [x] Parser: show % sign in tooltips for scores for each label
- [x] Trainer: Output includes debug info from app.sockets, which isn't relevant to training.
- [x] Trainer: html and tsv files saved in webtools directory instead of parent
- [x] General: on the parser and labeller pages, the contrast between text and colours is a bit low (particularly for anything on a red background). This is probably best fixed by increasing the font size on those pages, since it's a little small anyway.
- [x] Trainer: default model location incorrect (this was changed recently, it should be in ingredient_parser/en/data/)
- [x] Trainer: split value does not allow more than one decimal place. There shouldn't really be any limit on the number of decimal places (but in practice, 3 might be a more reasonable limit than 1).
- [x] Trainer: saveur dataset is selected by default, but doesn't exist.
- [x] Parser: Missing separate_names option

* Address bugs feedback, round II

- [x] Trainer: Inputs for seed, split do not focus on click so cannot edit without tabbing to the inputs or finding another way to focus them. This might be a wider problem than just those.
- [x] Trainer: unable to start another training run after completing a previous one if confusion matrix is generated by the first run. There's a RuntimeError related to Tk.

* Address bugs feedback, round III

- [x] Labeller: Unable to search unicode fractions
- [x] Labeller: Unable to quickly select a label for a token using first letter (as per <select> element)
- [x] Labeller: Bug when searching (coarse, dried returns no results but should return sentence id 4718)

* Address bugs feedback, round IV

- [x] Trainer: The split value seems to be capped at 0.9. That should probably be changed to 0.999
- [x] Labeller: There's a weird bug when changing the label for a token: an additional PUNC token gets inserted. This seems to affect any sentence containing a comma when you change the label of the first token.

* Address bugs feedback, round V

- [x] Labeller: Is it possible to be able to tab between tokens in a sentence to change to the next one?

* Address bugs feedback, round VI

- [x] Trainer: Gridsearch incomplete

---------

Co-authored-by: tom <tpstrange@gmail.com>
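
The "Sets up cache reset on model loader" entry above relies on the fact that a function wrapped with functools.lru_cache keeps returning its first result until the cache is cleared. A minimal sketch of that behaviour follows. The names load_model, on_training_complete and MODEL_STATE are hypothetical stand-ins and not the project's real loader.

```python
from functools import lru_cache

# Stand-in for the model file on disk; a real loader would read from a path.
MODEL_STATE = {"version": 1}


@lru_cache(maxsize=None)
def load_model():
    """Hypothetical loader: the result is cached after the first call."""
    return {"version": MODEL_STATE["version"]}


def on_training_complete():
    """Hypothetical hook run when a training cycle finishes in the web app."""
    MODEL_STATE["version"] += 1  # simulate a newly written model
    load_model.cache_clear()     # without this, the stale model stays cached


first = load_model()
MODEL_STATE["version"] += 1
stale = load_model()             # still the cached object from the first call
on_training_complete()
fresh = load_model()             # reloaded after the cache was cleared
```

Clearing the cache at the end of training means the next parse request reloads the model, at the cost of one slower call.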
@mcioffi mcioffi deleted the pr/35 branch August 27, 2025 11:57