@mcioffi
Contributor

@mcioffi mcioffi commented Oct 30, 2024

The New 💯

TL;DR: Moves the Flask front-end to a Vite/React/TypeScript front-end, preserving the basic Flask server.

  • Mantine holds most of the weight across the app for base component functionality
  • Combines the original labeller and webapp into one webtools
  • Webtools has three main features: (a) the main parser to test sentences and ingredients, (b) the labeller to edit and improve the dataset, and (c) a trigger to train the model
  • A favicon/logo
  • For the above, the (c) trigger to train the model still needs to be built; initial thinking is to use websockets
  • Preserve most, if not all, of original repo features ... currently it's not all migrated yet
  • Add some testing framework for unit tests
  • General cleanup and simpler ways to spin up a dev instance of Flask plus Vite/React/TypeScript

"Too Early" Sneak Peek 👀

[image attachment]

@mcioffi
Contributor Author

mcioffi commented Apr 30, 2025

@strangetom does pycrfsuite expose some sort of progress on the training execution? I looked at their docs on the Trainer class, but figured I would pose the question here, as it relates to this PR, since I am building a piece of the web tool for triggering and working with the training parts of this repo.

trainer.train(save_model)

@strangetom
Owner

It's possible to set trainer = pycrfsuite.Trainer(verbose=True), which gives output like this during model training:

***** Iteration #774 *****
Loss: 1743.588445
Feature norm: 27.243591
Error norm: 18.806226
Active features: 5412
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.056

***** Iteration #775 *****
Loss: 1743.586763
Feature norm: 27.243678
Error norm: 6.752622
Active features: 5411
Line search trials: 2
Line search step: 0.500000
Seconds required for this iteration: 0.111

...etc

It's not a measure of progress as such, but it is an indication that something is happening.
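If that verbose output can be captured as text, the iteration number and loss could be pulled out to show that training is advancing. A minimal sketch, assuming the log format shown above stays stable; parse_progress is a hypothetical helper and not part of pycrfsuite:

```python
import re

# Patterns match the verbose lines quoted above, e.g. "***** Iteration #774 *****".
ITERATION_RE = re.compile(r"^\*{5} Iteration #(\d+) \*{5}$")
LOSS_RE = re.compile(r"^Loss: ([0-9.]+)$")


def parse_progress(log_text):
    """Return a list of (iteration, loss) pairs found in verbose trainer output."""
    progress = []
    current_iteration = None
    for line in log_text.splitlines():
        line = line.strip()
        match = ITERATION_RE.match(line)
        if match:
            current_iteration = int(match.group(1))
            continue
        match = LOSS_RE.match(line)
        if match and current_iteration is not None:
            progress.append((current_iteration, float(match.group(1))))
            current_iteration = None
    return progress


sample = """\
***** Iteration #774 *****
Loss: 1743.588445
Feature norm: 27.243591

***** Iteration #775 *****
Loss: 1743.586763
Feature norm: 27.243678
"""
print(parse_progress(sample))
```

This only reports which iteration was reached; without knowing how many iterations the run will take, it still isn't a percentage.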

@mcioffi
Contributor Author

mcioffi commented May 8, 2025

@strangetom can you give the PR a spin on your machine, and run through the "Try the Parser" and "Browse & Label" tabs? Node needs to be installed; it's managed outside of Python, as it's a JS runtime. Once it's globally installed on your machine, you can download the packages with /webtools/ as your working directory:

npm install

Then you need to run these three separate instances, and access the app in your browser the same as you would have before, typically at http://localhost:5000 or http://127.0.0.1:5000

npm run flask
npm run flask-sockets
npm run watch

Depending on your Python environment setup, you might need to map the Python path located in the npm package.json, in which case you can change these accordingly:

"flask": "python3_path=$(which python3) && \"$python3_path\" app.py",
"flask-sockets": "python3_path=$(which python3) && \"$python3_path\" app.sockets.py",

I have made some good progress on the "Train Model" tab, but there are some outstanding issues, so you can ignore that at the moment.

app labeller screenshot

@strangetom
Owner

Hi @mcioffi

I've got this working and have had a play with it. This is very impressive.

I have a couple of notes:

  • flask_cors is missing from requirements-dev.txt.
  • Is there a way to run this without running the three commands?
  • I'm not sure if there's a need for a separate edit mode when browsing the training data.

@mcioffi
Contributor Author

mcioffi commented May 13, 2025

Many thanks for the feedback. I pushed some updates based on your notes. I will have the 'Train model' tab done in coming weeks.

All set, thank you

flask_cors is missing from requirements-dev.txt.

Pulled everything under npm run dev; it should now spawn three separate processes within the same terminal session

Is there a way to run this without running the three commands?

Good point. I folded that functionality inline now, instead of in a separate user experience

I'm not sure if there's a need for a separate edit mode when browsing the training data.

mcioffi added 3 commits May 13, 2025 10:06
- Cleaned up flask code
- Removed train tab interrupt button (limitations)
- Added package requirements precheck on app startup
- Added time elapsed on training screen
@mcioffi
Contributor Author

mcioffi commented Jul 31, 2025

@strangetom one more batch of requests. Could you spin up an instance and validate the following:

  • Train a model with configured inputs (e.g. selected datasets, html results file, split value, etc)
  • Navigate and use other app features while the training is running in the background
  • Try to close the browser tab / window while the training is running in the background; you should receive a prompt
  • Uninstall a requirements-dev.txt package (e.g. tqdm), and restart the app; you should receive a prompt

Due to how threading works in Python, which Flask-SocketIO uses to spawn background tasks without interfering with websockets, there is no true way to interrupt or terminate the training in the background. I left it out of scope, and will assume this meets the basic needs of most users. Also out of scope are any "new language" database/training (e.g. Spanish) and 1:1 configuration with the command line options for training (e.g. the app uses the default SQLite file/directory/etc versus 100% configuration in a terminal).

Python's Thread class supports a subset of the behavior of Java's Thread class; currently, there are no priorities, no thread groups, and threads cannot be destroyed, stopped, suspended, resumed, or interrupted. The static methods of Java's Thread class, when implemented, are mapped to module-level functions.
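For context, the usual workaround is cooperative cancellation: the worker checks a flag between units of work and exits on its own. A minimal sketch with threading.Event, where run_training is a hypothetical stand-in loop; this only helps when the loop is under your control, which is not the case for a single blocking call such as trainer.train():

```python
import threading
import time


def run_training(stop_requested, steps=50):
    """Hypothetical stand-in for a training loop that checks a flag between steps."""
    completed = 0
    for _ in range(steps):
        if stop_requested.is_set():
            break  # the thread ends itself; nothing kills it from outside
        time.sleep(0.01)  # placeholder for one unit of work
        completed += 1
    return completed


stop_requested = threading.Event()
results = []
worker = threading.Thread(
    target=lambda: results.append(run_training(stop_requested))
)
worker.start()
time.sleep(0.05)
stop_requested.set()  # ask the worker to stop at its next check
worker.join()
print(f"stopped after {results[0]} of 50 steps")
```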

mcioffi added 4 commits July 31, 2025 21:43
"Time elapsed" tracking needs to live within managed state, not UI state, because training is asynchronous and the user can use the app while the model is training in the background
@strangetom
Owner

Hi @mcioffi

I had to add allow_unsafe_werkzeug=True to the socketio.run() function in app.sockets.py to get the training functionality to work.

  • Training worked with the various options I tried.
    • It would be handy if there was an option to set the seed value for the training run. I use it a lot to compare performance when I'm making changes.
  • Whilst training the model I was able to use the other functionality in the app.
  • Attempting to close the tab whilst training did display a prompt.
  • Uninstalling a dependency did show an error in the terminal when starting the app.
    • However the app didn't abort, which is what I expected would happen.
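One way the startup check could abort rather than just report is sketched below. This is an illustration only, not the check in this PR: find_missing and precheck are hypothetical names, and the module names passed in are import names, which can differ from the package names listed in requirements-dev.txt.

```python
import importlib.util
import sys


def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def precheck(module_names):
    """Exit with an error message (non-zero status) if any required module is missing."""
    missing = find_missing(module_names)
    if missing:
        sys.exit(f"Missing required packages: {', '.join(missing)}")


# "json" is in the standard library; the second name is deliberately fake.
print(find_missing(["json", "surely_not_an_installed_module_xyz"]))
```

Calling precheck(...) before the server starts would stop the app when something is missing, because sys.exit raises SystemExit.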

I've got a couple more questions:

  • If I train a new model and then go to use the parser functionality in the app, will it use the new model or the model that was present when the app was started? The load_parser_model function in the package is cached with an LRU cache, so unless you're handling that specifically I think the app would continue with the older model.
  • Since you originally created this branch there have been some changes to the training process. Some of the command line options have changed and I've switched from using print to logging for displaying progress - there might be some impact on this.
  • Is it necessary to have the entire node_modules folder committed to git? Does the package-lock.json provide enough information to make it reproducible without copying node_modules around?

@mcioffi
Contributor Author

mcioffi commented Aug 1, 2025

It would be handy if there was an option to set the seed value for the training run. I use it a lot to compare performance when I'm making changes.

👍🏼 makes sense, will add the seed input, mimicking the existing

If I train a new model and then go to use the parser functionality in the app, will it use the new model or the model that was present when the app was started? The load_parser_model function in the package is cached with an LRU cache, so unless you're handling that specifically I think the app would continue with the older model.

Let me play around with the functionality of lru_cache. It appears it should be simple to reset once the training cycle is complete via the web app: "The decorator also provides a cache_clear() function for clearing or invalidating the cache"
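A minimal sketch of that behaviour using functools.lru_cache. Here load_model and MODEL_VERSION are hypothetical stand-ins for illustration; the real load_parser_model may take arguments and behave differently:

```python
from functools import lru_cache

MODEL_VERSION = {"value": 1}  # stand-in for the model file on disk


@lru_cache(maxsize=None)
def load_model():
    """Hypothetical loader; the result is cached after the first call."""
    return f"model-v{MODEL_VERSION['value']}"


first = load_model()
MODEL_VERSION["value"] = 2   # pretend a new model has just been trained
stale = load_model()         # still returns the cached, older result
load_model.cache_clear()     # invalidate once training completes
fresh = load_model()
print(first, stale, fresh)
```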

Since you originally created this branch there have been some changes to the training process. Some of the command line options have changed and I've switched from using print to logging for displaying progress - there might be some impact on this.

Currently using contextlib.redirect_stdout to swallow the print output. Will adjust to accommodate logging versus print, and check out your develop branch to see what's upcoming
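To illustrate why the switch matters: redirect_stdout captures print output, but not records sent through a logger, which need a handler instead. A minimal sketch, assuming a logger named "training_demo" (a made-up name) and a stand-in train function:

```python
import contextlib
import io
import logging

logger = logging.getLogger("training_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo independent of root logger config


def train():
    """Stand-in for a training run that reports progress both ways."""
    print("printed progress")
    logger.info("logged progress")


# redirect_stdout only sees what is written to sys.stdout, i.e. the print call.
stdout_buffer = io.StringIO()
with contextlib.redirect_stdout(stdout_buffer):
    train()

# A StreamHandler attached to the logger captures the logging output.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
logger.addHandler(handler)
try:
    with contextlib.redirect_stdout(io.StringIO()):
        train()
finally:
    logger.removeHandler(handler)

print(repr(stdout_buffer.getvalue()))
print(repr(log_buffer.getvalue()))
```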

Is it necessary to have the entire node_modules folder committed to git? Does the package-lock.json provide enough information to make it reproducible without copying node_modules around?

Correct, node_modules can be removed when copying/moving around the file system. Running npm install restocks the packages, and package-lock.json is usually preserved in source control (from what I have seen)

mcioffi added a commit to mcioffi/ingredient-parser that referenced this pull request Aug 11, 2025
Changes are to accommodate strangetom#25, as webtools will need basic logging to handle websocket logging behavior required for front-end
…le output from websockets

1. Work-in-progress new feature for gridsearch
2. Revert to logger.info for websocket output, to be aligned with develop branch
4. Refine the console output, should have originally used <pre> tags to persist reserved str \n, \t, etc

Note: This commit refers to a function set_redirect_log_stream, that will be introduced in strangetom#35, since it is required to set a logging stream handle correctly to pipe results to websockets
strangetom pushed a commit that referenced this pull request Aug 12, 2025
* Drop tqdm in favor of logger; normalize use of tabulate

Changes are to accommodate #25, as webtools will need basic logging to handle websocket logging behavior required for front-end

* Missed reference

* Preserve logging bypass from existing branch

* Remove flask web package dependencies until PR#25

* Introduce decorator fn for redirecting log stream to be later used in web app/tool

Refer to notes on other PR at 371a060

* Formatting fixes with pre-commit
@mcioffi mcioffi changed the base branch from master to develop August 12, 2025 17:20
…/etc)

Tacks on to the existing pre-commit hooks in the repo with BiomeJS. BiomeJS is specifically for the web side (typescript/js/css). Local configuration modeled after preconfigs at https://github.com/biomejs/pre-commit. Commit includes all file format fixes necessary. Anticipated that config will be modified as necessary if new web contributors participate in future commits.
Accidentally included new training.sqlite3 in strangetom@e01430b
@strangetom
Owner

strangetom commented Aug 16, 2025

Hi @mcioffi

I've had another, more in-depth look at this. I've found a couple of bugs that will need fixing before merging. These are in no particular order.

Bugs

  • General: on the parser and labeller pages, the contrast between text and colours is a bit low (particularly for anything on a red background). This is probably best fixed by increasing the font size on those pages, since it's a little small anyway.
  • Parser: Missing separate_names option
  • Labeller: Bug when searching (coarse, dried returns no results but should return sentence id 4718)
  • Labeller: Unable to quickly select a label for a token using first letter (as per <select> element)
  • Trainer: Inputs for seed, split do not focus on click so cannot edit without tabbing to the inputs or finding another way to focus them. This might be a wider problem than just those.
  • Trainer: Output includes debug info from app.sockets, which isn't relevant to training.
  • Trainer: default model location incorrect (this was changed recently, it should be in ingredient_parser/en/data/)
  • Trainer: saveur dataset is selected by default, but doesn't exist.
  • Trainer: html and tsv files saved in webtools directory instead of parent
  • Trainer: unable to start another training run after completing a previous one if confusion matrix is generated by the first run. There's a RuntimeError related to Tk.
  • Trainer: split value does not allow more than one decimal place. There shouldn't really be any limit of the number of decimal places (but in practice, 3 might be a more reasonable limit than 1).

Not bugs, but would be nice

  • Parser: show % sign in tooltips for scores for each label
  • Parser: order labels in token tooltip as per old webapp
  • Parser: Reorder rows in result table as per old webapp
  • Parser: The amount flags are not shown in the results table
  • General: Can we avoid use of google fonts? Everything else (after running npm install) is entirely local.

This is a seriously impressive bit of work, thanks again!

- [x] Parser: The amount flags are not shown in the results table
- [x] Parser: order labels in token tooltip as per old webapp
- [x] Parser: Reorder rows in result table as per old webapp
- [x] General: Can we avoid use of google fonts? Everything else (after running npm install) is entirely local.
- [x] Parser: show % sign in tooltips for scores for each label
- [x] Trainer: Output includes debug info from app.sockets, which isn't relevant to training.
- [x] Trainer: html and tsv files saved in webtools directory instead of parent
- [x] General: on the parser and labeller pages, the contrast between text and colours is a bit low (particularly for anything on a red background). This is probably best fixed by increasing the font size on those pages, since it's a little small anyway.
- [x] Trainer: default model location incorrect (this was changed recently, it should be in ingredient_parser/en/data/)
- [x] Trainer: split value does not allow more than one decimal place. There shouldn't really be any limit of the number of decimal places (but in practice, 3 might be a more reasonable limit than 1).
- [x] Trainer: saveur dataset is selected by default, but doesn't exist.
- [x] Parser: Missing separate_names option
- [x] Trainer: Inputs for seed, split do not focus on click so cannot edit without tabbing to the inputs or finding another way to focus them. This might be a wider problem than just those.
- [x] Trainer: unable to start another training run after completing a previous one if confusion matrix is generated by the first run. There's a RuntimeError related to Tk.
@mcioffi
Contributor Author

mcioffi commented Aug 17, 2025

Labeller: Bug when searching (coarse, dried returns no results but should return sentence id 4718)

Thanks! For this one ^, do you prefer to catch other intermediary punctuation too, such as , ; - and longer dashes? I also noticed that fractions, e.g. ½, don't work with labeller search, so will address that too. Is there a way to reverse convert an already altered unicode fraction 1#1$2 back to 1 ½? I believe I will need to include that too. For example, if you filter only against the QTY label in the labeller, you hit this scenario where the search term will look against raw tokens, e.g. ["1#1$2", "cup", "heavy", "cream"] sentence id 10414 ->

partial_sentence = " ".join(

Labeller: Unable to quickly select a label for a token using first letter (as per select element)

And do you have a screenshot handy for this ^, trying to visualize it for clarity, thanks!

@strangetom
Owner

Thanks! For this one ^, do you prefer to catch other intermediary punctuation too, such as , ; - and longer dashes? I also noticed that fractions, e.g. ½, don't work with the labeller search, so will address that too.

I think it should match the search phrase exactly, so coarse, dried should only find sentences that contain the substring coarse, dried.
Good catch on the fractions, I'd not noticed that.
There's also an existing bug with the old labeller app where it would never find any matching sentences if the search phrase ended with punctuation. Fixing that would be handy too.

Labeller: Unable to quickly select a label for a token using first letter (as per select element)

And do you have a screenshot handy for this ^, trying to visualize it for clarity, thanks!

Sorry, my explanation was pretty terrible. With the old labeller app (which uses the <select> element for the dropdowns), when the dropdown is focussed you can type the first letter of the label you want and it will select the first option that starts with that letter. If there are multiple options that start with the same letter, typing the letter again will move on to the next option starting with that letter.
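That behaviour can be described as cycling through the options that share a first letter. A rough sketch of the selection logic (written in Python for brevity, although the front-end is TypeScript); next_by_first_letter and the label list are illustrative, not taken from the app:

```python
def next_by_first_letter(options, current_index, letter):
    """Return the index of the next option starting with `letter`, searching
    from just after `current_index` and wrapping around; return
    `current_index` unchanged if no option matches."""
    letter = letter.lower()
    count = len(options)
    for offset in range(1, count + 1):
        index = (current_index + offset) % count
        if options[index].lower().startswith(letter):
            return index
    return current_index


labels = ["QTY", "UNIT", "PREP", "PUNC", "COMMENT"]  # illustrative label list
first = next_by_first_letter(labels, 0, "p")       # first option starting with P
second = next_by_first_letter(labels, first, "p")  # typing P again moves on
third = next_by_first_letter(labels, second, "p")  # wraps back around
print(labels[first], labels[second], labels[third])
```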

@strangetom
Owner

strangetom commented Aug 17, 2025

Is there a way to reverse convert an already altered unicode fraction 1#1$2 back to 1 ½? I believe I will need to include that too.

Ah, that is a problem. I don't think I've ever encountered that when using the labeller, which is not to say we shouldn't fix it.
Would it be possible to use the PreProcessor._identify_fractions function on the search phrase to convert to the 1#1$2 form and then try to do the match? We'd probably need to refactor _identify_fractions to move it into the _utils.py file.

Edit: It might actually be easier to just run the search phrase through the PreProcessor and compare the resultant tokens with the tokens in the database.

- [x] Labeller: Unable to search unicode fractions
- [x] Labeller: Unable to quickly select a label for a token using first letter (as per <select> element)
- [x] Labeller: Bug when searching (coarse, dried returns no results but should return sentence id 4718)
@mcioffi
Contributor Author

mcioffi commented Aug 17, 2025

Edit: It might actually be easier to just run the search phrase through the PreProcessor and compare the resultant tokens with the tokens in the database.

Agreed, went that route. I used PreProcessor("input").sentence, no need to modify PreProcessor._identify_fractions. Re: "trailing sentence period", it appears PreProcessor does strip it.
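The comparison step can be thought of as looking for the search tokens as a contiguous run within a sentence's stored tokens. A minimal sketch, assuming both token lists have already been produced by the same pre-processing; contains_tokens is a hypothetical helper, not the code in this PR:

```python
def contains_tokens(sentence_tokens, search_tokens):
    """Return True if search_tokens appears as a contiguous run in sentence_tokens."""
    width = len(search_tokens)
    if width == 0:
        return True
    return any(
        sentence_tokens[start:start + width] == search_tokens
        for start in range(len(sentence_tokens) - width + 1)
    )


stored = ["1#1$2", "cup", "heavy", "cream"]  # tokens as stored for sentence id 10414
print(contains_tokens(stored, ["1#1$2", "cup"]))
print(contains_tokens(stored, ["cup", "cream"]))
```

Because the search phrase goes through the same normalisation as the stored sentences, a fraction in the search box ends up in the same 1#1$2 form before the comparison.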

@strangetom
Owner

Thanks for those bugfixes. There are a few remaining that I can find:

  • Thanks for changing the labeller to use the combobox, it makes things much easier. Is it possible to be able to tab between tokens in a sentence to change to the next one?
  • Trainer: The split value seems to be capped at 0.9. That should probably be changed to 0.999 - I was trying to use 0.95 to quickly train a model to test the webtools.
  • Labeller: There's a weird bug when changing the label for a token - an additional PUNC token gets inserted.
    • Steps to reproduce:
      1. Open labeller
      2. Load the NYT dataset
      3. Enable editing
      4. Change the label of the first token in the first sentence
      5. Observe an additional comma token is inserted (after the word squash).
        This seems to affect any sentence containing a comma when you change the label of the first token.

@mcioffi
Contributor Author

mcioffi commented Aug 18, 2025

Thanks for changing the labeller to use the combobox, it makes things much easier. Is it possible to be able to tab between tokens in a sentence to change to the next one?

Took a look at Mantine's Combobox source code. Unfortunately, it looks like they only reserve ArrowUp and ArrowDown for navigating. Hopefully those keys work for now to navigate via the keyboard in the dropdown.

Labeller: There's a weird bug when changing the label for a token - an additional PUNC token gets inserted.

Ah ok, this one is related to how React renders lists with keys, plus some code fragility. I forgot the underlying token data can be repetitive and non-unique, i.e. [[',','PUNC'],['vegetable','B-NAME'],[',','PUNC']]. I will submit some fixes and refactor a bit. Thanks for catching.

Edit: Misinterpreted Combobox question, thought it was referring to the dropdown label options, not the separate tokens in the sentence. Fixed thanks

- [x] Trainer: The split value seems to be capped at 0.9. That should probably be changed to 0.999
- [x] Labeller: There's a weird bug when changing the label for a token - an additional PUNC token gets inserted. This seems to affect any sentence containing a comma when you change the label of the first token.
- [x] Labeller: Is it possible to be able to tab between tokens in a sentence to change to the next one?
- [x] Trainer: Gridsearch incomplete
@strangetom strangetom merged commit 1d8c045 into strangetom:develop Aug 26, 2025
4 checks passed
@mcioffi mcioffi deleted the 1.2.0-webtools branch August 27, 2025 11:57

2 participants