
feat: use data annotations for HTML and Markdown files, and implement splitting of data annotation requests #134

Merged: 29 commits into jeertmans:v3 on Apr 9, 2025

Conversation

Rolv-Apneseth
Collaborator

Continuing on with #117

This should hopefully check off splitting for data annotation requests (and joining of the responses).

I've also converted the Markdown and HTML parsers to return data annotations.

Open to any suggestions for changes, and I'd appreciate some testing of any files you happen to have laying around. I'm sure that these basic implementations have a lot of edge cases left to cover.

Also, I've been avoiding running the checks on files against the public instance of LanguageTool for fear of getting blocked due to the number of requests. Do we have any concerns about that? I've just been running all tests against a locally hosted docker version of the API.

codspeed-hq bot commented Mar 8, 2025

CodSpeed Performance Report

Merging #134 will not alter performance

Comparing Rolv-Apneseth:data_annotations (dc81d5c) with v3 (ea24862)

Summary

✅ 6 untouched benchmarks

@jeertmans
Owner

Thanks @Rolv-Apneseth, do you have an example .typst file that you used to check that it works?

@Rolv-Apneseth
Collaborator Author

I just made an example, as I couldn't really find many online. To be honest, this is the first I've heard of the Typst format; I just tried to follow the examples they provide, so hopefully it's all syntactically correct.

example.txt

Just save it as a .typ file, since it wouldn't let me upload with that extension. It should only have one spelling mistake from what I remember (temperatre).

@jeertmans
Owner

Could you include that file in the tests folder and use it in a test?

@Rolv-Apneseth
Collaborator Author

Sure. What do you think the best way of testing would be? The CLI integration tests?

@jeertmans
Owner

Sure. What do you think the best way of testing would be? The CLI integration tests?

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).
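Something like this, as a rough sketch (assuming assert_cmd and serde_json as dev-dependencies; the exact field names on check::Response are assumptions on my side):

```rust
use assert_cmd::Command;

#[test]
fn test_check_typst_file() {
    // Run the CLI against the example file; Typst should be auto-detected.
    let output = Command::cargo_bin("ltrs")
        .unwrap()
        .arg("check")
        .arg("tests/split_example.typst")
        .arg("--raw")
        .output()
        .unwrap();

    // Parse the raw output back into a response and look for the expected error.
    let response: languagetool_rust::check::Response =
        serde_json::from_slice(&output.stdout).unwrap();
    assert!(response
        .matches
        .iter()
        .any(|m| m.rule.id == "MORFOLOGIK_RULE_EN_US"));
}
```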

@Rolv-Apneseth
Collaborator Author

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).

I was just gonna check if it contains the string, to be honest, but sure, I can try that. Alternatively, if you want it to be thorough, how would you feel about snapshot tests with something like insta?

@jeertmans
Owner

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).

I was just gonna check if it contains the string, to be honest, but sure, I can try that. Alternatively, if you want it to be thorough, how would you feel about snapshot tests with something like insta?

Insta snapshots are probably a more appropriate solution to this, good idea!
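For reference, a minimal sketch of what that could look like (run_check would be a hypothetical helper that runs ltrs check on a file and captures stdout):

```rust
#[test]
fn test_typst_snapshot() {
    // Hypothetical helper: run `ltrs check <file>` and capture stdout.
    let output = run_check("tests/split_example.typ");
    // insta stores the first run as a reviewable snapshot and diffs later runs.
    insta::assert_snapshot!(output);
}
```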

@Rolv-Apneseth
Collaborator Author

I'll need to look a bit more into the snapshot tests (tomorrow hopefully)

@jeertmans
Owner

Looks like the snapshots contain absolute paths to files, which could be an issue as they won't match the paths in the CI workflows. I see two possible solutions: (1) ignore file paths in the diff check (maybe an option of Insta.rs), or (2) print relative paths in request outputs.

@Rolv-Apneseth
Collaborator Author

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

@jeertmans
Owner

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

You can probably just filter out the path from the snapshot, see https://insta.rs/docs/filters/.
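Something along these lines (a sketch; the exact regex is an assumption about how the paths appear in the output):

```rust
insta::with_settings!({filters => vec![
    // Redact machine-specific absolute paths so snapshots match in CI.
    (r"--> .*", "--> [path]"),
]}, {
    insta::assert_snapshot!(output);
});
```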

@Rolv-Apneseth
Collaborator Author

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

You can probably just filter out the path from the snapshot, see https://insta.rs/docs/filters/.

Thanks, I'll have a look

@@ -153,7 +53,7 @@ pub fn parse_typst(file_content: impl AsRef<str>) -> Data<'static> {
         // issues. The following sentence would give an error for
         // repeated whitespace otherwise: This has ``` `backticks`
         // ``` in it
-        "_ignore_".to_string(),
+        IGNORE,
@jeertmans (Owner) commented Mar 12, 2025

Can't we just do new_markup instead of new_interpreted_markup?

If I understand the LT API correctly, new_markup will prompt LT to ignore its content, while new_interpreted_markup will make LT interpret the markup as the provided text.

See for example the output of the following request:

{"annotation":[  {"text": "A "},  {"markup": "<b>", "interpretAs": "simpl "},  {"text": "test"},  {"markup": "</b>"}  ]}

returns:

{
  "software": {
    "name": "LanguageTool",
    "version": "6.6.13",
    "buildDate": "2025-02-19 14:09:51 +0000",
    "apiVersion": 1,
    "premium": true,
    "premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
    "status": ""
  },
  "warnings": {
    "incompleteResults": false
  },
  "language": {
    "name": "English (US)",
    "code": "en-US",
    "detectedLanguage": {
      "name": "Galician",
      "code": "gl-ES",
      "confidence": 0.3392336,
      "source": "ngram"
    }
  },
  "matches": [
    {
      "message": "Possible spelling mistake found.",
      "shortMessage": "Spelling mistake",
      "replacements": [
        {
          "value": "simply"
        },
        {
          "value": "simple"
        },
        {
          "value": "simp"
        },
        {
          "value": "simps"
        }
      ],
      "offset": 2,
      "length": 2,
      "context": {
        "text": "A <b>test</b>",
        "offset": 2,
        "length": 2
      },
      "sentence": "A simpl test",
      "type": {
        "typeName": "UnknownWord"
      },
      "rule": {
        "id": "MORFOLOGIK_RULE_EN_US",
        "description": "Possible spelling mistake",
        "issueType": "misspelling",
        "category": {
          "id": "TYPOS",
          "name": "Possible Typo"
        },
        "isPremium": false,
        "confidence": 0.68
      },
      "ignoreForIncompleteSentence": false,
      "contextForSureMatch": 0
    }
  ],
  "sentenceRanges": [
    [
      0,
      13
    ]
  ],
  "extendedSentenceRanges": [
    {
      "from": 0,
      "to": 12,
      "detectedLanguages": [
        {
          "language": "en",
          "rate": 1
        }
      ]
    }
  ]
}

So I think we can remove the IGNORE placeholder value, and change all new_interpreted_markup(..., IGNORE) calls to new_markup(...), roughly as sketched below. What do you think?
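Roughly, the change I have in mind (a sketch; the constructor names follow the calls quoted above):

```rust
let markup = "```";
// Before: markup interpreted as a placeholder, so LT checks "_ignore_".
let annotation = DataAnnotation::new_interpreted_markup(markup, IGNORE);
// After: plain markup, which LT should skip entirely.
let annotation = DataAnnotation::new_markup(markup);
```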

Collaborator Author

I could adjust which tags get that ignore pattern and leave the rest as just markup, but it's required for things like code and code blocks to avoid whitespace issues, like the comment says.

Owner

I would argue that this is a non-issue: people (or us) can disable the whitespace rule, and detecting too many / too few whitespaces is sometimes a desired feature, even around markup.

Collaborator Author

What do you mean? I wouldn't expect something like "We use tool_name_here for ..." to return a whitespace error, and disabling the whitespace rule altogether doesn't seem great.

Owner

For inline markup, like backticks, I guess we could use the interpretAs variant, e.g.,

{"annotation":[  {"text": "We usse "},  {"markup": "`Python`", "interpretAs": "Python"},  {"text": " for programming."}]}

which does not cause any whitespace issue:

Output:
{
  "software": {
    "name": "LanguageTool",
    "version": "6.6.13",
    "buildDate": "2025-02-19 14:09:51 +0000",
    "apiVersion": 1,
    "premium": true,
    "premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
    "status": ""
  },
  "warnings": {
    "incompleteResults": false
  },
  "language": {
    "name": "English (US)",
    "code": "en-US",
    "detectedLanguage": {
      "name": "English (US)",
      "code": "en-US",
      "confidence": 0.9999978,
      "source": "ngram"
    }
  },
  "matches": [
    {
      "message": "Possible spelling mistake found.",
      "shortMessage": "Spelling mistake",
      "replacements": [
        {
          "value": "use"
        },
        {
          "value": "uses"
        },
        {
          "value": "USSR"
        },
        {
          "value": "SSE"
        },
        {
          "value": "USS"
        },
        {
          "value": "ASSE"
        },
        {
          "value": "BSSE"
        },
        {
          "value": "USE"
        },
        {
          "value": "USSB"
        },
        {
          "value": "USSS"
        },
        {
          "value": "esse"
        }
      ],
      "offset": 3,
      "length": 4,
      "context": {
        "text": "We usse `Python` for programming.",
        "offset": 3,
        "length": 4
      },
      "sentence": "We usse Python for programming.",
      "type": {
        "typeName": "UnknownWord"
      },
      "rule": {
        "id": "MORFOLOGIK_RULE_EN_US",
        "description": "Possible spelling mistake",
        "issueType": "misspelling",
        "category": {
          "id": "TYPOS",
          "name": "Possible Typo"
        },
        "isPremium": false,
        "confidence": 0.68
      },
      "ignoreForIncompleteSentence": false,
      "contextForSureMatch": 0
    }
  ],
  "sentenceRanges": [
    [
      0,
      33
    ]
  ],
  "extendedSentenceRanges": [
    {
      "from": 0,
      "to": 31,
      "detectedLanguages": [
        {
          "language": "en",
          "rate": 1
        }
      ]
    }
  ]
}
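With the crate's constructors, that request could be built roughly like this (a sketch; I'm assuming Data can be collected from DataAnnotation values):

```rust
let data = Data::from_iter([
    DataAnnotation::new_text("We usse "),
    // Sent as markup but interpreted as plain text, so no spurious
    // whitespace errors appear around the inline code.
    DataAnnotation::new_interpreted_markup("`Python`", "Python"),
    DataAnnotation::new_text(" for programming."),
]);
```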

For things like math, that cannot be interpreted, things are complex, and your solution actually makes some sense. I spent a few minutes looking at various repositories, and LTeX, a very popular vscode extension, inserts dummy variables like Dummy or Dummys (ref: https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/latex/LatexAnnotatedTextBuilder.kt and https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/DummyGenerator.kt), but it looks relatively complex to do that.

Collaborator Author

Let's just replace them with new_markup for the moment. We will probably have whitespace issues, but this rule can be disabled anyway.

You're sure that's how you prefer to do it? The way it is now seems better than having to disable the whitespace rule entirely.

Owner

Multiple whitespaces are usually not an issue in those languages (Markdown, LaTeX, Typst), so it makes sense to disable this rule anyway (e.g., see this). We might change this in the future, but let's not focus too much on this now.

Collaborator Author

@jeertmans Making it just markup like you suggest is giving some errors other than just consecutive whitespace:

error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
 --> [path]
  |
8 | ...he glacier melting models established in @glacier-melt.  bibliography("works.bib")  The flow ra...
  |                                            ^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
  |                                            --------------- help: .
  |
error[TO_NON_BASE]: The verb after “to” should be in the base form as part of the to-infinitive. A verb can take many forms, but the base form is always used in the to-infinitive.
 --> [path]
  |
2 | = Code blocks  Adding `rbx` to `rcx` gives the desired result.  What is ```rust fn...
  |                                      ^^^^^ 'to' + non-base form
  |                                      ----- help: give
  |
error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
  --> [path]
   |
18 | ... defined through the recurrence relation $F_n = F_(n-1) + F_(n-2)$. It can also be expressed in _closed for...
   |                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
   |                                            --------------------------- help: .
   |

Still want to proceed with that?

Owner

Yes, we will leave this as a future improvement. As said, I don't mind if it throws more errors than it should; I prefer false positives to false negatives.

note: once again, my apologies for the delay in reviewing, I was very busy with other things…

Collaborator Author

All good, there's no rush on any of this.

I made a commit just accepting the false positives for those snapshots then.

@Rolv-Apneseth
Collaborator Author

Rolv-Apneseth commented Mar 12, 2025

Any idea what would cause the error?

command=`"/home/runner/work/languagetool-rust/languagetool-rust/target/debug/ltrs" "check" "-t" "en-US" "\"I am a star.\""`
code=2
stdout=""
stderr=```
error: invalid value \'\"I am a star.\"\' for \'[FILENAMES]...\': invalid filename (got \'\"I am a star.\"\', does not exist or is not a file)
```

The code doesn't seem to include that "en-US":

let assert = cmd
    .arg("check")
    .arg("-t")
    .arg("\"some text that is given as text\"")
    .assert();

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

@Rolv-Apneseth
Collaborator Author

Looks like previous versions provide different results - what do you think we should do? Drop the snapshots approach altogether?

@jeertmans
Owner

jeertmans commented Mar 13, 2025

Looks like previous versions provide different results - what do you think we should do? Drop the snapshots approach altogether?

I think we should only generate and check snapshots for the latest version of LanguageTool.

See 727ca6a

@Rolv-Apneseth
Collaborator Author

I think we should only generate and check snapshots for the latest version of LanguageTool.

Fair enough

@Rolv-Apneseth
Collaborator Author

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

Also - to have the correct snapshots (to match the CI), I needed to run tests against a self-hosted API using the same options that the CI uses. This should probably be noted in e.g. a CONTRIBUTING.md, along with maybe a docker-compose.yml for easy hosting and a bash script for running the tests?

Might be relevant, but in frankfurte-rs I just forced all tests to run against a local API, included a docker-compose.yml, and used just for automating the containers and running tests. Happy to implement this any way you want though, of course.

I think it would at least make sense to have tests require a local API to run, and a comprehensive contributing section / file which outlines all the tests, linting, and formatting that can be run locally to avoid errors from the CI.

@jeertmans
Owner

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

Also - to have the correct snapshots (to match the CI), I needed to run tests against a self-hosted API using the same options that the CI uses. This should probably be noted in e.g. a CONTRIBUTING.md, along with maybe a docker-compose.yml for easy hosting and a bash script for running the tests?

Might be relevant, but in frankfurte-rs I just forced all tests to run against a local API, included a docker-compose.yml, and used just for automating the containers and running tests. Happy to implement this any way you want though, of course.

I think it would at least make sense to have tests require a local API to run, and a comprehensive contributing section / file which outlines all the tests, linting, and formatting that can be run locally to avoid errors from the CI.

Yes, it makes sense to only run API tests on a local server, for performance and to avoid spamming the public API. We could probably achieve that with fixtures, see https://docs.rs/rstest/latest/rstest/attr.fixture.html

Either we try to build a server client from environment variables, so we never use the default API, or we run a docker command (but then it requires docker to be properly set up) to start a server.
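As a rough sketch of the fixture idea (the from_env-style constructor is hypothetical here, standing in for something that fails when the variables are missing):

```rust
use rstest::{fixture, rstest};

#[fixture]
fn client() -> ServerClient {
    // Hypothetical env-only constructor: never fall back to the public API.
    ServerClient::from_env().expect("API tests require a local LanguageTool server")
}

#[rstest]
fn test_check_request(client: ServerClient) {
    // ...run check requests against the locally hosted server...
}
```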

@Rolv-Apneseth
Collaborator Author

Either we try to build a server client from environment variables

What do you mean by this?

This crate could be interesting though: testcontainers-rs

@jeertmans
Owner

Either we try to build a server client from environment variables

What do you mean by this?

Currently, we build a client by first reading environment variables and, if they are not specified, we use the public LT API. We could change the constructor to only rely on environment variables, and fail if they are missing.
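A sketch of that constructor change (the variable names and the new signature are assumptions):

```rust
use std::env;

impl ServerClient {
    /// Build a client from environment variables only, failing if they are unset.
    pub fn from_env_strict() -> Result<Self, env::VarError> {
        let hostname = env::var("LANGUAGETOOL_HOSTNAME")?;
        let port = env::var("LANGUAGETOOL_PORT")?;
        Ok(Self::new(&hostname, &port))
    }
}
```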

This crate could be interesting though: testcontainers-rs

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?
Here, we only need to fail tests if they are not started on a local server, or at least a server that is specified through environment variables. Then, for local testing, we can provide a command alias that runs tests inside a container that automatically starts a LT server.

@Rolv-Apneseth
Collaborator Author

We could change the constructor to only rely on environment variables, and fail if they are missing.

I see - so just fail the test if the env vars aren't set. I'm not sure that's the most ergonomic; if we're gonna require that a local API is running, we may as well hard-code where we expect it to be, in my opinion.

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?

It looks like the intended use case is spinning up a database instance for each integration test, but (without having looked into it) I'd assume we can spin up a single container and store it in a LazyLock to use for each individual test? Would have to explore the crate a bit.
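Roughly the pattern I have in mind (a sketch; start_languagetool_container is a hypothetical helper that starts the container once and returns the server URL):

```rust
use std::sync::LazyLock;

// One shared container for the whole test binary, started on first use.
static SERVER_URL: LazyLock<String> = LazyLock::new(start_languagetool_container);
```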

Anyway that's probably for a separate PR - I'll do test related stuff next

@jeertmans
Owner

We could change the constructor to only rely on environment variables, and fail if they are missing.

I see - so just fail the test if the env vars aren't set. I'm not sure that's the most ergonomic; if we're gonna require that a local API is running, we may as well hard-code where we expect it to be, in my opinion.

That's probably not the best option, but it is relatively easy to implement and does not require additional dependencies.

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?

It looks like the intended use case is spinning up a database instance for each integration test, but (without having looked into it) I'd assume we can spin up a single container and store it in a LazyLock to use for each individual test? Would have to explore the crate a bit.

Probably (not sure if we need a lock, though).

Anyway that's probably for a separate PR - I'll do test related stuff next

Yes, let's not spend too much time now setting up an automated local docker setup, as people can still run tests with GitHub workflows in the meantime.

@jeertmans
Owner

Looks about right @Rolv-Apneseth, thanks! Any ideas / things you wanted to add to this PR?

@Rolv-Apneseth
Collaborator Author

Not that I can think of for this PR anyway

@jeertmans
Owner

Perfect, thanks for your contribution!!

@jeertmans jeertmans merged commit 0614534 into jeertmans:v3 Apr 9, 2025
22 of 25 checks passed