
feat: use data annotations for HTML and Markdown files, and implement splitting of data annotation requests #134

Merged: 29 commits into jeertmans:v3 on Apr 9, 2025

Conversation

Rolv-Apneseth
Collaborator

Continuing on with #117

This should hopefully check off splitting for data annotation requests (and joining of the responses).

I've also converted the Markdown and HTML parsers to return data annotations.

Open to any suggestions for changes, and I'd appreciate some testing of any files you happen to have laying around. I'm sure that these basic implementations have a lot of edge cases left to cover.

Also, I've been avoiding running the checks on files against the public instance of LanguageTool for fear of getting blocked due to the number of requests. Do we have any concerns about that? I've just been running all tests against a locally hosted docker version of the API.

codspeed-hq bot commented Mar 8, 2025

CodSpeed Performance Report

Merging #134 will not alter performance

Comparing Rolv-Apneseth:data_annotations (dc81d5c) with v3 (ea24862)

Summary

✅ 6 untouched benchmarks

@jeertmans
Owner

Thanks @Rolv-Apneseth, do you have an example .typst file that you used to check that it works?

@Rolv-Apneseth
Collaborator Author

I just made an example, as I couldn't really find many online. To be honest, this is the first I've heard of the Typst format; I just tried to follow the examples they provide, so hopefully it's all syntactically correct.

example.txt

Just save it as a .typ file, since it wouldn't let me upload with that extension. It should only have one spelling mistake from what I remember (temperatre).

@jeertmans
Owner

Could you include that file in the tests folder and use it in a test?

@Rolv-Apneseth
Collaborator Author

Sure. What do you think the best way of testing would be? The CLI integration tests?

@jeertmans
Owner

Sure. What do you think the best way of testing would be? The CLI integration tests?

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).
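Something like this, as a rough sketch (assuming assert_cmd and serde_json as dev-dependencies; the exact field names on check::Response are assumptions on my side):

```rust
use assert_cmd::Command;

#[test]
fn test_check_typst_file() {
    // Run the CLI against the example file; Typst should be auto-detected.
    let output = Command::cargo_bin("ltrs")
        .unwrap()
        .arg("check")
        .arg("tests/split_example.typst")
        .arg("--raw")
        .output()
        .unwrap();

    // Parse the raw output back into a response and look for the expected error.
    let response: languagetool_rust::check::Response =
        serde_json::from_slice(&output.stdout).unwrap();
    assert!(response
        .matches
        .iter()
        .any(|m| m.rule.id == "MORFOLOGIK_RULE_EN_US"));
}
```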

@Rolv-Apneseth
Collaborator Author

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).

I was just gonna check if it contains the string, to be honest, but sure, I can try that. Alternatively, if you want it to be thorough, how would you feel about snapshot tests with something like insta?

@jeertmans
Owner

Yes, that's a good idea: you can write a test that runs ltrs check split_example.typst --raw; it should automatically detect the Typst format (and split the request into multiple because one would be too long). Then, you parse the output text into a JSON object and verify that it contains the expected errors (you can probably use serde and the check::Response struct).

I was just gonna check if it contains the string, to be honest, but sure, I can try that. Alternatively, if you want it to be thorough, how would you feel about snapshot tests with something like insta?

Insta snapshots are probably a more appropriate solution to this, good idea!
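For reference, a minimal sketch of what that could look like (run_check would be a hypothetical helper that runs ltrs check on a file and captures stdout):

```rust
#[test]
fn test_typst_snapshot() {
    // Hypothetical helper: run `ltrs check <file>` and capture stdout.
    let output = run_check("tests/split_example.typ");
    // insta stores the first run as a reviewable snapshot and diffs later runs.
    insta::assert_snapshot!(output);
}
```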

@Rolv-Apneseth
Collaborator Author

I'll need to look a bit more into the snapshot tests (tomorrow hopefully)

@jeertmans
Owner

Looks like the snapshots contain absolute paths to files, which could be an issue as they won't match the paths in the CI workflows. I see two possible solutions: (1) ignore file paths in the diff check (maybe an option of Insta.rs), or (2) print relative paths in request outputs.

@Rolv-Apneseth
Collaborator Author

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

@jeertmans
Owner

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

You can probably just filter out the path from the snapshot, see https://insta.rs/docs/filters/.
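Something along these lines (a sketch; the exact regex is an assumption about how the paths appear in the output):

```rust
insta::with_settings!({filters => vec![
    // Redact machine-specific absolute paths so snapshots match in CI.
    (r"--> .*", "--> [path]"),
]}, {
    insta::assert_snapshot!(output);
});
```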

@Rolv-Apneseth
Collaborator Author

I might have some time to investigate later, but yeah, I don't think it's an option with Insta; I'll have to look into other solutions.

You can probably just filter out the path from the snapshot, see https://insta.rs/docs/filters/.

Thanks, I'll have a look

@@ -153,7 +53,7 @@ pub fn parse_typst(file_content: impl AsRef<str>) -> Data<'static> {
         // issues. The following sentence would give an error for
         // repeated whitespace otherwise: This has ``` `backticks`
         // ``` in it
-        "_ignore_".to_string(),
+        IGNORE,
@jeertmans (Owner) commented Mar 12, 2025

Can't we just do new_markup instead of new_interpreted_markup?

If I understand the LT API correctly, new_markup will prompt LT to ignore its content, while new_interpreted_markup will make LT interpret the markup as the provided text.

See for example the output of the following request:

{"annotation":[  {"text": "A "},  {"markup": "<b>", "interpretAs": "simpl "},  {"text": "test"},  {"markup": "</b>"}  ]}

returns:

{
  "software": {
    "name": "LanguageTool",
    "version": "6.6.13",
    "buildDate": "2025-02-19 14:09:51 +0000",
    "apiVersion": 1,
    "premium": true,
    "premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
    "status": ""
  },
  "warnings": {
    "incompleteResults": false
  },
  "language": {
    "name": "English (US)",
    "code": "en-US",
    "detectedLanguage": {
      "name": "Galician",
      "code": "gl-ES",
      "confidence": 0.3392336,
      "source": "ngram"
    }
  },
  "matches": [
    {
      "message": "Possible spelling mistake found.",
      "shortMessage": "Spelling mistake",
      "replacements": [
        {
          "value": "simply"
        },
        {
          "value": "simple"
        },
        {
          "value": "simp"
        },
        {
          "value": "simps"
        }
      ],
      "offset": 2,
      "length": 2,
      "context": {
        "text": "A <b>test</b>",
        "offset": 2,
        "length": 2
      },
      "sentence": "A simpl test",
      "type": {
        "typeName": "UnknownWord"
      },
      "rule": {
        "id": "MORFOLOGIK_RULE_EN_US",
        "description": "Possible spelling mistake",
        "issueType": "misspelling",
        "category": {
          "id": "TYPOS",
          "name": "Possible Typo"
        },
        "isPremium": false,
        "confidence": 0.68
      },
      "ignoreForIncompleteSentence": false,
      "contextForSureMatch": 0
    }
  ],
  "sentenceRanges": [
    [
      0,
      13
    ]
  ],
  "extendedSentenceRanges": [
    {
      "from": 0,
      "to": 12,
      "detectedLanguages": [
        {
          "language": "en",
          "rate": 1
        }
      ]
    }
  ]
}

So I think we can remove the IGNORE placeholder value, and change all new_interpreted_markup(..., IGNORE) calls to new_markup(...), roughly as sketched below. What do you think?
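Roughly, the change I have in mind (a sketch; the constructor names follow the calls quoted above):

```rust
let markup = "```";
// Before: markup interpreted as a placeholder, so LT checks "_ignore_".
let annotation = DataAnnotation::new_interpreted_markup(markup, IGNORE);
// After: plain markup, which LT should skip entirely.
let annotation = DataAnnotation::new_markup(markup);
```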

Collaborator Author

I could adjust which tags get that ignore pattern and leave the rest as just markup, but it's required for things like code and code blocks to avoid whitespace issues, like the comment says.

Owner

I would argue that this is a non-issue: people (or us) can disable the whitespace rule, and detecting too many / too few whitespaces is sometimes a desired feature, even around markup.

Collaborator Author

What do you mean? I wouldn't expect something like "We use tool_name_here for ..." to return a whitespace error, and disabling the whitespace rule altogether doesn't seem great.

Owner

For inline markup, like backticks, I guess we could use the interpretAs variant, e.g.,

{"annotation":[  {"text": "We usse "},  {"markup": "`Python`", "interpretAs": "Python"},  {"text": " for programming."}]}

which does not cause any whitespace issue:

Output:
{
  "software": {
    "name": "LanguageTool",
    "version": "6.6.13",
    "buildDate": "2025-02-19 14:09:51 +0000",
    "apiVersion": 1,
    "premium": true,
    "premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
    "status": ""
  },
  "warnings": {
    "incompleteResults": false
  },
  "language": {
    "name": "English (US)",
    "code": "en-US",
    "detectedLanguage": {
      "name": "English (US)",
      "code": "en-US",
      "confidence": 0.9999978,
      "source": "ngram"
    }
  },
  "matches": [
    {
      "message": "Possible spelling mistake found.",
      "shortMessage": "Spelling mistake",
      "replacements": [
        {
          "value": "use"
        },
        {
          "value": "uses"
        },
        {
          "value": "USSR"
        },
        {
          "value": "SSE"
        },
        {
          "value": "USS"
        },
        {
          "value": "ASSE"
        },
        {
          "value": "BSSE"
        },
        {
          "value": "USE"
        },
        {
          "value": "USSB"
        },
        {
          "value": "USSS"
        },
        {
          "value": "esse"
        }
      ],
      "offset": 3,
      "length": 4,
      "context": {
        "text": "We usse `Python` for programming.",
        "offset": 3,
        "length": 4
      },
      "sentence": "We usse Python for programming.",
      "type": {
        "typeName": "UnknownWord"
      },
      "rule": {
        "id": "MORFOLOGIK_RULE_EN_US",
        "description": "Possible spelling mistake",
        "issueType": "misspelling",
        "category": {
          "id": "TYPOS",
          "name": "Possible Typo"
        },
        "isPremium": false,
        "confidence": 0.68
      },
      "ignoreForIncompleteSentence": false,
      "contextForSureMatch": 0
    }
  ],
  "sentenceRanges": [
    [
      0,
      33
    ]
  ],
  "extendedSentenceRanges": [
    {
      "from": 0,
      "to": 31,
      "detectedLanguages": [
        {
          "language": "en",
          "rate": 1
        }
      ]
    }
  ]
}
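With the crate's constructors, that request could be built roughly like this (a sketch; I'm assuming Data can be collected from DataAnnotation values):

```rust
let data = Data::from_iter([
    DataAnnotation::new_text("We usse "),
    // Sent as markup but interpreted as plain text, so no spurious
    // whitespace errors appear around the inline code.
    DataAnnotation::new_interpreted_markup("`Python`", "Python"),
    DataAnnotation::new_text(" for programming."),
]);
```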

For things like math, that cannot be interpreted, things are complex, and your solution actually makes some sense. I spent a few minutes looking at various repositories, and LTeX, a very popular vscode extension, inserts dummy variables like Dummy or Dummys (ref: https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/latex/LatexAnnotatedTextBuilder.kt and https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/DummyGenerator.kt), but it looks relatively complex to do that.

Collaborator Author

Let's just replace them with new_markup for the moment. We will probably have whitespace issues, but this rule can be disabled anyway.

You're sure that's how you prefer to do it? The way it is now seems better than having to disable the whitespace rule entirely.

Owner

Multiple whitespaces are usually not an issue in those languages (Markdown, LaTeX, Typst), so it makes sense to disable this rule anyway (e.g., see this). We might change this in the future, but let's not focus too much on this now.

Collaborator Author

@jeertmans Making it just markup like you suggest is giving some errors other than just consecutive whitespace:

error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
 --> [path]
  |
8 | ...he glacier melting models established in @glacier-melt.  bibliography("works.bib")  The flow ra...
  |                                            ^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
  |                                            --------------- help: .
  |
error[TO_NON_BASE]: The verb after “to” should be in the base form as part of the to-infinitive. A verb can take many forms, but the base form is always used in the to-infinitive.
 --> [path]
  |
2 | = Code blocks  Adding `rbx` to `rcx` gives the desired result.  What is ```rust fn...
  |                                      ^^^^^ 'to' + non-base form
  |                                      ----- help: give
  |
error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
  --> [path]
   |
18 | ... defined through the recurrence relation $F_n = F_(n-1) + F_(n-2)$. It can also be expressed in _closed for...
   |                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
   |                                            --------------------------- help: .
   |

Still want to proceed with that?

Owner

Yes, we will leave this as a future improvement. As said, I don't mind if it throws more errors than it should; I prefer false positives to false negatives.

note: once again, my apologies for the delay in reviewing, I was very busy with other things…

Collaborator Author

All good, there's no rush on any of this.

I made a commit just accepting the false positives for those snapshots then.

@Rolv-Apneseth
Collaborator Author

Rolv-Apneseth commented Mar 12, 2025

Any idea what would cause the error?

command=`"/home/runner/work/languagetool-rust/languagetool-rust/target/debug/ltrs" "check" "-t" "en-US" "\"I am a star.\""`
code=2
stdout=""
stderr=```
error: invalid value \'\"I am a star.\"\' for \'[FILENAMES]...\': invalid filename (got \'\"I am a star.\"\', does not exist or is not a file)
```

The code doesn't seem to include that "en-US":

let assert = cmd
    .arg("check")
    .arg("-t")
    .arg("\"some text that is given as text\"")
    .assert();

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

@Rolv-Apneseth
Collaborator Author

Looks like previous versions provide different results - what do you think we should do? Drop the snapshots approach altogether?

@jeertmans
Owner

jeertmans commented Mar 13, 2025

Looks like previous versions provide different results - what do you think we should do? Drop the snapshots approach altogether?

I think we should only generate and check snapshots for the latest version of LanguageTool.

See 727ca6a

@Rolv-Apneseth
Collaborator Author

I think we should only generate and check snapshots for the latest version of LanguageTool.

Fair enough

@Rolv-Apneseth
Collaborator Author

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

Also - to have the correct snapshots (to match the CI), I needed to run tests against a self-hosted API using the same options that the CI uses. This should probably be noted in e.g. a CONTRIBUTING.md, along with maybe a docker-compose.yml for easy hosting and a bash script for running the tests?

Might be relevant, but in frankfurte-rs I just forced all tests to run against a local API, included a docker-compose.yml, and used just for automating the containers and running tests. Happy to implement this any way you want though, of course.

I think it would at least make sense to have tests require a local API to run, and a comprehensive contributing section / file which outlines all the tests, linting, and formatting that can be run locally to avoid errors from the CI.

@jeertmans
Owner

Edit: fixed after merging v3 and fixing the language declaration - moving on to the issue with the snapshots

Also - to have the correct snapshots (to match the CI), I needed to run tests against a self-hosted API using the same options that the CI uses. This should probably be noted in e.g. a CONTRIBUTING.md, along with maybe a docker-compose.yml for easy hosting and a bash script for running the tests?

Might be relevant, but in frankfurte-rs I just forced all tests to run against a local API, included a docker-compose.yml, and used just for automating the containers and running tests. Happy to implement this any way you want though, of course.

I think it would at least make sense to have tests require a local API to run, and a comprehensive contributing section / file which outlines all the tests, linting, and formatting that can be run locally to avoid errors from the CI.

Yes, it makes sense to only run API tests on a local server, for performance and to avoid spamming the public API. We could probably achieve that with fixtures, see https://docs.rs/rstest/latest/rstest/attr.fixture.html

Either we try to build a server client from environment variables, so we never use the default API, or we run a docker command (but then it requires docker to be properly set up) to start a server.
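As a rough sketch of the fixture idea (the from_env-style constructor is hypothetical here, standing in for something that fails when the variables are missing):

```rust
use rstest::{fixture, rstest};

#[fixture]
fn client() -> ServerClient {
    // Hypothetical env-only constructor: never fall back to the public API.
    ServerClient::from_env().expect("API tests require a local LanguageTool server")
}

#[rstest]
fn test_check_request(client: ServerClient) {
    // ...run check requests against the locally hosted server...
}
```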

@Rolv-Apneseth
Collaborator Author

Either we try to build a server client from environment variables

What do you mean by this?

This crate could be interesting though: testcontainers-rs

@jeertmans
Owner

Either we try to build a server client from environment variables

What do you mean by this?

Currently, we build a client by first reading environment variables and, if they are not specified, we use the public LT API. We could change the constructor to only rely on environment variables, and fail if they are missing.
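A sketch of that constructor change (the variable names and the new signature are assumptions):

```rust
use std::env;

impl ServerClient {
    /// Build a client from environment variables only, failing if they are unset.
    pub fn from_env_strict() -> Result<Self, env::VarError> {
        let hostname = env::var("LANGUAGETOOL_HOSTNAME")?;
        let port = env::var("LANGUAGETOOL_PORT")?;
        Ok(Self::new(&hostname, &port))
    }
}
```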

This crate could be interesting though: testcontainers-rs

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?
Here, we only need to fail tests if they are not started on a local server, or at least a server that is specified through environment variables. Then, for local testing, we can provide a command alias that runs tests inside a container that automatically starts a LT server.

@Rolv-Apneseth
Collaborator Author

We could change the constructor to only rely on environment variables, and fail if they are missing.

I see - so just fail the test if the env vars aren't set. I'm not sure that's the most ergonomic; if we're gonna require that a local API is running, we may as well hard-code where we expect it to be, in my opinion.

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?

It looks like the intended use case is spinning up a database instance for each integration test, but (without having looked into it) I'd assume we can spin up a single container and store it in a LazyLock to use for each individual test? Would have to explore the crate a bit.
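Roughly the pattern I have in mind (a sketch; start_languagetool_container is a hypothetical helper that starts the container once and returns the server URL):

```rust
use std::sync::LazyLock;

// One shared container for the whole test binary, started on first use.
static SERVER_URL: LazyLock<String> = LazyLock::new(start_languagetool_container);
```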

Anyway that's probably for a separate PR - I'll do test related stuff next

@jeertmans
Owner

We could change the constructor to only rely on environment variables, and fail if they are missing.

I see - so just fail the test if the env vars aren't set. I'm not sure that's the most ergonomic; if we're gonna require that a local API is running, we may as well hard-code where we expect it to be, in my opinion.

That's probably not the best option, but it is relatively easy to implement and does not require additional dependencies.

Maybe! It looks nice, but is it meant to run a docker image that stays alive during all other tests?

It looks like the intended use case is spinning up a database instance for each integration test, but (without having looked into it) I'd assume we can spin up a single container and store it in a LazyLock to use for each individual test? Would have to explore the crate a bit.

Probably (not sure if we need a lock, though).

Anyway that's probably for a separate PR - I'll do test related stuff next

Yes, let's not spend too much time now setting up an automated local docker setup, as people can still run tests with GitHub workflows in the meantime.

@jeertmans
Owner

Looks about right @Rolv-Apneseth, thanks! Any ideas / things you wanted to add to this PR?

@Rolv-Apneseth
Collaborator Author

Not that I can think of for this PR anyway

@jeertmans
Owner

Perfect, thanks for your contribution!!

@jeertmans jeertmans merged commit 0614534 into jeertmans:v3 Apr 9, 2025
22 of 25 checks passed