feat: use data annotations for HTML and Markdown files, and implement splitting of data annotation requests #134
Conversation
CodSpeed Performance Report: Merging #134 will not alter performance.
Thanks @Rolv-Apneseth, do you have an example?
I just made an example, as I couldn't really find many online. To be honest, working with this has been the first I've heard of the … Just save as a …
Could you include that file in the …?
Sure. What do you think the best way of testing would be? The CLI integration tests?
Yes, that's a good idea: you can write a test that runs …
I was just gonna check if it contains the string, to be honest, but sure, I can try that. Alternatively, if you want it to be thorough, how would you feel about snapshot tests with something like insta?
Insta snapshots are probably a more appropriate solution to this, good idea!
I'll need to look a bit more into the snapshot tests (tomorrow hopefully).
Looks like the snapshots contain absolute paths to files, which could be an issue as they won't match the paths in the CI workflows. I see two possible solutions: (1) ignore file paths in the diff check (maybe an option of Insta.rs), or (2) print relative paths in request outputs.
I might have some time to investigate later, but yeah, I don't think that's an option with Insta, will have to look into other solutions.
You can probably just filter out the path from the snapshot, see https://insta.rs/docs/filters/.
Thanks, I'll have a look.
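For reference, the idea behind filtering is just to normalize machine-specific path prefixes before the snapshot comparison. insta's filters do this with regex replacements; a std-only sketch of the concept (the function name and `[path]` placeholder are made up here, not part of the crate) looks like:

```rust
// Std-only sketch of the path-filtering idea: replace the absolute
// project root in the CLI output with a stable placeholder before
// comparing against a snapshot, so local runs and CI runs produce
// identical snapshots. insta's filters (https://insta.rs/docs/filters/)
// achieve the same with regexes.
fn scrub_paths(output: &str, project_root: &str) -> String {
    output.replace(project_root, "[path]")
}

fn main() {
    let output =
        "error[TYPOS]: Possible spelling mistake\n  --> /home/ci/project/tests/sample.typ";
    // The second line becomes "  --> [path]/tests/sample.typ".
    println!("{}", scrub_paths(output, "/home/ci/project"));
}
```

With insta itself, the equivalent would be a regex filter (e.g. matching the workspace root) registered via its settings before asserting the snapshot.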
src/parsers/typst.rs
Outdated
@@ -153,7 +53,7 @@ pub fn parse_typst(file_content: impl AsRef<str>) -> Data<'static> {
                 // issues. The following sentence would give an error for
                 // repeated whitespace otherwise: This has ``` `backticks`
                 // ``` in it
-                "_ignore_".to_string(),
+                IGNORE,
Can't we just do `new_markup` instead of `new_interpreted_markup`? If I understand the LT API correctly, `new_markup` will prompt LT to ignore its content, while `new_interpreted_markup` will have LT interpret the markup as the provided replacement text.
See for example the output of the following request:

```json
{"annotation":[ {"text": "A "}, {"markup": "<b>", "interpretAs": "simpl "}, {"text": "test"}, {"markup": "</b>"} ]}
```

returns:
{
"software": {
"name": "LanguageTool",
"version": "6.6.13",
"buildDate": "2025-02-19 14:09:51 +0000",
"apiVersion": 1,
"premium": true,
"premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
"status": ""
},
"warnings": {
"incompleteResults": false
},
"language": {
"name": "English (US)",
"code": "en-US",
"detectedLanguage": {
"name": "Galician",
"code": "gl-ES",
"confidence": 0.3392336,
"source": "ngram"
}
},
"matches": [
{
"message": "Possible spelling mistake found.",
"shortMessage": "Spelling mistake",
"replacements": [
{
"value": "simply"
},
{
"value": "simple"
},
{
"value": "simp"
},
{
"value": "simps"
}
],
"offset": 2,
"length": 2,
"context": {
"text": "A <b>test</b>",
"offset": 2,
"length": 2
},
"sentence": "A simpl test",
"type": {
"typeName": "UnknownWord"
},
"rule": {
"id": "MORFOLOGIK_RULE_EN_US",
"description": "Possible spelling mistake",
"issueType": "misspelling",
"category": {
"id": "TYPOS",
"name": "Possible Typo"
},
"isPremium": false,
"confidence": 0.68
},
"ignoreForIncompleteSentence": false,
"contextForSureMatch": 0
}
],
"sentenceRanges": [
[
0,
13
]
],
"extendedSentenceRanges": [
{
"from": 0,
"to": 12,
"detectedLanguages": [
{
"language": "en",
"rate": 1
}
]
}
]
}
So I think we can remove the `IGNORE` placeholder value, and change all `new_interpreted_markup(..., IGNORE)` calls to `new_markup(...)`. What do you think?
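As a side note, the semantics discussed here can be sketched with a tiny model (illustrative only, not the crate's actual types): `text` is checked verbatim, plain `markup` is dropped from the checked text, and `markup` with `interpretAs` is checked as the replacement text.

```rust
// Tiny model (hypothetical, for illustration) of LanguageTool's
// data-annotation semantics.
enum Annotation {
    // `new_text`: checked verbatim.
    Text(String),
    // `new_markup`: LT ignores the content entirely.
    Markup(String),
    // `new_interpreted_markup(markup, interpret_as)`: LT checks the
    // `interpret_as` text in place of the markup.
    InterpretedMarkup(String, String),
}

// The text LanguageTool effectively checks for a list of annotations.
fn effective_text(annotations: &[Annotation]) -> String {
    annotations
        .iter()
        .map(|a| match a {
            Annotation::Text(t) => t.as_str(),
            Annotation::Markup(_) => "", // dropped from the checked text
            Annotation::InterpretedMarkup(_, interpret_as) => interpret_as.as_str(),
        })
        .collect()
}

fn main() {
    // Mirrors the request above: "A " + <b>/"simpl " + "test" + </b>.
    let annotations = [
        Annotation::Text("A ".into()),
        Annotation::InterpretedMarkup("<b>".into(), "simpl ".into()),
        Annotation::Text("test".into()),
        Annotation::Markup("</b>".into()),
    ];
    // Matches the "sentence" field in the response: "A simpl test".
    println!("{}", effective_text(&annotations));
}
```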
I could adjust which tags have that ignore pattern and leave the rest as just markup, but it's required for things like code and code blocks to avoid whitespace issues, as the comment says.
I would argue that this is a non-issue: people (or we) can disable the whitespace rule, and detecting too many / too few whitespaces is sometimes a desired feature, even around markup.
What do you mean? I wouldn't expect something like "We use `tool_name_here` for ..." to return a whitespace error, and disabling the whitespace rule altogether doesn't seem great.
For inline markup, like backticks, I guess we could use the interpret-as variant, e.g.,

```json
{"annotation":[ {"text": "We usse "}, {"markup": "`Python`", "interpretAs": "Python"}, {"text": " for programming."}]}
```

which does not cause any whitespace issue. Output:
{
"software": {
"name": "LanguageTool",
"version": "6.6.13",
"buildDate": "2025-02-19 14:09:51 +0000",
"apiVersion": 1,
"premium": true,
"premiumHint": "You might be missing errors only the Premium version can find. Contact us at support<at>languagetoolplus.com.",
"status": ""
},
"warnings": {
"incompleteResults": false
},
"language": {
"name": "English (US)",
"code": "en-US",
"detectedLanguage": {
"name": "English (US)",
"code": "en-US",
"confidence": 0.9999978,
"source": "ngram"
}
},
"matches": [
{
"message": "Possible spelling mistake found.",
"shortMessage": "Spelling mistake",
"replacements": [
{
"value": "use"
},
{
"value": "uses"
},
{
"value": "USSR"
},
{
"value": "SSE"
},
{
"value": "USS"
},
{
"value": "ASSE"
},
{
"value": "BSSE"
},
{
"value": "USE"
},
{
"value": "USSB"
},
{
"value": "USSS"
},
{
"value": "esse"
}
],
"offset": 3,
"length": 4,
"context": {
"text": "We usse `Python` for programming.",
"offset": 3,
"length": 4
},
"sentence": "We usse Python for programming.",
"type": {
"typeName": "UnknownWord"
},
"rule": {
"id": "MORFOLOGIK_RULE_EN_US",
"description": "Possible spelling mistake",
"issueType": "misspelling",
"category": {
"id": "TYPOS",
"name": "Possible Typo"
},
"isPremium": false,
"confidence": 0.68
},
"ignoreForIncompleteSentence": false,
"contextForSureMatch": 0
}
],
"sentenceRanges": [
[
0,
33
]
],
"extendedSentenceRanges": [
{
"from": 0,
"to": 31,
"detectedLanguages": [
{
"language": "en",
"rate": 1
}
]
}
]
}
For things like math, which cannot be interpreted, things are more complex, and your solution actually makes some sense. I spent a few minutes looking at various repositories, and LTeX, a very popular VS Code extension, inserts dummy words like `Dummy` or `Dummys` (ref: https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/latex/LatexAnnotatedTextBuilder.kt and https://github.com/valentjn/ltex-ls/blob/develop/src/main/kotlin/org/bsplines/ltexls/parsing/DummyGenerator.kt), but it looks relatively complex to do that.
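For illustration, a minimal sketch of that dummy-word idea (the numbering scheme and API below are assumptions for this sketch, not LTeX's actual implementation):

```rust
// Hypothetical sketch of the LTeX-style "dummy word" approach: each
// piece of uninterpretable markup (e.g. math) is replaced by a unique
// dummy noun so the surrounding sentence still parses for grammar
// checks. The words `Dummy`/`Dummys` come from LTeX; appending a
// counter to keep them unique is an assumption here.
struct DummyGenerator {
    counter: usize,
}

impl DummyGenerator {
    fn new() -> Self {
        Self { counter: 0 }
    }

    // Returns a fresh dummy word, pluralized when the markup stands in
    // for something plural, so agreement checks still pass.
    fn next(&mut self, plural: bool) -> String {
        let word = if plural {
            format!("Dummys{}", self.counter)
        } else {
            format!("Dummy{}", self.counter)
        };
        self.counter += 1;
        word
    }
}

fn main() {
    let mut generator = DummyGenerator::new();
    // e.g. "$F_n$" and "$F_(n-1)$" each become a unique dummy word:
    println!("{}", generator.next(false)); // prints "Dummy0"
    println!("{}", generator.next(false)); // prints "Dummy1"
}
```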
Let's just replace them with `new_markup` for the moment. We will probably have whitespace issues, but this rule can be disabled anyway.
You're sure that's how you prefer to do it? The way it is now seems better than having to disable the whitespace rule entirely.
Multiple whitespaces are usually not an issue in those languages (Markdown, LaTeX, Typst), so it makes sense to disable this rule anyway (e.g., see this). We might change this in the future, but let's not focus too much on this now.
@jeertmans Making it just markup like you suggest is giving some errors other than just consecutive whitespaces:

```text
error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
--> [path]
|
8 | ...he glacier melting models established in @glacier-melt. bibliography("works.bib") The flow ra...
| ^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
| --------------- help: .
|
error[TO_NON_BASE]: The verb after “to” should be in the base form as part of the to-infinitive. A verb can take many forms, but the base form is always used in the to-infinitive.
--> [path]
|
2 | = Code blocks Adding `rbx` to `rcx` gives the desired result. What is ```rust fn...
| ^^^^^ 'to' + non-base form
| ----- help: give
|
error[COMMA_PARENTHESIS_WHITESPACE]: Don’t put a space before the full stop.
--> [path]
|
18 | ... defined through the recurrence relation $F_n = F_(n-1) + F_(n-2)$. It can also be expressed in _closed for...
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use of whitespace before comma and before/after parentheses
| --------------------------- help: .
|
```
Still want to proceed with that?
Yes, we will leave this as a future improvement. As said, I don't mind if it throws more errors than it should; I prefer false positives to false negatives.
note: once again, my apologies for the delay in reviewing, I was very busy with other things…
All good, there's no rush on any of this.
I made a commit just accepting the false positives for those snapshots then.
Any idea what would cause the error?
The code doesn't seem to include that `"en-US"`:

```rust
let assert = cmd
    .arg("check")
    .arg("-t")
    .arg("\"some text that is given as text\"")
    .assert();
```

Edit: fixed after merging
Looks like previous versions provide different results - what do you think we should do? Drop the snapshots approach altogether?
I think we should only generate and check snapshots for the latest version of LanguageTool. See 727ca6a
chore(ci): fix when CI runs
Fair enough

Also - to have the correct snapshots (to match the CI), I needed to run tests against a self-hosted API using the same options that the CI uses. This should probably be noted in, e.g., a CONTRIBUTING.md, along with maybe a …

Might be relevant, but in frankfurte-rs I just forced all tests to run against a local API, included a …

I think it would at least make sense to have tests require a local API to be run, and a comprehensive contributing section / file which outlines all the tests, linting and formatting that can be run locally to avoid errors from the CI.
Yes, it makes sense to only run API tests on a local server, both for performance and to avoid spamming the public API. We could probably achieve that with fixtures, see https://docs.rs/rstest/latest/rstest/attr.fixture.html

Either we try to build a server client from environment variables, so we never use the default API, or we run a docker command (but then it requires docker to be properly set up) to start a server.
What do you mean by this? This crate could be interesting though: testcontainers-rs |
Currently, we build a client by first reading environment variables and, if they are not specified, we fall back to the public LT API. We could change the constructor to only rely on environment variables, and fail if they are missing.
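A sketch of that constructor change (the environment variable names and the helper below are assumptions for illustration, not the crate's actual API):

```rust
// Hypothetical sketch: build the server URL only from environment
// variables and error out instead of silently falling back to the
// public LanguageTool API. `lookup` abstracts over `std::env::var`
// so the logic is easy to test without touching the process env.
fn server_url(lookup: impl Fn(&str) -> Option<String>) -> Result<String, String> {
    let hostname = lookup("LANGUAGETOOL_HOSTNAME")
        .ok_or_else(|| "LANGUAGETOOL_HOSTNAME is not set".to_string())?;
    Ok(match lookup("LANGUAGETOOL_PORT") {
        Some(port) => format!("{hostname}:{port}"),
        None => hostname,
    })
}

fn main() {
    match server_url(|name: &str| std::env::var(name).ok()) {
        Ok(url) => println!("using LanguageTool server at {url}"),
        Err(e) => eprintln!("refusing to fall back to the public API: {e}"),
    }
}
```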
Maybe! It looks nice, but is it meant to run a docker image that stays alive during all the other tests?
I see - so just fail the test if the env vars aren't set. I'm not sure that's the most ergonomic; if we're gonna require that a local API is running, we may as well hard-code where we expect it to be, in my opinion.
It looks like the intended use case is spinning up a database instance for each integration test, but (without having looked into it) I'd assume we can spin up a single container and store it in a …

Anyway, that's probably for a separate PR - I'll do test-related stuff next.
That's probably not the best option, but it is relatively easy to implement and does not require additional dependencies.
Probably (not sure if we need a lock, though).
Yes, let's not spend too much time now setting up an automated local docker setup, as people can still run tests with GitHub workflows in the meantime.
…usting all storage on the CI container
Looks about right @Rolv-Apneseth, thanks! Any ideas / things you wanted to add to this PR?
Not that I can think of, for this PR anyway.
Perfect, thanks for your contribution!!
Continuing on with #117
This should hopefully check off splitting for data annotation requests (and joining of the responses).
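For context, the splitting idea can be sketched roughly like this (a simplified, hypothetical version: the real implementation also has to carry offsets so matches in the joined responses point back into the original file):

```rust
// Hypothetical sketch of splitting a data-annotation request: greedily
// pack annotations into chunks whose total text length stays under a
// maximum, so each chunk fits into one API request. Responses for the
// chunks can then be joined after shifting their match offsets.
#[derive(Debug, Clone)]
struct DataAnnotation {
    text: String,
}

fn split_annotations(
    annotations: Vec<DataAnnotation>,
    max_len: usize,
) -> Vec<Vec<DataAnnotation>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_len = 0;
    for annotation in annotations {
        // Start a new chunk when adding this annotation would overflow.
        if current_len + annotation.text.len() > max_len && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        current_len += annotation.text.len();
        current.push(annotation);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let annotations = vec![
        DataAnnotation { text: "A sentence. ".into() },
        DataAnnotation { text: "Another one. ".into() },
        DataAnnotation { text: "And a third.".into() },
    ];
    // Each annotation is ~12 chars, so a 20-char budget yields 3 chunks.
    let chunks = split_annotations(annotations, 20);
    println!("{} chunks", chunks.len()); // prints "3 chunks"
}
```

A real splitter would also avoid cutting in the middle of a sentence where possible, which is why splitting on annotation boundaries (rather than raw text) is convenient.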
I've also converted the Markdown and HTML parsers to return data annotations.
Open to any suggestions for changes, and I'd appreciate some testing of any files you happen to have lying around. I'm sure that these basic implementations have a lot of edge cases left to cover.
Also, I've been avoiding running the checks on files against the public instance of LanguageTool for fear of getting blocked due to the number of requests. Do we have any concerns about that? I've just been running all tests against a locally hosted docker version of the API.