This Recogito Studio (RS) plugin adds the ability to perform Named Entity Recognition (NER) on plain text and TEI documents. This plugin adds a new option to the document card menu on the project home page:
A document must either be public or owned by you in RS for you to be able to perform this operation. Once selected, you are presented with options to configure the name of the NER'ed document, which NER model to use, and the language of the document.
Once the NER is completed, a new document is added to your project which contains the named entities as read-only annotations. In the case of a plain text document, the produced document is encoded in TEI with the annotations added as a standoff element which can be interpreted by RS. Performing NER on a TEI document produces a new TEI document with a new standoff element containing the NER annotations.
NER can be a long-running operation, so this plugin makes use of the Trigger.dev background job runner. While it is easiest to use the cloud-based service, Trigger.dev can also be self-hosted. Please see the Trigger.dev documentation for guidance on self-hosting.
Whether you use the cloud service or self-host, the plugin requires that the following environment variables are set on the deployed Trigger.dev project. Note that these refer to the example Stanford CoreNLP services, which are described below.
CORENLP_URL_EN=<url of CoreNLP English service>
CORENLP_URL_DE=<url of CoreNLP German service>
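As an illustration of how these variables might be consumed, the following sketch resolves a service URL from a language code. The function name `coreNlpUrlFor` and the convention of upper-casing the language code to build the variable name are assumptions made for this example, not the plugin's actual implementation. Only the two variable names come from the configuration above.

```typescript
// Hypothetical helper: look up the CoreNLP service URL for a language code.
// Only CORENLP_URL_EN and CORENLP_URL_DE are named in this README; the
// CORENLP_URL_<LANG> lookup convention is an assumption for illustration.
type Env = Record<string, string | undefined>;

function coreNlpUrlFor(language: string, env: Env): string {
  const key = `CORENLP_URL_${language.toUpperCase()}`;
  const url = env[key];
  if (!url) {
    // Fail early so a missing variable is reported before any NER call.
    throw new Error(`Missing environment variable ${key}`);
  }
  return url;
}
```

In a Node.js task you would typically pass `process.env` as the second argument.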
To deploy the required tasks to your Trigger.dev job runner, you will need to update your trigger.config.ts file located in the /src directory.
import { defineConfig } from '@trigger.dev/sdk/v3';

export default defineConfig({
  project: 'proj_fyeypkhgyaejpiweobwq',
  runtime: 'node',
  logLevel: 'log',
  // The max compute seconds a task is allowed to run. If the task run exceeds this duration, it will be stopped.
  // You can override this on an individual task.
  // See https://trigger.dev/docs/runs/max-duration
  maxDuration: 3600,
  retries: {
    enabledInDev: true,
    default: {
      maxAttempts: 3,
      minTimeoutInMs: 1000,
      maxTimeoutInMs: 10000,
      factor: 2,
      randomize: true,
    },
  },
  dirs: ['./trigger'],
});
Set the project attribute to the Project ref, which you can find on your Trigger.dev project's Project settings tab.
Then set the URL for your Trigger.dev job runner in your local .env file:
TRIGGER_SERVER_URL=<your trigger.dev url>
Now deploy your tasks to the Trigger.dev server by executing the following command at the root of this project repo:
npx trigger.dev@latest deploy -c ./src/trigger.config.ts
This will build containers for your tasks and deploy them to the Trigger.dev job runner.
Once complete, you should see your tasks on the Tasks tab of your Trigger.dev project.
The repository contains an example docker-compose YML file that deploys English and German NLP services. These feature fairly comprehensive NER capabilities.
Adding additional NER service endpoints requires the following steps:
NERMenuExtension.tsx contains the NEROptions object.
const NEROptions: { value: string; label: string }[] = [
{ value: 'stanford-core', label: t['Stanford Core NLP'] },
];
It is currently configured to only offer the Stanford Core NLP service. Add new values and labels to include additional NER services.
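A minimal sketch of what an extended list could look like follows. The `my-new-ner-service` value, its label, and the stand-in `t` translation object are assumptions for illustration; in the plugin, `t` comes from its own translation setup.

```typescript
// Stand-in for the plugin's translation lookup (assumption for this sketch).
const t: Record<string, string> = {
  'Stanford Core NLP': 'Stanford Core NLP',
  'My New NER Service': 'My New NER Service',
};

const NEROptions: { value: string; label: string }[] = [
  { value: 'stanford-core', label: t['Stanford Core NLP'] },
  // Hypothetical additional service. The value must match the string that
  // NERAgentEndpoint.ts compares against body.model.
  { value: 'my-new-ner-service', label: t['My New NER Service'] },
];
```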
The top-level stanfordCore.ts task calls the sub-tasks that implement the NER pipeline. For a new endpoint, you would implement a new task using stanfordCore.ts as a template. The basic steps are:
- Create a Supabase client
- Download the file for NER from Supabase
- Convert to plain text. Two different subtasks handle this, depending on whether the file is plain text or TEI XML.
- Call your new endpoint task. How this task works will depend on how the endpoint works, but the important requirement is that your new task returns the same NERResults structure as the example doStanfordNlp.ts task:
export type TagTypes =
  | 'persName'
  | 'orgName'
  | 'placeName'
  | 'settlement'
  | 'country'
  | 'region'
  | 'date';

export type NEREntry = {
  text: string;
  startIndex: number;
  endIndex: number;
  localizedTag: string;
  inlineTag: TagTypes;
  attributes?: { [key: string]: string };
};

export type NERResults = {
  entries: NEREntry[];
};
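To illustrate the requirement that a new endpoint task return the `NERResults` structure, here is a sketch that converts a hypothetical service response into that shape. The `RawEntity` type, its field names, and the label-to-tag mapping describe an imaginary endpoint and are assumptions; using the raw label as `localizedTag` is also a guess. The NER types are repeated so the block is self-contained.

```typescript
type TagTypes =
  | 'persName' | 'orgName' | 'placeName' | 'settlement'
  | 'country' | 'region' | 'date';

type NEREntry = {
  text: string;
  startIndex: number;
  endIndex: number;
  localizedTag: string;
  inlineTag: TagTypes;
  attributes?: { [key: string]: string };
};

type NERResults = { entries: NEREntry[] };

// Hypothetical response item from a new NER endpoint (assumed shape).
type RawEntity = { surface: string; begin: number; end: number; label: string };

// Assumed mapping from the endpoint's labels to the supported tags.
const LABEL_TO_TAG = new Map<string, TagTypes>([
  ['PERSON', 'persName'],
  ['ORGANIZATION', 'orgName'],
  ['LOCATION', 'placeName'],
  ['DATE', 'date'],
]);

function toNERResults(raw: RawEntity[]): NERResults {
  const entries: NEREntry[] = [];
  for (const e of raw) {
    const inlineTag = LABEL_TO_TAG.get(e.label);
    if (inlineTag === undefined) continue; // skip labels with no matching tag
    entries.push({
      text: e.surface,
      startIndex: e.begin,
      endIndex: e.end,
      localizedTag: e.label,
      inlineTag,
    });
  }
  return { entries };
}
```

Entities whose label has no counterpart in `TagTypes` are dropped here; a real task might instead log them or map them to a fallback tag.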
The NERAgentEndpoint.ts file receives the options from the configuration dialog. To handle your new options, update this block of code and trigger your new top level task:
if (body.model === 'stanford-core') {
  handle = await stanfordCore.trigger({
    projectId: projectId as string,
    documentId: documentId as string,
    language: body.language,
    token: body.token,
    key: supabaseAPIKey,
    serverURL: supabaseServerUrl,
    nameOut: body.nameOut,
    outputLanguage: body.outputLanguage,
  });
}
For example:
if (body.model === 'stanford-core') {
  handle = await stanfordCore.trigger({
    projectId: projectId as string,
    documentId: documentId as string,
    language: body.language,
    token: body.token,
    key: supabaseAPIKey,
    serverURL: supabaseServerUrl,
    nameOut: body.nameOut,
    outputLanguage: body.outputLanguage,
  });
} else if (body.model === 'my-new-ner-service') {
  handle = await myNewNERService.trigger({
    projectId: projectId as string,
    documentId: documentId as string,
    language: body.language,
    token: body.token,
    key: supabaseAPIKey,
    serverURL: supabaseServerUrl,
    nameOut: body.nameOut,
    outputLanguage: body.outputLanguage,
  });
}
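As the number of services grows, the chain of model checks could be replaced by a lookup table. The following is only a sketch of that alternative, not how the plugin is written. The `TriggerableTask` type and the stub tasks stand in for real Trigger.dev tasks, the field types in `NERPayload` are assumed, and `myNewNERService` is the hypothetical task from the example above.

```typescript
// Payload fields taken from the trigger calls above; their types are assumed.
type NERPayload = {
  projectId: string;
  documentId: string;
  language: string;
  token: string;
  key: string;
  serverURL: string;
  nameOut: string;
  outputLanguage: string;
};

// Minimal stand-in for a Trigger.dev task: only trigger() is modeled.
type TriggerableTask = {
  trigger: (payload: NERPayload) => Promise<{ id: string }>;
};

// Stub tasks so the sketch is self-contained; real tasks would be imported.
const stanfordCore: TriggerableTask = {
  trigger: () => Promise.resolve({ id: 'run_stanford' }),
};
const myNewNERService: TriggerableTask = {
  trigger: () => Promise.resolve({ id: 'run_my_new' }),
};

// Registry keyed by the model value sent from the configuration dialog.
const NER_TASKS = new Map<string, TriggerableTask>([
  ['stanford-core', stanfordCore],
  ['my-new-ner-service', myNewNERService],
]);

// Returns the task registered for a model value, or undefined if unknown.
function resolveTask(model: string): TriggerableTask | undefined {
  return NER_TASKS.get(model);
}
```

With this layout, adding a service means adding one registry entry, and the endpoint can reject unknown model values in a single place when `resolveTask` returns `undefined`.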
Whether you are using the cloud service or self-hosting, the procedure is the same:
npm run build
npx trigger.dev@latest deploy -c ./src/trigger.config.ts
The deploy command uses the configuration set in /src/trigger.config.ts.