Unified Corpus Explorer

Making UIMA-annotated corpora tangible, searchable and vivid.

We introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. UCE takes NLP annotations in the UIMA format as its standardized input, builds its interfaces and features around those annotations, and adapts dynamically to each corpus and its extracted annotations.

Running UCE Instances

UCE is used by different projects to visualize their corpora and to provide a generic yet flexible web portal for their users. Here we list some of those UCE instances.

Url	Project	Description
URL	BIOfid	The Specialised Information Service Biodiversity Research (BIOfid) provides access to current and historical biodiversity literature.
URL	PrismAI	A dataset for the systematic detection of AI-generated text, containing both English and German texts from 8 domains, synthesized using state-of-the-art LLMs.

Quick Start

Tip

Please consult the documentation page for a more detailed and customizable setup guide. The Quick Start is just that: a short guide that sets up a default UCE instance. Chances are you will want to customize UCE and will need to understand its possibilities beyond this simple quick start.

Usage

When building from source, clone this repository:

git clone https://github.com/texttechnologylab/UCE.git

In the root folder, create a .env file that holds the variables for the docker-compose.yaml file. Example .env:

UCE_CONFIG_PATH=./../uceConfig.json
JVM_ARGS=-Xmx8g
TDB2_DATA=./../tdb2-database
TDB2_ENDPOINT=tdb2-database-name
IMPORTER_THREADS=1
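
Before starting the containers, it can help to confirm that the .env file defines the variables shown in the example above. The following is a minimal sketch, not part of UCE: the helper names `parse_env` and `missing_keys` are hypothetical, and treating all five keys as required is an assumption based only on the example.

```python
# Keys copied from the example .env above. Whether docker-compose.yaml strictly
# requires every one of them is an assumption of this sketch.
EXPECTED_KEYS = [
    "UCE_CONFIG_PATH",
    "JVM_ARGS",
    "TDB2_DATA",
    "TDB2_ENDPOINT",
    "IMPORTER_THREADS",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str) -> list[str]:
    """Return the expected keys that the given .env content does not define."""
    found = parse_env(text)
    return [key for key in EXPECTED_KEYS if key not in found]


if __name__ == "__main__":
    from pathlib import Path

    env_path = Path(".env")  # the file in the repository root
    if env_path.exists():
        print("Missing keys:", missing_keys(env_path.read_text(encoding="utf-8")))
    else:
        print("No .env file found in the current directory.")
```

Run from the repository root, the script prints the keys that are absent from .env, or an empty list if all are present.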

Start the relevant docker containers:

docker-compose up --build uce-postgresql-db uce-web

Optional containers, if applicable to your use-case: [uce-fuseki-sparql], [uce-rag-service]

Warning

If the web portal container can't connect to the database, check the connection strings in the common.conf file. For the docker setup, the content of this file should match common-release.conf.

The web instance is, by default, reachable at http://localhost:8008. If you're looking for a small demo without setting one up yourself, please check our open demo.
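
To check from a script that the portal answers on the default address, a small probe along the following lines may be enough. This is only a sketch using the Python standard library: the function name `is_reachable` is hypothetical, and it assumes that any HTTP response, even an error status, means the container is up.

```python
import urllib.error
import urllib.request

# Opener without proxy handling, so that a locally running instance is
# contacted directly even if proxy environment variables are set.
_DIRECT_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at the given URL."""
    try:
        with _DIRECT_OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar problems.
        return False


if __name__ == "__main__":
    # Default address taken from the text above.
    print("UCE reachable:", is_reachable("http://localhost:8008"))
```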

Import Data

Now that the web portal and database are both running, start the uce-importer docker container from the same compose file to import data. To do so, first:

1. Create a folder choose_any_name that you can mount into the docker container.
2. Inside it, create a subfolder input and copy all of the annotated UIMA XMI files that you want to import into that subfolder.
3. Copy the default uce.common/src/main/resources/corpusConfig.json file from the source code into the choose_any_name folder.
4. Inside the docker-compose.yaml, find the uce-importer service and mount path/to/choose_any_name to /app/input/corpora/choose_any_name inside the container (an example can be found in the compose file).
Finally, start the importer and import your corpus:

docker-compose up --build uce-importer
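
The folder preparation steps above can also be scripted. The following is a minimal sketch under stated assumptions: the function name `prepare_corpus_folder` and its parameters are hypothetical, and the layout (an input subfolder next to corpusConfig.json) simply mirrors the steps described in this section. It does not edit docker-compose.yaml, so the mount still has to be added by hand.

```python
import shutil
from pathlib import Path


def prepare_corpus_folder(corpus_dir: Path, xmi_files: list[Path], corpus_config: Path) -> Path:
    """Create the corpus folder layout described above and return the input folder.

    corpus_dir    -- the folder to mount into the importer (choose_any_name)
    xmi_files     -- annotated UIMA XMI files to import
    corpus_config -- a copy of the default corpusConfig.json from the source code
    """
    input_dir = corpus_dir / "input"
    input_dir.mkdir(parents=True, exist_ok=True)

    # Copy every XMI file into the input subfolder, keeping its file name.
    for xmi in xmi_files:
        shutil.copy2(xmi, input_dir / xmi.name)

    # The corpus configuration sits next to the input subfolder.
    shutil.copy2(corpus_config, corpus_dir / "corpusConfig.json")
    return input_dir


if __name__ == "__main__":
    # Hypothetical paths for illustration; adjust them to your own setup.
    source = Path("my_annotated_xmi")
    prepare_corpus_folder(
        Path("choose_any_name"),
        sorted(source.glob("*.xmi")),
        Path("UCE/uce.common/src/main/resources/corpusConfig.json"),
    )
```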

Important

More information about corpusConfig.json, uceConfig.json, annotations, enabling the RAGbot and other customizations can be found on the documentation page.

Development

To set up UCE in a development environment, refer to our documentation. If you want to contribute to UCE, also read through our Developer Code.

About

UCE is customizable in terms of annotations imported, corporate identity used, and background information added. It allows the creation of a specific UCE instance for your project, regardless of the domain. It does so by utilizing UIMA-annotated corpora, with the primary tool for creating those being the Docker Unified UIMA Interface (DUUI). Hence, you would gather your corpus, use DUUI to annotate whatever you want to annotate, and finally import those annotations into UCE to host them.

Microservices

UCE consists of several microservices, each dockerized and built on distinct technologies, as outlined below:

Microservice	Description
A: Corpus-Importer	UCE is based on Corpus-Importer, a Java application that reads UIMA-annotated documents from a specified path, along with a corresponding corpus-configuration JSON file. The importer extracts the raw data and the configured annotations, applying its own post-processing to set up the environment, which includes text segmentation, database indexing, keyword extraction, and the creation of various embedding spaces, before finally storing each processed document in a PostgreSQL database (B).
B: Relational Database	As our primary database, we opted for a relational PostgreSQL database, as UCE requires a structured and standardized database schema that can be extended if necessary. Additionally, its compatibility with the pgvector extension enables efficient vector operations directly within the database engine. This allows us to store high-dimensional vector embeddings within relational data tables while also enabling fast vector operations and searches.
C: Graph Database	In addition to a relational database (B), UCE utilizes an Apache Jena SPARQL database to incorporate basic semantic searches in the Resource Description Framework (RDF) and Web Ontology Language (OWL) data formats. This integration enables the incorporation of domain-specific ontologies (e.g., biological taxonomy) into the UCE environment, further enriching its search capabilities.
D: Python Webserver	Within UCE, we also utilize a Python web service to provide an interface to machine learning and AI models, as these are primarily accessible through Python. In this context, the web server facilitates access to the generation of embedding vectors, their dimensionality reduction methods, such as t-SNE and PCA, and the inference of (Large) Language Models. The web server is accessible via a REST API and is utilized by services (A) and (E).
E: UCE Web Portal	The user interacts with UCE and all of its features through a web portal implemented in Java. This service communicates with all other services except for (A), providing a variety of search methods, visualization features, and different ways to interact with the underlying information units, as outlined in detail in Section 3.2.
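
To illustrate the kind of vector search that row (B) refers to, the following sketch ranks stored embeddings by cosine distance to a query vector in plain Python. It is not UCE's implementation: in UCE this work happens inside PostgreSQL via pgvector, which offers a cosine-distance operator (`<=>`). The function names and toy vectors are hypothetical, and which distance measure UCE actually uses is an assumption here.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: list[float], embeddings: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k embeddings closest to the query vector."""
    ranked = sorted(embeddings.items(), key=lambda item: cosine_distance(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy two-dimensional embeddings; real embeddings have far more dimensions.
    store = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
    print(nearest([1.0, 0.1], store, k=2))
```

A database-side index avoids the full scan that this sketch performs on every query, which is one reason to keep the embeddings in the database engine.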

In Medias Res

Some, but not all of the search and visualization features within UCE:
