KnowOrNot is an open-source framework that enables users to develop their own customized evaluation data and pipelines for assessing out-of-knowledge-base robustness, i.e. whether large language models (LLMs) can properly recognize the boundaries of their knowledge and abstain from answering when they don't know the answer.
- Unified, high-level API that streamlines setting up and running robustness benchmarks (only a source document is required to get the pipeline running)
- Modular architecture that emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings
- Rigorous data modeling design that ensures experiment reproducibility, reliability, and traceability
- Comprehensive suite of tools for users to customize their pipelines
- Create and activate a virtual environment, then install `uv`.

  ```bash
  python3 -m venv knowornot
  source knowornot/bin/activate
  pip install uv
  ```
- Download the source code and enter the created source directory.

  ```bash
  git clone git@github.com:govtech-responsibleai/KnowOrNot.git
  cd KnowOrNot
  ```
- Install the library.

  ```bash
  uv pip install .
  ```
- Set up environment variables in a `.env` file, depending on the LLM provider of choice. The sample script `example/quickstart_pipeline.py` depends on OpenAI and therefore requires OpenAI environment variables. Refer to `env.example` for an example (a minimal illustration follows these steps).

- Run a sample evaluation pipeline.

  ```bash
  uv run python example/quickstart_pipeline.py
  ```

Refer to `quickstart.md` for more information and `quickstart_pipeline.py` for an end-to-end example flow.
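The authoritative list of required variables lives in `env.example`; as a rough illustration only, an OpenAI-based run would typically need something like the following (the `OPENAI_API_KEY` name is an assumption based on the standard OpenAI SDK, not taken from this repository):

```
# .env -- illustrative only; consult env.example for the actual variable names
OPENAI_API_KEY=<your-openai-api-key>
```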
KnowOrNot supports multiple LLM providers; register the client for your provider of choice (a registration sketch follows the list):

- OpenAI: use the `add_openai()` method
- Gemini API: use the `add_gemini()` method
- Azure: use the `add_azure()` method
- OpenRouter: use the `add_openrouter()` method
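As a rough sketch of how registration might look in a script (the `KnowOrNot` entry-point class name and the zero-argument calls are assumptions; only the `add_*()` method names come from the list above, and `example/quickstart_pipeline.py` shows the real flow):

```python
from knowornot import KnowOrNot  # assumed import path

kon = KnowOrNot()         # hypothetical entry-point class
kon.add_openai()          # register an OpenAI client (assumed to read credentials from .env)
# kon.add_gemini()        # Gemini API
# kon.add_azure()         # Azure OpenAI
# kon.add_openrouter()    # OpenRouter
```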
Experiments can be run and evaluated either asynchronously or synchronously (a sketch follows the list):

- Asynchronous: use the `run_experiment_async` and `evaluate_experiment_async` methods
- Synchronous: use the `run_experiment` and `evaluate_experiment` methods
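A hedged sketch of the two invocation styles; the argument and return shapes are placeholders, and only the method names are taken from the list above:

```python
import asyncio

async def run_async(kon, experiment_input):
    # Await the asynchronous variants, e.g. when driving many experiments concurrently.
    outputs = await kon.run_experiment_async(experiment_input)
    return await kon.evaluate_experiment_async(outputs)

def run_sync(kon, experiment_input):
    # The synchronous variants are convenient in simple, linear scripts.
    outputs = kon.run_experiment(experiment_input)
    return kon.evaluate_experiment(outputs)

# results = asyncio.run(run_async(kon, experiment_input))
```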
Two approaches are available for generating questions that probe the knowledge boundary (a plain-Python illustration of leave-one-out follows the list):

- Leave-one-out: Remove a fact from the model's context and test whether it can still answer correctly (measures memorization)
- Random (Synthetic): Create questions the model shouldn't know based on the provided context (tests abstention capability)
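To make the leave-one-out idea concrete, here is a conceptual, plain-Python illustration of the held-out loop (this is not the library's API):

```python
facts = ["fact A", "fact B", "fact C"]  # toy stand-ins for extracted facts

for i, held_out in enumerate(facts):
    context = facts[:i] + facts[i + 1:]  # every fact except the one under test
    # Ask the model a question whose answer is `held_out`, giving it only `context`.
    # A model that still answers correctly has likely memorized the fact;
    # a well-calibrated model should abstain instead.
    print(f"held out: {held_out!r}; context: {context}")
```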
Four retrieval strategies determine what context the model sees at question time (a selection sketch follows the list):

- DIRECT: No context provided; tests raw model knowledge
- BASIC_RAG: Provides semantically relevant context using embedding similarity
- LONG_IN_CONTEXT: Provides all available context
- HYDE_RAG: Uses hypothetical document embeddings (HyDE) for more effective retrieval
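Strategy selection might look something like the following; the `RetrievalType` enum and the `create_experiment` method shown here are purely hypothetical placeholders, and only the four strategy names come from the list above:

```python
from knowornot import RetrievalType  # hypothetical import

experiment = kon.create_experiment(           # hypothetical method
    questions=questions,
    retrieval_type=RetrievalType.BASIC_RAG,   # or DIRECT, LONG_IN_CONTEXT, HYDE_RAG
)
```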
For evaluation, you can (a small metric sketch follows the list):

- Create custom evaluation metrics
- Measure abstention rates, hallucination tendencies, and answer accuracy
- Compare performance across different retrieval methods and models
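For instance, an abstention rate is simply the fraction of evaluated responses judged to be abstentions; a plain-Python sketch (the label values are hypothetical, not KnowOrNot's own schema):

```python
def abstention_rate(labels: list[str]) -> float:
    """Fraction of responses labelled 'abstain' (a hypothetical label value)."""
    return sum(label == "abstain" for label in labels) / len(labels) if labels else 0.0

print(abstention_rate(["abstain", "answer", "abstain", "answer"]))  # 0.5
```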
KnowOrNot consists of several integrated components that generate and manage the pipeline's data artifacts (a rough data-flow sketch follows the list):
- FactManager: Extracts structured facts from documents
- QuestionExtractor: Generates diverse question-answer pairs
- ExperimentManager: Creates and runs knowledge boundary experiments
- RetrievalStrategies: Implements different context retrieval methods
- Evaluator: Assesses model responses with customizable metrics
- DataLabeller: Orchestrates the human labelling process to validate LLM evaluations
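A rough picture of how data flows through these components; the ordering is inferred from the descriptions above and method calls are deliberately omitted:

```python
# source document
#   -> FactManager           extracts structured facts
#   -> QuestionExtractor     turns facts into question-answer pairs
#   -> ExperimentManager     builds and runs knowledge-boundary experiments,
#                            using RetrievalStrategies to assemble per-question context
#   -> Evaluator             scores model responses with customizable metrics
#   -> DataLabeller          collects human labels to validate the LLM evaluations
```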
This project is licensed under the Creative Commons Attribution 4.0 International License.