This repository contains resources and tools for evaluating LLM responses for the JUIC-IoT interface. It includes example Thing Descriptions (TDs) in both JSON-LD and Turtle formats, prompt templates, evaluation scripts, and evaluation data.
- **Code for the Evaluation**: Contains Python scripts used to evaluate responses based on Accuracy, Completeness, Reliability, and Response Time.
- **TD_Blinds.jsonld**: TD for an automated blinds device (JSON-LD).
- **TD_Lights.jsonld**: TD for a smart lighting device (JSON-LD).
- **TD_Tractorbot.jsonld**: TD for the Tractorbot robot (JSON-LD).
- **TD_Cherrybot.ttl**: TD for the Cherrybot (Turtle/TTL format).
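For orientation, the snippet below sketches the general shape of a W3C WoT Thing Description in JSON-LD. It is not taken from the bundled files; the device title, affordance names, and URLs are illustrative placeholders.

```json
{
  "@context": "https://www.w3.org/2019/wot/td/v1",
  "title": "ExampleBlinds",
  "properties": {
    "position": {
      "type": "integer",
      "forms": [{"href": "https://example.com/blinds/properties/position"}]
    }
  },
  "actions": {
    "open": {
      "forms": [{"href": "https://example.com/blinds/actions/open"}]
    }
  }
}
```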
## Inside Code for the Evaluation
- **Blinds/**: Contains evaluation results and LLM responses for the Blinds device.
- **Lights/**: Contains evaluation results and LLM responses for the Lights device.
- **Tractorbot/**: Contains evaluation results and LLM responses for the Tractorbot device.
- **Roboticarm/**: Contains evaluation results and LLM responses for the Roboticarm.
- **TXTFiles/**: Contains supporting files for the evaluation:
  - Background messages (brief interaction histories) that provide context for the LLM.
  - Human benchmark responses used as ground truth for comparison.
  - TDs split by affordance for each device (Blinds, Lights, Robotic Arm, and Tractorbot).
- **accuracy.py**: Computes the accuracy of LLM responses.
- **completeness.py**: Computes the completeness of LLM responses.
- **reliability.py**: Computes the reliability of LLM responses (a combined sketch of all three metrics follows below).
- **chat_interaction.py**: Handles communication with the LLM, including sending prompts and receiving responses (see the sketch below).
- **json_repair_helper.py**: Utility for validating and repairing JSON outputs generated by the LLM (see the sketch below).
- **main.py**: Entry point script for executing the full evaluation pipeline.
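As a rough illustration of the metric scripts, the sketch below shows one way such scores could be computed against the human benchmark responses in `TXTFiles/`. The field-level comparison, function signatures, and example data are assumptions, not the repository's actual method.

```python
def accuracy(llm_fields: dict, benchmark_fields: dict) -> float:
    """Share of benchmark fields whose values the LLM reproduced exactly."""
    if not benchmark_fields:
        return 0.0
    correct = sum(
        1 for key, value in benchmark_fields.items()
        if llm_fields.get(key) == value
    )
    return correct / len(benchmark_fields)


def completeness(llm_fields: dict, benchmark_fields: dict) -> float:
    """Share of benchmark fields the LLM answered at all, right or wrong."""
    if not benchmark_fields:
        return 0.0
    answered = sum(1 for key in benchmark_fields if key in llm_fields)
    return answered / len(benchmark_fields)


def reliability(run_scores: list) -> float:
    """Average score across repeated runs of the same prompt."""
    return sum(run_scores) / len(run_scores) if run_scores else 0.0


# Example: compare one LLM response to a (hypothetical) human benchmark.
benchmark = {"action": "open", "target": "blinds"}
llm_response = {"action": "open"}
print(accuracy(llm_response, benchmark))      # 0.5
print(completeness(llm_response, benchmark))  # 0.5
```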
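For `chat_interaction.py`, the sketch below shows the general prompt/response round trip, assuming an OpenAI-compatible HTTP chat endpoint. The URL, model name, and environment variable are placeholders, not the repository's actual configuration.

```python
import os

import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = os.environ.get("LLM_API_KEY", "")              # placeholder variable


def send_prompt(prompt: str, history: list | None = None) -> str:
    """Send a prompt (plus optional background messages) and return the reply."""
    messages = (history or []) + [{"role": "user", "content": prompt}]
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "example-model", "messages": messages},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```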
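For `json_repair_helper.py`, here is a minimal sketch of the validate-then-repair pattern: parse the LLM output as JSON and, on failure, strip common artifacts such as Markdown code fences and trailing commas before retrying. The specific fixes shown are assumptions about typical LLM output issues, not the helper's actual logic.

```python
import json
import re


def repair_json(raw: str):
    """Try to parse LLM output as JSON, applying common fixes on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strip Markdown code fences the LLM may have wrapped around the JSON.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Remove trailing commas before closing brackets or braces.
    cleaned = re.sub(r",\s*([\]}])", r"\1", cleaned)
    return json.loads(cleaned)  # raises if the output is still invalid
```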