This repository was archived by the owner on Jan 30, 2025. It is now read-only.

🤖 GPT Tests Benchmarking

JohannesFinsveen edited this page Dec 1, 2023 · 2 revisions

This wiki page covers how to test and benchmark newer GPT versions against the requirements that describe how the GPT should behave.

📃 Requirements

These requirements represent the ideal behaviour of the GPT. They are not expected to be met 100% of the time.

  1. Refers to SSB's website in each of the following response types:

    • When a 'ResponseTooLargeError' occurs.
    • When tables are listed.
    • When metadata is listed.
  2. Responds in the same language as the prompt. API calls should also be made with the matching language code. (lang: 'en' | 'no')

    Test prompts:

    • Eg treng data om KPI
    • Jeg trenger data om KPI
    • I need data on CPI
  3. Begins the table search with the original keywords from the user's prompt, and uses vocabulary knowledge and synonyms to extend the search only if no tables are returned initially.

  4. When a table requires metadata that the user has not yet provided, all valueTexts from the metadata are listed for the user to pick from.

  5. Requests for tableData should always be made with valid values from metadata.

  6. The GPT should always keep its user updated on what it is doing and why.

  7. Graphs are generated with SSB's background image (located in the knowledge files), and each graph should always come with a link for downloading it as an image.
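
Requirements 4 and 5 both depend on comparing what the GPT sends with what the table's metadata actually offers. A minimal Python sketch of such a check follows. The metadata shape (a "variables" list whose entries carry "code", "values" and "valueTexts"), the function names, and the sample values are all assumptions made for illustration; this page does not define them.

```python
from typing import Dict, List

# Made-up sample metadata. The shape is an assumption, and the values are
# illustrative only; they are not copied from any real table.
example_metadata = {
    "variables": [
        {
            "code": "Tid",
            "values": ["2023M09", "2023M10", "2023M11"],
            "valueTexts": ["2023M09", "2023M10", "2023M11"],
        },
    ]
}


def invalid_selections(
    metadata: dict, selection: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return, per variable code, the requested values missing from the metadata."""
    allowed = {v["code"]: set(v["values"]) for v in metadata["variables"]}
    problems: Dict[str, List[str]] = {}
    for code, requested in selection.items():
        if code not in allowed:
            # Unknown variable: every requested value is invalid.
            problems[code] = list(requested)
            continue
        missing = [value for value in requested if value not in allowed[code]]
        if missing:
            problems[code] = missing
    return problems


def latest_periods(metadata: dict, time_code: str, count: int) -> List[str]:
    """Pick the last `count` time values that exist in the metadata.

    Assumes the values are listed in chronological order.
    """
    if count <= 0:
        return []
    for variable in metadata["variables"]:
        if variable["code"] == time_code:
            return variable["values"][-count:]
    raise KeyError(time_code)
```

An empty result from invalid_selections means every requested value exists in the metadata. Deriving a range such as "past 2 years" from latest_periods, rather than generating time values from today's date, avoids requesting periods the table does not contain.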

🧪 Tests

These tests are prompts that should be given to the GPT version being tested.

  1. Lets look at table 13760 at the seasonal adjusted 3-months moving average for 15-74 years unemployment rate (LFS) for the past 2 years

    ✅ The GPT responds in English, and makes its calls to the SSB API with lang: 'en'.
    
    ✅ The GPT uses valid timeValues. ("Past 2 years" often confuses the GPT: it forgets to check whether the timeValues it has generated are actually available in the metadata.)
    
    ✅ The GPT returns a nice-looking graph, with SSB's background image.
    
  2. Har du noe data om årslønn fra 1995-2022?

    ✅ The GPT responds in Norwegian, and makes its calls to the SSB API with lang: 'no'.
    
    ✅ The GPT extracts metadata from the prompt, such as the period 1995-2022, so that it can be used later.
    
    ✅ The GPT uses only "årslønn" in its query on the first try. (This tests that it searches with the user's original keywords before trying synonyms and SSB vocabulary from the knowledge files.)
    
    ✅ The GPT returns a list of relevant tables, with a link to each table. (Verify that the links work.)
    
    ✅ When a table is chosen, the GPT responds with the table's metadata so that the user can pick which data to look at.
    
    ✅ The GPT returns a nice-looking graph, with SSB's background image.
    
    ✅ The GPT is able to combine the previously generated graph with another variable listed in the metadata.
    
  3. Kan vi ser nærmere på tabell 08799?

    ✅ The GPT responds in Norwegian, and makes its calls to the SSB API with lang: 'no'.
    
    ✅ The metadata for this table is expected to be too large to return, and the GPT should respond with a link to the table instead.
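
Since the requirements are not expected to be met 100% of the time, it can help to record each checklist above as pass/fail values and compare pass rates between GPT versions. The following Python sketch shows one way to do that by hand. The class, the function, and the sample results are hypothetical and are not part of the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PromptRun:
    """Manually recorded outcome of one test prompt for one GPT version."""

    prompt: str
    # Maps a short description of each check to whether it passed.
    checks: Dict[str, bool] = field(default_factory=dict)

    def pass_rate(self) -> float:
        """Fraction of checks that passed (0.0 when nothing was recorded)."""
        if not self.checks:
            return 0.0
        return sum(self.checks.values()) / len(self.checks)


def summarize(runs: List[PromptRun]) -> Dict[str, float]:
    """Map each prompt to its pass rate."""
    return {run.prompt: run.pass_rate() for run in runs}


# Illustrative results only; these are not real benchmark outcomes.
example_runs = [
    PromptRun(
        prompt="Kan vi ser nærmere på tabell 08799?",
        checks={
            "responds in Norwegian with lang 'no'": True,
            "links to the table when the metadata is too large": False,
        },
    ),
]

for prompt, rate in summarize(example_runs).items():
    print(f"{rate:.0%}  {prompt}")
```

Keeping one such record per GPT version makes it easier to see whether a new version regresses on a specific check rather than only on the overall impression.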
    