
Conduct Large Language Base Model Selection Process

  • Question: Identify and select the most suitable Large Language Base Model (LLM) for our project through a structured evaluation process. The selected LLM should excel in:

    • Question Answering
    • Text Generation

    The evaluation process should meet the following criteria:

    • Aligns with the project requirements (PDF link).
    • Covers general criteria (evaluation matrix, scoring system) and/or quantitative benchmark tests (ARC, HellaSwag, MMLU).
    • References frameworks such as the Eleuther AI Language Model Evaluation Harness and the HuggingFace LLM Leaderboard (see the evaluation sketch after the comparison table).
    • Evaluates at least 3 base models based on:
      • Language support
      • License
      • Model size
      • Achievability
      • Complexity of prompts for training
      • Dataset requirements for retraining based on our use case
  • Results:

    Comparison of Candidate Models

Here is a comparison of some prominent base LLMs (models that are already fine-tuned are marked with *). The following points were considered when selecting the models:

  • The model should not be too large, so that fine-tuning stays manageable. As the number of parameters grows, so does the amount of data (number of rows) required for fine-tuning. Most of these models also have larger versions, so if we find we can gather enough data, we can switch to those.
  • The model should have an active, well-known community, so that if we run into an error later we have somewhere to discuss it.
  • The model should score well on the HumanEval programming benchmark. Since we deal with a large number of .yml files that define project pipelines, it is crucial that the model performs well on programming tasks.
  • Benchmarks may not be fully reliable: in HuggingFace community discussions, people have reported poor real-world performance from models with strong benchmark scores.

Considering the above, the proposed models to start with would be:

  1. Llama3_8b
  2. Gemma_7b
  3. Llama3_70b
| Model | Model size | HuggingFace Avg | ARC | HellaSwag | MMLU | HumanEval | AGIEval (chat) | License |
|---|---|---|---|---|---|---|---|---|
| Gemma | 7b | 64.3 | 61 | 82.5 | 66 | 32.3 | 41.7 | Gemma (allows redistribution provided the required notice text is included) |
| Llama3 | 70b | 77.8 | 71.42 | 85.7 | 80 | 81.7 | 63 | Llama (allows redistribution provided the required notice text is included) |
| Llama3-instruct | 8b | 66.8 | 60.7 | 78.5 | 67.07 | 62.2 | | Llama (allows redistribution provided the required notice text is included) |
| Mistral | 7b | 61 | 60 | 83 | 64 | | | Apache 2.0 |
| Calme-7B-Instruct-v0.9* | 7b | 76 | 73 | 89 | 64 | | | Apache 2.0 |
| Mixtral-8x22b-Instruct* | 141b | 79.1 | 72.7 | 89 | 77.7 | | | Apache 2.0 |
| Zephyr-orpo-141b-A35b* | 141b | NA | NA | NA | NA | NA | 44.16 | Apache 2.0 |
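
To sanity-check these figures on our own hardware, the same benchmarks can be re-run with the Eleuther AI Language Model Evaluation Harness referenced in the criteria above. The sketch below is illustrative only: `lm_eval.simple_evaluate` is the harness's Python API, but the model identifier, few-shot counts (chosen to roughly mirror the HuggingFace leaderboard setup), and batch size are assumptions rather than project decisions.

```python
# Illustrative sketch: re-running leaderboard-style benchmarks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval).
# The model name, few-shot counts, and batch size below are assumptions.
import lm_eval

# Few-shot settings roughly mirroring the HuggingFace leaderboard setup.
TASK_FEWSHOT = {"arc_challenge": 25, "hellaswag": 10, "mmlu": 5}

for task, n_shots in TASK_FEWSHOT.items():
    results = lm_eval.simple_evaluate(
        model="hf",  # HuggingFace transformers backend
        model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16",
        tasks=[task],
        num_fewshot=n_shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```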

Dataset Format

Fine-tuning datasets are usually stored in .json format with the following form:

```json
{
  "prompt": "What are the three most important things to consider when deciding what technology to use to build an assistive device to help an elderly person with basic needs?",
  "response": "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things:..."
}
```
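
For fine-tuning, such a file can be loaded and flattened into training strings with the HuggingFace `datasets` library. The snippet below is a minimal sketch assuming the file holds a list of prompt/response objects (or one object per line); the file name `train.json` and the prompt template are placeholders, not fixed choices.

```python
# Minimal sketch: loading a prompt/response .json dataset for fine-tuning.
# "train.json" and the prompt template below are placeholders.
from datasets import load_dataset

# Works for a JSON array of objects or for JSON Lines (one object per line).
dataset = load_dataset("json", data_files="train.json", split="train")

def to_text(example):
    # Merge prompt and response into one training string; the exact template
    # should follow the chosen base model's expected instruction format.
    return {
        "text": f"### Prompt:\n{example['prompt']}\n\n### Response:\n{example['response']}"
    }

dataset = dataset.map(to_text)
print(dataset[0]["text"][:200])
```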