
Conduct Large Language Base Model Selection Process

  • Question: Identify and select the most suitable Large Language Base Model (LLM) for our project through a structured evaluation process. The selected LLM should excel in:

    • Question Answering
    • Text Generation

    The evaluation process should meet the following criteria:

    • Aligns with the project requirements (PDF link).
    • Covers general criteria (evaluation matrix, scoring system) and/or quantitative benchmark tests (ARC, HellaSwag, MMLU).
    • References frameworks such as the Eleuther AI Language Model Evaluation Harness and the HuggingFace LLM Leaderboard (see the evaluation sketch after the comparison table).
    • Evaluates at least 3 base models based on:
      • Language support
      • License
      • Model size
      • Achievability
      • Complexity of prompts for training
      • Dataset requirements for retraining based on our use case
  • Results:

    Comparison of Candidate Models

Here is a comparison of some prominent base LLMs (models that are already fine-tuned are marked with *). The following points were considered when selecting the models:

  • The model should not be too large, so that fine-tuning stays manageable. As the number of parameters grows, so does the amount of data (number of rows) required for fine-tuning. Most of these models also have larger versions, so if we find we can gather enough data, we can switch to those.
  • The model should have an active, well-known community, so that if we run into an error later we have somewhere to discuss it.
  • The model should score well on the HumanEval programming benchmark. Since we deal with a large number of .yml files that define project pipelines, it is crucial that the model performs well on programming tasks.
  • Benchmarks may not be fully reliable: in HuggingFace community discussions, people have reported poor real-world performance from models with strong benchmark scores.

Considering the above, the proposed models to start with would be:

  1. Llama3_8b
  2. Gemma_7b
  3. Llama3_70b
| Model | Model size | HuggingFace Avg | ARC | HellaSwag | MMLU | HumanEval | AGIEval (chat) | License |
|---|---|---|---|---|---|---|---|---|
| Gemma | 7b | 64.3 | 61 | 82.5 | 66 | 32.3 | 41.7 | Gemma (allows redistribution provided the required notice text is included) |
| Llama3 | 70b | 77.8 | 71.42 | 85.7 | 80 | 81.7 | 63 | Llama (allows redistribution provided the required notice text is included) |
| Llama3-instruct | 8b | 66.8 | 60.7 | 78.5 | 67.07 | 62.2 | | Llama (allows redistribution provided the required notice text is included) |
| Mistral | 7b | 61 | 60 | 83 | 64 | | | Apache 2.0 |
| Calme-7B-Instruct-v0.9* | 7b | 76 | 73 | 89 | 64 | | | Apache 2.0 |
| Mixtral-8x22b-Instruct* | 141b | 79.1 | 72.7 | 89 | 77.7 | | | Apache 2.0 |
| Zephyr-orpo-141b-A35b* | 141b | NA | NA | NA | NA | NA | 44.16 | Apache 2.0 |
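
To sanity-check these figures on our own hardware, the same benchmarks can be re-run with the Eleuther AI Language Model Evaluation Harness referenced in the criteria above. The sketch below is illustrative only: `lm_eval.simple_evaluate` is the harness's Python API, but the model identifier, few-shot counts (chosen to roughly mirror the HuggingFace leaderboard setup), and batch size are assumptions rather than project decisions.

```python
# Illustrative sketch: re-running leaderboard-style benchmarks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval).
# The model name, few-shot counts, and batch size below are assumptions.
import lm_eval

# Few-shot settings roughly mirroring the HuggingFace leaderboard setup.
TASK_FEWSHOT = {"arc_challenge": 25, "hellaswag": 10, "mmlu": 5}

for task, n_shots in TASK_FEWSHOT.items():
    results = lm_eval.simple_evaluate(
        model="hf",  # HuggingFace transformers backend
        model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16",
        tasks=[task],
        num_fewshot=n_shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```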

Dataset Format

Fine-tuning datasets are usually stored in .json format with the following form:

```json
{
  "prompt": "What are the three most important things to consider when deciding what technology to use to build an assistive device to help an elderly person with basic needs?",
  "response": "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things:..."
}
```
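
For fine-tuning, such a file can be loaded and flattened into training strings with the HuggingFace `datasets` library. The snippet below is a minimal sketch assuming the file holds a list of prompt/response objects (or one object per line); the file name `train.json` and the prompt template are placeholders, not fixed choices.

```python
# Minimal sketch: loading a prompt/response .json dataset for fine-tuning.
# "train.json" and the prompt template below are placeholders.
from datasets import load_dataset

# Works for a JSON array of objects or for JSON Lines (one object per line).
dataset = load_dataset("json", data_files="train.json", split="train")

def to_text(example):
    # Merge prompt and response into one training string; the exact template
    # should follow the chosen base model's expected instruction format.
    return {
        "text": f"### Prompt:\n{example['prompt']}\n\n### Response:\n{example['response']}"
    }

dataset = dataset.map(to_text)
print(dataset[0]["text"][:200])
```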