Conduct Large Language Base Model Selection Process
Question: Identify and select the most suitable Large Language Base Model (LLM) for our project through a structured evaluation process. The selected LLM should excel in:
- Question Answering
- Text Generation
The evaluation process should meet the following criteria:
- Aligns with the project requirements (PDF link).
- Covers general criteria (evaluation matrix, scoring system) and/or quantitative benchmark tests (ARC, HellaSwag, MMLU).
- References frameworks such as the Eleuther AI Language Model Evaluation Harness and the HuggingFace LLM Leaderboard (see the example sketch after this list).
- Evaluates at least 3 base models based on:
- Language support
- License
- Model size
- Achievability
- Complexity of prompts for training
- Dataset requirements for retraining based on our use case
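As a concrete starting point for the quantitative benchmark tests, the Eleuther AI Language Model Evaluation Harness can be driven from Python. The snippet below is a minimal sketch, assuming the `lm-eval` package (v0.4 or later) is installed and the candidate checkpoint is available on HuggingFace; the model name and batch size are placeholders, not project decisions.

```python
# Minimal sketch: score one candidate model on ARC, HellaSwag and MMLU with the
# EleutherAI lm-evaluation-harness. Assumes `pip install lm-eval` and a
# HuggingFace-hosted checkpoint; model name and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # HuggingFace transformers backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B",  # candidate base model
    tasks=["arc_challenge", "hellaswag", "mmlu"],        # leaderboard-style tasks
    batch_size=8,
)

# Per-task metrics that would feed the comparison table below.
for task, metrics in results["results"].items():
    print(task, metrics)
```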
-
Results:
Here is a comparison of several prominent base LLMs (models that are already finetuned are marked with *). The following points were considered when selecting the models:
- The model should not be huge, so that it is easier to finetune. As the number of parameters grows, the amount of data (number of data rows) required for finetuning grows as well. However, most of these models also come in larger versions; if we find we can gather enough data, we can opt for those.
- The model should have an active, well-known community, so that if we run into an error later we have somewhere to discuss it.
- The model should score well on the HumanEval programming benchmark. Since we are dealing with a large number of .yml files that form project pipelines, it is crucial that our model performs well on programming tasks.
- Benchmarks may not be fully accurate: in HuggingFace community discussions, people have reported poor performance from models with strong benchmark scores.
Considering the above, the proposed models to start with would be:
- Llama3_8b
- Gemma_7b
- Llama3_70b
| Model | Model size | HuggingFace Avg | ARC | HellaSwag | MMLU | HumanEval | AGIEval (chat) | License |
|---|---|---|---|---|---|---|---|---|
| Gemma | 7b | 64.3 | 61 | 82.5 | 66 | 32.3 | 41.7 | Gemma (allows redistribution provided the required notice text is included) |
| Llama3 | 70b | 77.8 | 71.42 | 85.7 | 80 | 81.7 | 63 | Llama (allows redistribution provided the required notice text is included) |
| Llama3-instruct | 8b | 66.8 | 60.7 | 78.5 | 67.07 | 62.2 | | Llama (allows redistribution provided the required notice text is included) |
| Mistral | 7b | 61 | 60 | 83 | 64 | | | Apache 2.0 |
| Calme-7B-Instruct-v0.9* | 7b | 76 | 73 | 89 | 64 | | | Apache 2.0 |
| Mixtral-8x22b-Instruct* | 141b | 79.1 | 72.7 | 89 | 77.7 | | | Apache 2.0 |
| Zephyr-orpo-141b-A35b* | 141b | NA | NA | NA | NA | NA | 44.16 | Apache 2.0 |
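To combine the table above with the qualitative criteria into a single ranking, a simple weighted scoring matrix can be used, with HumanEval weighted highest because generating pipeline .yml files is essentially a programming task. The sketch below uses the benchmark numbers from the table; the weights are illustrative assumptions, not agreed project values.

```python
# Illustrative weighted scoring matrix over the benchmark numbers from the table above.
# The weights are assumptions chosen for illustration, not agreed project values.
models = {
    "Gemma-7b":           {"arc": 61.0,  "hellaswag": 82.5, "mmlu": 66.0,  "humaneval": 32.3},
    "Llama3-70b":         {"arc": 71.42, "hellaswag": 85.7, "mmlu": 80.0,  "humaneval": 81.7},
    "Llama3-8b-instruct": {"arc": 60.7,  "hellaswag": 78.5, "mmlu": 67.07, "humaneval": 62.2},
}

# HumanEval weighted highest because the use case (.yml pipeline generation) is programming-like.
weights = {"arc": 0.2, "hellaswag": 0.2, "mmlu": 0.25, "humaneval": 0.35}

def score(benchmarks):
    """Weighted average; all benchmark values are already on a 0-100 scale."""
    return sum(weights[name] * benchmarks[name] for name in weights)

for model, benchmarks in sorted(models.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{model}: {score(benchmarks):.1f}")
```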
Finetuning datasets are usually provided in .json format, with records of the following form:

```json
{
  "prompt": "What are the three most important things to consider when deciding what technology to use to build an assistive device to help an elderly person with basic needs?",
  "response": "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things:..."
}
```
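Such a prompt/response dataset can be loaded and reshaped for finetuning with the HuggingFace `datasets` library. The sketch below assumes a local file named `finetune_data.json` containing an array of such records and uses a generic instruction template; both the file name and the template are placeholders, since the exact prompt format depends on the base model we pick.

```python
# Sketch: load prompt/response records and render them into a single training text field.
# Assumes `pip install datasets` and a local finetune_data.json holding an array of
# {"prompt": ..., "response": ...} objects (the file name is a placeholder).
from datasets import load_dataset

dataset = load_dataset("json", data_files="finetune_data.json", split="train")

def to_text(example):
    # Generic instruction-style template; the real template depends on the chosen base model.
    example["text"] = f"### Question:\n{example['prompt']}\n\n### Answer:\n{example['response']}"
    return example

dataset = dataset.map(to_text)
print(dataset[0]["text"])
```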
- Link to Original Issue: Conduct Large Language Base Model Selection Process Issue #19
- Original Assignee: anosh-ar