- Download Llama 2 weights from Meta. This project supports 7B, 7B-chat, 13B, 13B-chat, 70B and 70B-chat models.
- Open the `llama-2-7b/params.json` file:
  - replace `"vocab_size": -1` with `"vocab_size": 32000`,
  - add a new property: `"max_seq_len": 2048`.
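The two `params.json` edits above can also be scripted. A minimal sketch (the path argument is whatever your checkout uses):

```python
import json

def patch_params(path, vocab_size=32000, max_seq_len=2048):
    """Fix vocab_size and add max_seq_len in a Llama 2 params.json."""
    with open(path) as f:
        params = json.load(f)
    params["vocab_size"] = vocab_size    # replaces the -1 placeholder
    params["max_seq_len"] = max_seq_len  # new property
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    return params

# patch_params("llama-2-7b/params.json")
```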
- Install dependencies of the converter:
```sh
cd converter && pip install -r requirements.txt
```
- Convert weights to the Distributed Llama format. This will take a bit of time. The script requires Python 3.
```sh
python convert-llama.py /path/to/meta/llama-2-7b q40
```
- Download the tokenizer for Llama 2:
```sh
wget https://huggingface.co/b4rtaz/llama-2-distributed-llama/resolve/main/dllama-llama2-tokenizer.t
```
- Build the project:
```sh
make dllama
```
- Run:
```sh
./dllama inference --model dllama_llama-2-7b_q40.bin --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
```

In the table below, you can find the expected sizes of the converted weights for different floating-point types.
| Model | Original size | Float32 | Float16 | Q40 |
|---|---|---|---|---|
| Llama 2 7B | 13.48 GB | 25.10 GB | | 3.95 GB |
| Llama 2 13B | 26.03 GB | | | 7.35 GB |
| Llama 2 70B | 137.97 GB | | | 36.98 GB |
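The Q40 column is roughly 3.4× smaller than the 16-bit original (e.g. 13.48 GB / 3.95 GB ≈ 3.4 for 7B). If Distributed Llama's `q40` follows the common Q4_0-style block layout (an assumption here: 32 weights per block, one float16 scale plus 32 packed 4-bit values, so 18 bytes per 32 weights ≈ 4.5 bits/weight), the idea can be sketched as:

```python
import numpy as np

BLOCK = 32  # weights per block (Q4_0-style layout; an assumption, not the confirmed dllama format)

def q40_quantize(w):
    """Quantize one block of 32 floats to a float16 scale + 32 signed 4-bit ints."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    # Storage: 2 bytes (scale) + 32 nibbles (16 bytes) = 18 bytes per block,
    # vs 64 bytes for 32 float16 weights -> about 3.6x compression.
    return np.float16(scale), q

def q40_dequantize(scale, q):
    """Reconstruct approximate float weights from one quantized block."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal(BLOCK).astype(np.float32)
scale, q = q40_quantize(w)
w_hat = q40_dequantize(scale, q)
```

The 64/18 ≈ 3.6× ratio is consistent with the ~3.4× shrink in the table above (the small gap is plausibly non-quantized tensors such as norms).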
You can skip steps 1-9 below by downloading the converted Llama 3 8B Q40 model from here.

- Get access to the model on the Llama 3 website.
- Clone the https://github.com/meta-llama/llama3 repository.
- Run the `download.sh` script to download the model.
- For the Llama 3 8B model you should have the following files:
```
Meta-Llama-3-8B/consolidated.00.pth
Meta-Llama-3-8B/params.json
Meta-Llama-3-8B/tokenizer.model
```
- Open `params.json` and add a new property: `"max_seq_len": 8192`.
- Clone the https://github.com/b4rtaz/distributed-llama.git repository.
- Install dependencies of the converter:
```sh
cd converter && pip install -r requirements.txt
```
- Convert the model to the Distributed Llama format:
```sh
python converter/convert-llama.py path/to/Meta-Llama-3-8B q40
```
- Convert the tokenizer to the Distributed Llama format:
```sh
python converter/convert-tokenizer-llama3.py path/to/tokenizer.model
```
- Build the project:
```sh
make dllama
```
- Run Distributed Llama:
```sh
./dllama inference --weights-float-type q40 --buffer-float-type q80 --prompt "My name is" --steps 128 --nthreads 8 --model dllama_meta-llama-3-8b_q40.bin --tokenizer llama3-tokenizer.t
```
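The preparation steps above (required files present, `max_seq_len` added to `params.json`) can be sanity-checked with a short script. A sketch, assuming the `Meta-Llama-3-8B` directory layout listed earlier:

```python
import json
import os

# File names taken from the Llama 3 8B listing above.
REQUIRED = ("consolidated.00.pth", "params.json", "tokenizer.model")

def prepare_llama3(model_dir, max_seq_len=8192):
    """Verify the downloaded files exist and add max_seq_len to params.json."""
    missing = [f for f in REQUIRED
               if not os.path.exists(os.path.join(model_dir, f))]
    if missing:
        raise FileNotFoundError(f"missing in {model_dir}: {missing}")
    path = os.path.join(model_dir, "params.json")
    with open(path) as f:
        params = json.load(f)
    params["max_seq_len"] = max_seq_len  # new property needed by the converter
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    return params

# prepare_llama3("path/to/Meta-Llama-3-8B")
```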