Model Format Conversion, Quantization, and Inference with llama.cpp

1. Introduction to llama.cpp

llama.cpp is an open-source project designed for deploying quantized models on local CPUs. It provides a simple and efficient way to convert a trained model into a quantized, low-footprint version that can run inference on the CPU.

1.1 How It Works

At the core of llama.cpp is an optimized quantized-inference engine that executes quantized models efficiently on the CPU. It relies on a series of optimizations, such as computing with integer (fixed-point) arithmetic instead of floating point, batch processing, and cache-friendly memory access, to increase inference speed and reduce power consumption.
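
As a rough intuition for why integer arithmetic helps, here is a toy sketch (plain NumPy, not llama.cpp's actual kernels): weights and activations are mapped to small integers with a per-vector scale, the expensive inner product runs on integers, and the scales are applied once at the end.

import numpy as np

def quantize_int8(x: np.ndarray):
    """Map a float vector to int8 values plus a single scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)   # "weights"
a = rng.standard_normal(256).astype(np.float32)   # "activations"

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

int_dot = int(np.dot(qw.astype(np.int32), qa.astype(np.int32)))  # integer math only
approx = int_dot * sw * sa                                        # apply scales once
print(f"quantized: {approx:.3f}  float: {np.dot(w, a):.3f}")      # results are close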

1.2 Advantages

  • High performance: llama.cpp is optimized for the CPU and delivers efficient inference while preserving accuracy.
  • Low resource usage: thanks to quantization, llama.cpp significantly reduces the storage and compute a model requires.
  • Easy integration: llama.cpp exposes concise APIs and interfaces that make it easy to embed in your own projects (see the sketch after this list).
  • Cross-platform support: llama.cpp runs on many operating systems and CPU architectures and is highly portable.
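
For example, a minimal integration sketch from Python using the community llama-cpp-python binding (pip install llama-cpp-python), which wraps the same GGUF models used later in this article; the model path assumes the Q4_K_M file produced in section 3.2:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_threads=8,      # CPU threads
)

# MiniCPM expects the <用戶>...<AI> prompt format used throughout this article.
out = llm("<用戶>介紹一下MiniCPM<AI>", max_tokens=128)
print(out["choices"][0]["text"])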

1.3 Use Cases

llama.cpp fits any scenario that needs to deploy a quantized model, such as smart-home devices, IoT hardware, and edge computing. In these resource-constrained environments it helps developers achieve real-time inference with high energy efficiency.

2. Download and Build

2.1 Download

git clone https://github.com/ggerganov/llama.cpp

2.2 Build

cd llama.cpp
make

Directory contents before make:

[screenshot: directory listing before running make]

Directory contents after make:

[screenshot: directory listing after running make]

After make, the directory contains a set of new llama-* executables (such as llama-cli, llama-quantize, and llama-server) that are used for the model operations below.

3. LLM Operations

This article experiments with OpenBMB's MiniCPM-2B-sft-bf16. llama.cpp maintains a list of supported model architectures, and the conversion script accepts models in PyTorch .bin format as well as Hugging Face .safetensors format; download a model from the supported list and proceed.
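
One way to fetch the weights into the models/ directory (a sketch; assumes the huggingface_hub package and the openbmb/MiniCPM-2B-sft-bf16 repo id on Hugging Face):

from huggingface_hub import snapshot_download

# Download the full model repository (config, tokenizer, pytorch_model.bin)
snapshot_download(
    repo_id="openbmb/MiniCPM-2B-sft-bf16",
    local_dir="./models/MiniCPM-2B-sft-bf16",
)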

3.1 Format Conversion

Format conversion turns the downloaded model into the GGUF format. The convert_hf_to_gguf.py script reads the model configuration, tokenizer, and tensor names plus data, and converts them into GGUF metadata and tensors so that inference can run quickly on the CPU, without needing a GPU.

GGUF (GPT-Generated Unified Format) is a model file format defined and published by Georgi Gerganov.
It is designed for fast loading and saving, supports a wide range of models, and allows new features to be added while remaining compatible.
The GGUF format is built for storing models for inference, and is especially suited to language models such as GPT.

Conversion command:

python3 convert_hf_to_gguf.py ./models/MiniCPM-2B-sft-bf16/

INFO:hf-to-gguf:Loading model: MiniCPM-2B-sft-bf16
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {2304, 122753}
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2304}
........
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if message['role'] == 'user' %}{{'<用戶>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf: n_tensors = 362, total_size = 5.5G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.45G/5.45G [00:11<00:00, 456Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf

可以看到,在執(zhí)行轉(zhuǎn)換后,會在model目錄下生成對應的F16 gguf文件,大小約為5.45G

3.2 Quantization

Quantization mainly reduces the hardware requirements of inference and improves inference efficiency, at the cost of some model accuracy: precision of the model parameters is traded for inference speed.

Quantize the model with llama-quantize.
Quantized model names follow the pattern Q + bit width + variant (for example Q4_K_M). The fewer the bits, the lower the hardware requirements and the faster the inference, but the lower the model accuracy. A conceptual sketch of block quantization follows.
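
A deliberately simplified sketch of what 4-bit block quantization does (not llama.cpp's actual Q4_K_M scheme, which additionally stores per-block minimums and super-block scales): each small block of weights is reduced to 4-bit integers plus one scale, shrinking 32 bf16 weights (64 bytes) to 16 bytes of packed values plus a few bytes of scale.

import numpy as np

def quantize_block_4bit(block: np.ndarray):
    """Quantize one block of weights to 4-bit integers plus a float scale."""
    scale = np.abs(block).max() / 7.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)   # one 32-weight block
q, scale = quantize_block_4bit(block)
restored = dequantize_block_4bit(q, scale)
print("max abs error:", float(np.abs(block - restored).max()))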

Quantization command:

./llama-quantize ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf Q4_K_M

main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: quantizing './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf' to './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
........
llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
[ 354/ 362]              blk.39.attn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 355/ 362]                 blk.39.attn_q.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 356/ 362]                 blk.39.attn_k.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 357/ 362]                 blk.39.attn_v.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q6_K .. size =    10.12 MiB ->     4.15 MiB
[ 358/ 362]            blk.39.attn_output.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 359/ 362]               blk.39.ffn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 360/ 362]               blk.39.ffn_gate.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 361/ 362]                 blk.39.ffn_up.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 362/ 362]               blk.39.ffn_down.weight - [ 5760,  2304,     1,     1], type =    f16, 

llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
llama_model_quantize_internal: model size  =  5197.65 MB
llama_model_quantize_internal: quant size  =  1716.20 MB
llama_model_quantize_internal: WARNING: 40 of 281 tensor(s) required fallback quantization

main: quantize time = 29242.62 ms
main:    total time = 29242.62 ms

量化后的模型gguf文件為:CPM-2B-sft-Q4_K_M.gguf,大小為:1.8G

3.3 Inference

Inference command:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -n 128 --prompt "<用戶>你知道openmbmb么<AI>"

The inference run and its output:

Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: seed = 1725847164
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
......
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler constr: 
    logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 <用戶>你知道openmbmb么<AI> OpenMBMB是一個開源的、面向?qū)ο蟮亩嗾Z言模型框架,可以輕松地實現(xiàn)自然語言處理任務(wù)。 [end of text]

llama_perf_print:    sampling time =       2.35 ms /    36 runs   (    0.07 ms per token, 15286.62 tokens per second)
llama_perf_print:        load time =     513.93 ms
llama_perf_print: prompt eval time =     150.72 ms /    12 tokens (   12.56 ms per token,    79.62 tokens per second)
llama_perf_print:        eval time =    1178.25 ms /    23 runs   (   51.23 ms per token,    19.52 tokens per second)
llama_perf_print:       total time =    1334.43 ms /    35 tokens
Log end
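
For reference, the 23 generated tokens above took 1178.25 ms of eval time, i.e. 1178.25 / 23 ≈ 51.2 ms per token, or roughly 19.5 tokens per second for the Q4_K_M model on this CPU.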

The supported options can be listed with:

./llama-cli -h

-s,    --seed SEED                      RNG seed (default: -1, use random seed for < 0)
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
........

Conversation Mode:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -cnv

.....
.....
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> 你好
 你好!有什么我可以幫助您的嗎?

> 你是誰
 作為一個AI語言模型,我沒有個人身份或情感。我被設(shè)計為幫助回答問題和提供信息。我通過接受來自各種來源的數(shù)據(jù)來工作,這些數(shù)據(jù)來自互聯(lián)網(wǎng)、書籍、論文、數(shù)據(jù)庫和其他資源。我的目標是根據(jù)輸入提供有用和相關(guān)的答案。如果您有任何問題,請隨時問我!

> /bye
 再見!

3.4 API Server

llama.cpp provides an HTTP API that is compatible with the OpenAI API. Start the service with the llama-server binary produced by make:

./llama-server -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf --host 0.0.0.0 --port 1234

INFO [                    init] initializing slots | tid="140241878706112" timestamp=1725859292 n_slots=1
INFO [                    init] new slot | tid="140241878706112" timestamp=1725859292 id_slot=0 n_ctx_slot=4096
INFO [                    main] model loaded | tid="140241878706112" timestamp=1725859292
INFO [                    main] chat template | tid="140241878706112" timestamp=1725859292 chat_example="You are a helpful assistant<用戶>Hello<AI>Hi there<用戶>How are you?<AI>" built_in=true
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859292
INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 n_past=687 truncated=false
INFO [           print_timings] prompt eval time     =      59.60 ms /     2 tokens (   29.80 ms per token,    33.56 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 n_prompt_tokens_processed=2 t_token=29.8 n_tokens_second=33.557046979865774
INFO [           print_timings] generation eval time =   37964.14 ms /   686 runs   (   55.34 ms per token,    18.07 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_token_generation=37964.139 n_decoded=686 t_token=55.34131049562683 n_tokens_second=18.069684130068115
INFO [           print_timings]           total time =   38023.74 ms | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 t_token_generation=37964.139 t_total=38023.739
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859351
INFO [      log_server_request] request | tid="140241853523520" timestamp=1725859385 remote_addr="127.0.0.1" remote_port=49130 status=200 method="POST" path="/completion" params={}

A request can be made locally with curl:

curl --request POST --url http://localhost:1234/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "介紹一下MiniCpm"}'

Server-side log:

INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 n_past=581 truncated=false
INFO [           print_timings] prompt eval time     =      92.93 ms /     6 tokens (   15.49 ms per token,    64.56 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 n_prompt_tokens_processed=6 t_token=15.488666666666667 n_tokens_second=64.5633366332372
INFO [           print_timings] generation eval time =   31077.38 ms /   576 runs   (   53.95 ms per token,    18.53 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_token_generation=31077.377 n_decoded=576 t_token=53.95377951388889 n_tokens_second=18.53438274407779
INFO [           print_timings]           total time =   31170.31 ms | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 t_token_generation=31077.377 t_total=31170.309
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859466

Client-side response:

{"content":"\nMiniCpm是一種基于深度學習的超參數(shù)優(yōu)化方法,其核心思想是通過學習數(shù)據(jù)的統(tǒng)計特性,利用貝葉斯優(yōu)化算法進行超參數(shù)的搜索和優(yōu)化。
在MiniCpm中,超參數(shù)通常表示為一個概率分布的函數(shù),即P(參數(shù)|數(shù)據(jù))。通過學習數(shù)據(jù)的統(tǒng)計特性,MiniCpm可以找到最優(yōu)的P(參數(shù)|數(shù)據(jù)),
從而得到最佳的超參數(shù)。\n\n在MiniCpm中,首先需要定義一個貝葉斯優(yōu)化算法。常見的貝葉斯優(yōu)化算法有NUTS、SAM、Nelder-Mead等。
在MiniCpm中,我們使用NUTS算法作為貝葉斯優(yōu)化算法。NUTS算法通過從參數(shù)空間中隨機選擇一些候選參數(shù),計算出它們的期望值,
然后根據(jù)期望值計算出一個概率分布P(參數(shù)|數(shù)據(jù))。接著,根據(jù)P(參數(shù)|數(shù)據(jù))計算得到的新參數(shù)集合,再次計算出它們的期望值,
以此類推。重復這個過程,直到得到一個接近最優(yōu)的P(參數(shù)|數(shù)據(jù)),從而得到最佳的超參數(shù)。
\n\nMiniCpm的步驟如下:\n\n1. 定義一個貝葉斯優(yōu)化算法。在MiniCpm中,我們使用NUTS算法作為貝葉斯優(yōu)化算法。
\n\n2. 選擇一個合適的超參數(shù)搜索空間。超參數(shù)的搜索空間應該足夠大,以覆蓋數(shù)據(jù)的統(tǒng)計特性。
\n\n3. 初始化一個超參數(shù)搜索空間,通常是一個連續(xù)的參數(shù)空間。
\n\n4. 定義一個概率分布函數(shù),即P(參數(shù)|數(shù)據(jù))。在MiniCpm中,P(參數(shù)|數(shù)據(jù))通常表示為一個概率分布的函數(shù),即P(參數(shù)|數(shù)據(jù)) = P(參數(shù)|數(shù)據(jù))。
\n\n5. 選擇一個搜索策略,用于在超參數(shù)搜索空間中搜索最優(yōu)的超參數(shù)。常見的搜索策略有NUTS、SAM、Nelder-Mead等。在MiniCpm中,我們使用NUTS算法作為搜索策略。
\n\n6. 搜索超參數(shù)的過程。在搜索過程中,通過計算期望值得到新的參數(shù)集合,并重復計算期望值直到得到一個接近最優(yōu)的超參數(shù)。
\n\n7. 評估超參數(shù)的性能。通過計算目標函數(shù)的梯度,來評估超參數(shù)的性能。
\n\n8. 調(diào)整超參數(shù)的搜索策略。根據(jù)超參數(shù)的性能,調(diào)整搜索策略的參數(shù),以獲得更好的搜索效果。
\n\n9. 停止搜索。當超參數(shù)搜索空間變得非常小,或超參數(shù)的性能不再提高時,停止搜索。
\n\n10. 輸出最優(yōu)的超參數(shù)。根據(jù)搜索的結(jié)果,輸出最優(yōu)的超參數(shù)。
\n\n以上就是關(guān)于MiniCpm的基本知識點。在實際使用中,還需要根據(jù)具體的問題和數(shù)據(jù),選擇合適的超參數(shù)搜索空間和搜索策略,以獲得更好的效果。
","id_slot":0,"stop":true,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","tokens_predicted":576,
"tokens_evaluated":6,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","seed":1725859291,
"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,
"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,
"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,
"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},
"prompt":"介紹一下MiniCpm","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":581,"timings":{"prompt_n":6,"prompt_ms":92.932,"prompt_per_token_ms":15.488666666666667,"prompt_per_second":64.5633366332372,"predicted_n":576,"predicted_ms":31077.377,"predicted_per_token_ms":53.95377951388889,"predicted_per_second":18.53438274407779},"index":0}
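
Besides the native /completion endpoint, llama-server also exposes OpenAI-compatible routes such as /v1/chat/completions, so standard OpenAI clients can be pointed at it. A minimal sketch, assuming the openai Python package and the server started above on port 1234 (the api_key is a placeholder; llama-server does not require one by default):

from openai import OpenAI

# Point the official OpenAI client at the local llama-server instance.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="CPM-2B-sft-Q4_K_M",  # informational; the server answers with the loaded GGUF
    messages=[{"role": "user", "content": "介紹一下MiniCPM"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)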

The above is a brief walkthrough of using llama.cpp for model format conversion, quantization, and inference, recorded from a local run. The following two articles were consulted along the way, many thanks:
https://blog.csdn.net/abcd51685168/article/details/140806221
https://developer.baidu.com/article/details/3185708

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

推薦閱讀更多精彩內(nèi)容