Model Format Conversion, Quantization, and Inference with llama.cpp

1. Introduction to llama.cpp

llama.cpp is an open-source project designed for deploying quantized models on local CPUs. It provides a simple and efficient way to convert a trained model into a quantized, low-footprint version that can run inference on the CPU.

1.1 How It Works

At the core of llama.cpp is an optimized quantized-inference engine that executes quantized models efficiently on the CPU. It relies on a series of optimizations, such as computing with integer (fixed-point) arithmetic instead of floating point, batch processing, and cache-friendly memory access, to increase inference speed and reduce power consumption.
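
As a rough intuition for why integer arithmetic helps, here is a toy sketch (plain NumPy, not llama.cpp's actual kernels): weights and activations are mapped to small integers with a per-vector scale, the expensive inner product runs on integers, and the scales are applied once at the end.

import numpy as np

def quantize_int8(x: np.ndarray):
    """Map a float vector to int8 values plus a single scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)   # "weights"
a = rng.standard_normal(256).astype(np.float32)   # "activations"

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

int_dot = int(np.dot(qw.astype(np.int32), qa.astype(np.int32)))  # integer math only
approx = int_dot * sw * sa                                        # apply scales once
print(f"quantized: {approx:.3f}  float: {np.dot(w, a):.3f}")      # results are close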

1.2 Advantages

  • High performance: llama.cpp is optimized for the CPU and delivers efficient inference while preserving accuracy.
  • Low resource usage: thanks to quantization, llama.cpp significantly reduces the storage and compute a model requires.
  • Easy integration: llama.cpp exposes concise APIs and interfaces that make it easy to embed in your own projects (see the sketch after this list).
  • Cross-platform support: llama.cpp runs on many operating systems and CPU architectures and is highly portable.
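
For example, a minimal integration sketch from Python using the community llama-cpp-python binding (pip install llama-cpp-python), which wraps the same GGUF models used later in this article; the model path assumes the Q4_K_M file produced in section 3.2:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_threads=8,      # CPU threads
)

# MiniCPM expects the <用戶>...<AI> prompt format used throughout this article.
out = llm("<用戶>介紹一下MiniCPM<AI>", max_tokens=128)
print(out["choices"][0]["text"])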

1.3 Use Cases

llama.cpp fits any scenario that needs to deploy a quantized model, such as smart-home devices, IoT hardware, and edge computing. In these resource-constrained environments it helps developers achieve real-time inference with high energy efficiency.

2. Download and Build

2.1 Download

git clone https://github.com/ggerganov/llama.cpp

2.2 Build

cd llama.cpp
make

Directory contents before make:

[screenshot: directory listing before running make]

Directory contents after make:

[screenshot: directory listing after running make]

After make, the directory contains a set of new llama-* executables (such as llama-cli, llama-quantize, and llama-server) that are used for the model operations below.

3. LLM Operations

This article experiments with OpenBMB's MiniCPM-2B-sft-bf16. llama.cpp maintains a list of supported model architectures, and the conversion script accepts models in PyTorch .bin format as well as Hugging Face .safetensors format; download a model from the supported list and proceed.
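
One way to fetch the weights into the models/ directory (a sketch; assumes the huggingface_hub package and the openbmb/MiniCPM-2B-sft-bf16 repo id on Hugging Face):

from huggingface_hub import snapshot_download

# Download the full model repository (config, tokenizer, pytorch_model.bin)
snapshot_download(
    repo_id="openbmb/MiniCPM-2B-sft-bf16",
    local_dir="./models/MiniCPM-2B-sft-bf16",
)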

3.1 Format Conversion

Format conversion turns the downloaded model into the GGUF format. The convert_hf_to_gguf.py script reads the model configuration, tokenizer, and tensor names plus data, and converts them into GGUF metadata and tensors so that inference can run quickly on the CPU, without needing a GPU.

GGUF (GPT-Generated Unified Format) is a model file format defined and published by Georgi Gerganov.
It is designed for fast loading and saving, supports a wide range of models, and allows new features to be added while remaining compatible.
The GGUF format is built for storing models for inference, and is especially suited to language models such as GPT.

Conversion command:

python3 convert_hf_to_gguf.py ./models/MiniCPM-2B-sft-bf16/

INFO:hf-to-gguf:Loading model: MiniCPM-2B-sft-bf16
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {2304, 122753}
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2304}
........
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if message['role'] == 'user' %}{{'<用戶>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf: n_tensors = 362, total_size = 5.5G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.45G/5.45G [00:11<00:00, 456Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf

可以看到,在執(zhí)行轉(zhuǎn)換后,會在model目錄下生成對應的F16 gguf文件,大小約為5.45G

3.2 Quantization

Quantization mainly reduces the hardware requirements of inference and improves inference efficiency, at the cost of some model accuracy: precision of the model parameters is traded for inference speed.

Quantize the model with llama-quantize.
Quantized model names follow the pattern Q + bit width + variant (for example Q4_K_M). The fewer the bits, the lower the hardware requirements and the faster the inference, but the lower the model accuracy. A conceptual sketch of block quantization follows.
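
A deliberately simplified sketch of what 4-bit block quantization does (not llama.cpp's actual Q4_K_M scheme, which additionally stores per-block minimums and super-block scales): each small block of weights is reduced to 4-bit integers plus one scale, shrinking 32 bf16 weights (64 bytes) to 16 bytes of packed values plus a few bytes of scale.

import numpy as np

def quantize_block_4bit(block: np.ndarray):
    """Quantize one block of weights to 4-bit integers plus a float scale."""
    scale = np.abs(block).max() / 7.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)   # one 32-weight block
q, scale = quantize_block_4bit(block)
restored = dequantize_block_4bit(q, scale)
print("max abs error:", float(np.abs(block - restored).max()))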

Quantization command:

./llama-quantize ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf Q4_K_M

main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: quantizing './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf' to './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
........
llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
[ 354/ 362]              blk.39.attn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 355/ 362]                 blk.39.attn_q.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 356/ 362]                 blk.39.attn_k.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 357/ 362]                 blk.39.attn_v.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q6_K .. size =    10.12 MiB ->     4.15 MiB
[ 358/ 362]            blk.39.attn_output.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 359/ 362]               blk.39.ffn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 360/ 362]               blk.39.ffn_gate.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 361/ 362]                 blk.39.ffn_up.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 362/ 362]               blk.39.ffn_down.weight - [ 5760,  2304,     1,     1], type =    f16, 

llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
llama_model_quantize_internal: model size  =  5197.65 MB
llama_model_quantize_internal: quant size  =  1716.20 MB
llama_model_quantize_internal: WARNING: 40 of 281 tensor(s) required fallback quantization

main: quantize time = 29242.62 ms
main:    total time = 29242.62 ms

量化后的模型gguf文件為:CPM-2B-sft-Q4_K_M.gguf,大小為:1.8G

3.3 Inference

Inference command:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -n 128 --prompt "<用戶>你知道openmbmb么<AI>"

The inference run and its output:

Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: seed = 1725847164
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
......
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler constr: 
    logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 <用戶>你知道openmbmb么<AI> OpenMBMB是一個開源的、面向?qū)ο蟮亩嗾Z言模型框架,可以輕松地實現(xiàn)自然語言處理任務(wù)。 [end of text]

llama_perf_print:    sampling time =       2.35 ms /    36 runs   (    0.07 ms per token, 15286.62 tokens per second)
llama_perf_print:        load time =     513.93 ms
llama_perf_print: prompt eval time =     150.72 ms /    12 tokens (   12.56 ms per token,    79.62 tokens per second)
llama_perf_print:        eval time =    1178.25 ms /    23 runs   (   51.23 ms per token,    19.52 tokens per second)
llama_perf_print:       total time =    1334.43 ms /    35 tokens
Log end
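
For reference, the 23 generated tokens above took 1178.25 ms of eval time, i.e. 1178.25 / 23 ≈ 51.2 ms per token, or roughly 19.5 tokens per second for the Q4_K_M model on this CPU.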

The supported options can be listed with:

./llama-cli -h

-s,    --seed SEED                      RNG seed (default: -1, use random seed for < 0)
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
........

Conversation Mode:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -cnv

.....
.....
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> 你好
 你好!有什么我可以幫助您的嗎?

> 你是誰
 作為一個AI語言模型,我沒有個人身份或情感。我被設(shè)計為幫助回答問題和提供信息。我通過接受來自各種來源的數(shù)據(jù)來工作,這些數(shù)據(jù)來自互聯(lián)網(wǎng)、書籍、論文、數(shù)據(jù)庫和其他資源。我的目標是根據(jù)輸入提供有用和相關(guān)的答案。如果您有任何問題,請隨時問我!

> /bye
 再見!

3.4 API Server

llama.cpp provides an HTTP API that is compatible with the OpenAI API. Start the service with the llama-server binary produced by make:

./llama-server -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf --host 0.0.0.0 --port 1234

INFO [                    init] initializing slots | tid="140241878706112" timestamp=1725859292 n_slots=1
INFO [                    init] new slot | tid="140241878706112" timestamp=1725859292 id_slot=0 n_ctx_slot=4096
INFO [                    main] model loaded | tid="140241878706112" timestamp=1725859292
INFO [                    main] chat template | tid="140241878706112" timestamp=1725859292 chat_example="You are a helpful assistant<用戶>Hello<AI>Hi there<用戶>How are you?<AI>" built_in=true
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859292
INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 n_past=687 truncated=false
INFO [           print_timings] prompt eval time     =      59.60 ms /     2 tokens (   29.80 ms per token,    33.56 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 n_prompt_tokens_processed=2 t_token=29.8 n_tokens_second=33.557046979865774
INFO [           print_timings] generation eval time =   37964.14 ms /   686 runs   (   55.34 ms per token,    18.07 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_token_generation=37964.139 n_decoded=686 t_token=55.34131049562683 n_tokens_second=18.069684130068115
INFO [           print_timings]           total time =   38023.74 ms | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 t_token_generation=37964.139 t_total=38023.739
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859351
INFO [      log_server_request] request | tid="140241853523520" timestamp=1725859385 remote_addr="127.0.0.1" remote_port=49130 status=200 method="POST" path="/completion" params={}

A request can be made locally with curl:

curl --request POST --url http://localhost:1234/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "介紹一下MiniCpm"}'

Server-side log:

INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 n_past=581 truncated=false
INFO [           print_timings] prompt eval time     =      92.93 ms /     6 tokens (   15.49 ms per token,    64.56 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 n_prompt_tokens_processed=6 t_token=15.488666666666667 n_tokens_second=64.5633366332372
INFO [           print_timings] generation eval time =   31077.38 ms /   576 runs   (   53.95 ms per token,    18.53 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_token_generation=31077.377 n_decoded=576 t_token=53.95377951388889 n_tokens_second=18.53438274407779
INFO [           print_timings]           total time =   31170.31 ms | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 t_token_generation=31077.377 t_total=31170.309
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859466

Client-side response:

{"content":"\nMiniCpm是一種基于深度學習的超參數(shù)優(yōu)化方法,其核心思想是通過學習數(shù)據(jù)的統(tǒng)計特性,利用貝葉斯優(yōu)化算法進行超參數(shù)的搜索和優(yōu)化。
在MiniCpm中,超參數(shù)通常表示為一個概率分布的函數(shù),即P(參數(shù)|數(shù)據(jù))。通過學習數(shù)據(jù)的統(tǒng)計特性,MiniCpm可以找到最優(yōu)的P(參數(shù)|數(shù)據(jù)),
從而得到最佳的超參數(shù)。\n\n在MiniCpm中,首先需要定義一個貝葉斯優(yōu)化算法。常見的貝葉斯優(yōu)化算法有NUTS、SAM、Nelder-Mead等。
在MiniCpm中,我們使用NUTS算法作為貝葉斯優(yōu)化算法。NUTS算法通過從參數(shù)空間中隨機選擇一些候選參數(shù),計算出它們的期望值,
然后根據(jù)期望值計算出一個概率分布P(參數(shù)|數(shù)據(jù))。接著,根據(jù)P(參數(shù)|數(shù)據(jù))計算得到的新參數(shù)集合,再次計算出它們的期望值,
以此類推。重復這個過程,直到得到一個接近最優(yōu)的P(參數(shù)|數(shù)據(jù)),從而得到最佳的超參數(shù)。
\n\nMiniCpm的步驟如下:\n\n1. 定義一個貝葉斯優(yōu)化算法。在MiniCpm中,我們使用NUTS算法作為貝葉斯優(yōu)化算法。
\n\n2. 選擇一個合適的超參數(shù)搜索空間。超參數(shù)的搜索空間應該足夠大,以覆蓋數(shù)據(jù)的統(tǒng)計特性。
\n\n3. 初始化一個超參數(shù)搜索空間,通常是一個連續(xù)的參數(shù)空間。
\n\n4. 定義一個概率分布函數(shù),即P(參數(shù)|數(shù)據(jù))。在MiniCpm中,P(參數(shù)|數(shù)據(jù))通常表示為一個概率分布的函數(shù),即P(參數(shù)|數(shù)據(jù)) = P(參數(shù)|數(shù)據(jù))。
\n\n5. 選擇一個搜索策略,用于在超參數(shù)搜索空間中搜索最優(yōu)的超參數(shù)。常見的搜索策略有NUTS、SAM、Nelder-Mead等。在MiniCpm中,我們使用NUTS算法作為搜索策略。
\n\n6. 搜索超參數(shù)的過程。在搜索過程中,通過計算期望值得到新的參數(shù)集合,并重復計算期望值直到得到一個接近最優(yōu)的超參數(shù)。
\n\n7. 評估超參數(shù)的性能。通過計算目標函數(shù)的梯度,來評估超參數(shù)的性能。
\n\n8. 調(diào)整超參數(shù)的搜索策略。根據(jù)超參數(shù)的性能,調(diào)整搜索策略的參數(shù),以獲得更好的搜索效果。
\n\n9. 停止搜索。當超參數(shù)搜索空間變得非常小,或超參數(shù)的性能不再提高時,停止搜索。
\n\n10. 輸出最優(yōu)的超參數(shù)。根據(jù)搜索的結(jié)果,輸出最優(yōu)的超參數(shù)。
\n\n以上就是關(guān)于MiniCpm的基本知識點。在實際使用中,還需要根據(jù)具體的問題和數(shù)據(jù),選擇合適的超參數(shù)搜索空間和搜索策略,以獲得更好的效果。
","id_slot":0,"stop":true,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","tokens_predicted":576,
"tokens_evaluated":6,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","seed":1725859291,
"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,
"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,
"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,
"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},
"prompt":"介紹一下MiniCpm","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":581,"timings":{"prompt_n":6,"prompt_ms":92.932,"prompt_per_token_ms":15.488666666666667,"prompt_per_second":64.5633366332372,"predicted_n":576,"predicted_ms":31077.377,"predicted_per_token_ms":53.95377951388889,"predicted_per_second":18.53438274407779},"index":0}
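
Besides the native /completion endpoint, llama-server also exposes OpenAI-compatible routes such as /v1/chat/completions, so standard OpenAI clients can be pointed at it. A minimal sketch, assuming the openai Python package and the server started above on port 1234 (the api_key is a placeholder; llama-server does not require one by default):

from openai import OpenAI

# Point the official OpenAI client at the local llama-server instance.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="CPM-2B-sft-Q4_K_M",  # informational; the server answers with the loaded GGUF
    messages=[{"role": "user", "content": "介紹一下MiniCPM"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)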

The above is a brief walkthrough of using llama.cpp for model format conversion, quantization, and inference, recorded from a local run. The following two articles were consulted along the way, many thanks:
https://blog.csdn.net/abcd51685168/article/details/140806221
https://developer.baidu.com/article/details/3185708

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

推薦閱讀更多精彩內(nèi)容