A 2023 Guide to Getting Started with Deep Learning (4): Running Large Models on Your Own Computer

In the previous article we covered the foundations of large models: the self-attention mechanism and the Transformer module that implements it. Since the Transformer is supported by frameworks such as PyTorch and TensorFlow, all we have to do is set up GPU (or other accelerator) support for the framework and we can run it.
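As a quick refresher, those building blocks come ready-made in the frameworks. A minimal sketch in PyTorch (my own illustrative snippet, not code from the previous article):

import torch
import torch.nn as nn

# One Transformer encoder layer: multi-head self-attention plus a feed-forward network.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

x = torch.randn(1, 16, 512)   # (batch, sequence length, embedding dimension)
y = layer(x)                  # output has the same shape as the input
print(y.shape)                # torch.Size([1, 16, 512])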

Running a large model, however, is unlikely to be that easy; you may well need a Linux machine. Today's popular AI software generally depends on a large number of open-source tools, and especially when optimization is involved you will probably have to build from source. Once open-source software and source builds enter the picture, doing it on Windows turns into hard mode.

Most developers do their own work on open-source systems, so Windows support receives little attention, or none at all. From Cygwin, MinGW, and CMake to WSL, plenty of effort has gone into making the mass of Linux open-source libraries work on Windows, but just as Linux never gets as many games as Windows, this is an ecosystem problem.

So let's first pick a few projects with relatively good Windows compatibility, so that readers on Windows can also try out a large model on their own machines.

Nomic AI gpt4all (based on LLaMA)

After ChatGPT burst onto the scene at the end of 2022, Meta felt that OpenAI had strayed from the "open" in its name, and responded by semi-releasing its own large model, LLaMA. It is only semi-open because the network weight files have to be requested from Meta.
LLaMA was trained mainly on English material, along with some languages that use the Latin and Cyrillic alphabets. Its tokenizer can handle Chinese and Japanese, but no Chinese or Japanese material was used in training.

Since it is not open to everyone, there is little point in walking through LLaMA itself here. But we can try some projects built on top of LLaMA, such as Nomic AI's gpt4all.

A thoughtful touch of gpt4all is that it ships builds for three platforms, Windows, M1 Macs, and Intel Macs, in addition to Linux, which is of course supported by default. Better still, inference runs on the CPU alone.

Let's get it running.

First, download the gpt4all code:

git clone https://github.com/nomic-ai/gpt4all

Second, download the quantized network weight file: https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin

Third, put the downloaded gpt4all-lora-quantized.bin into the chat directory of the gpt4all repository.

Fourth, run the gpt4all-lora-quantized executable for your platform. On Windows, for example, that means running gpt4all-lora-quantized-win64.exe, either from PowerShell or by double-clicking it.

Once the model has finished loading, we can chat with gpt4all:

(screenshot: gpt4all.png — chatting with gpt4all)

Here is the C++ quicksort code gpt4all generated:

Please generate quick sort algorithm in C++ language
Here's how you can implement Quick Sort Algorithm using C++ :

// Implementation of QuickSort() function for array A[] with size N and pivot value P[N/2]
void QS(int *A, int n) { // Function prototype declaration
    if (n <=1 ) return;   /* If there is only one element in the input Array */
        // QuickSort() function for array A[] with size N and pivot value P[N/2]
        else{                      //else statement to check condition of quick sort algorithm.
            int i = n-1, j=0;               /* Indexes used in the loop*/
             while (i>= 0) {              /* Loop for sorting elements */
                if(A[j]> A[i]){ // Swap condition to check element's position.
                    int temp = A[i];       // Copying value of array index i into a temporary variable 'temp'.
                    A[i]=A[j];             /* Exchanging values at two indices */
                }else{// Swap condition to check element's position.
                    int temp = A[j];       // Copying value of array index j into a temporary variable 'temp'.
                    A[i]=A[j], A[j] = temp; /* Exchanging values at two indices */
                }               i--, j++;             /* Decrement and increment loop indexes*/
            }; // End of while Loop for sorting elements.
        }//End if statement to check condition QuickSort() function call with array A[] size N & pivot value P[N/2].

Congratulations! You have just got a pretrained large model of the same kind as ChatGPT running on your own computer.

Although this model is considerably less capable than ChatGPT, its base model comes from LLaMA, part of its training data was taken from ChatGPT conversation logs, and it uses low-rank adaptation (LoRA: Low-Rank Adaptation) and quantization to shrink the model. Each of these is a current state-of-the-art technique.
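To get a feel for the low-rank adaptation idea, here is a toy sketch (my own illustration, not the gpt4all or LLaMA code): the pretrained weight stays frozen, and only a small low-rank update B·A is trained on top of it.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B(A x)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))    # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the two small low-rank factors are trained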

The Tsinghua team's single-GPU large model: ChatGLM-6B

ChatGLM-6B from the Tsinghua team is another large model designed to run on a single GPU card: a 6.2-billion-parameter model whose foundation is ChatGLM, itself built on the 130-billion-parameter GLM-130B.

The figure below shows Stanford's October 2022 evaluation of the major large models of the time:


(figure: scores.png — evaluation scores of the major large models)

Judging from the numbers in the figure, although ChatGLM trails the InstructGPT-based ChatGPT, it holds up quite well against the other models.

Let's try out the compressed model built on ChatGLM and see how it does.

First, download the code:

git clone https://github.com/THUDM/ChatGLM-6B

Then, from inside the ChatGLM-6B directory, install its dependencies:

pip install -r requirements.txt

The Python API of gpt4all above does not support Windows, but ChatGLM-6B is built on the Hugging Face Transformers library and can be driven from Python on Windows just fine, so let's write some code to call it:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the 4-bit quantized ChatGLM-6B, in half precision on the GPU
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
model = model.eval()
# Ask it to implement quicksort in C++ ("用C++實現快速排序"); history=[] starts a new conversation
response, history = model.chat(tokenizer, "用C++實現快速排序", history=[])
print(response)

The output looks like this:

#include <iostream>
using namespace std;

void swap(int& a, int& b) {
    int temp = a;
    a = b;
    b = temp;
}

int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);

    for (int j = low; j < high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    swap(arr[i+1], arr[high]);
    return i+1;
}

void quicksort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quicksort(arr, low, pi - 1);
        quicksort(arr, pi + 1, high);
    }
}

int main() {
    int arr[] = {5, 2, 9, 1, 6, 3, 8};
    int n = sizeof(arr) / sizeof(arr[0]);

    quicksort(arr, 0, n-1);

    cout << arr[0] << endl;
    return 0;
}

Not bad, right? It already feels a bit like ChatGPT.
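Because model.chat also returns the accumulated conversation history, a follow-up question can reuse it so the model keeps the context. Continuing the snippet above (the follow-up prompt is just an example of mine):

# Second turn: pass the history returned by the first call so the model remembers the context.
response, history = model.chat(tokenizer, "Now explain the time complexity of this algorithm", history=history)
print(response)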

If your GPU support for PyTorch or TensorFlow is installed, this inference runs on the GPU. I chose the 4-bit quantized model, which uses the least GPU memory; if your graphics card is better, you can pick a model with a lower compression ratio.
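The idea behind the 4-bit quantization is simply to store each weight as a small integer plus a scale factor and convert back to floating point on the fly. A toy sketch of symmetric int4 quantization (my own illustration, not ChatGLM's actual kernels):

import torch

def quantize_int4(w: torch.Tensor):
    """Symmetric 4-bit quantization: map weights to integers in [-8, 7] plus one scale."""
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int4 values stored in int8
    return q, scale

def dequantize_int4(q: torch.Tensor, scale):
    return q.float() * scale

w = torch.randn(4, 4)
q, s = quantize_int4(w)
print((w - dequantize_int4(q, s)).abs().max())  # the quantization error stays small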

This is a good place to introduce the gateway of the Transformer era: Hugging Face. The transformers library that the import in the code above comes from is made by Hugging Face.

from transformers import AutoTokenizer, AutoModel
(figure: huggingface.png)

As the figure above shows, Hugging Face is essentially the distribution hub for Transformer models of every kind. Through the Hugging Face interface you can use practically every open-source large model.
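The same few lines work for just about any model on the Hub; only the model name changes. For instance, swapping in a small publicly hosted English model (gpt2 here, purely as an illustration of the uniform interface, not a model from this article):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generate a short continuation with the same generic API used for every Hub model.
inputs = tokenizer("Quick sort works by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))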

How a large model is forged

Although the network weights require an application, the model code of Meta's LLaMA is open source. Let's see how LLaMA's Transformer differs from the standard Transformer we built in the previous article:

class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = ParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
        )

We can see that, to support parallel training, Meta's fully connected layers use their own ColumnParallelLinear, and the word embedding layer is likewise their own parallel version, ParallelEmbedding.
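Conceptually, a column-parallel linear layer splits the weight matrix by columns across devices: each device computes its own slice of the output, and the slices are gathered (concatenated) at the end. A single-process toy version of the idea (not Meta's fairscale implementation):

import torch
import torch.nn as nn

class ToyColumnParallelLinear(nn.Module):
    """Split a linear layer's output dimension into `world_size` column shards.
    In real model parallelism each shard lives on its own GPU; here they are
    just separate sub-modules so the arithmetic is visible."""
    def __init__(self, in_dim, out_dim, world_size=2):
        super().__init__()
        assert out_dim % world_size == 0
        self.shards = nn.ModuleList(
            nn.Linear(in_dim, out_dim // world_size, bias=False) for _ in range(world_size)
        )

    def forward(self, x):
        # Each "device" produces its slice of the output; gathering = concatenation.
        return torch.cat([shard(x) for shard in self.shards], dim=-1)

layer = ToyColumnParallelLinear(8, 16, world_size=2)
print(layer(torch.randn(3, 8)).shape)  # torch.Size([3, 16])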

It also stacks a number of TransformerBlock layers according to the configured layer count.

Now let's look at this Block:

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim, hidden_dim=4 * args.dim, multiple_of=args.multiple_of
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out
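One difference from the standard block in the previous article is already visible here: instead of LayerNorm, the attention and feed-forward sub-layers are each preceded by RMSNorm, which skips mean-centering and only rescales by the root mean square. A minimal sketch of that formula:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: x / rms(x) * g, with no mean subtraction and no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight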

We also find that it does not use the standard multi-head attention module, but implements an attention class of its own.

class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()

        self.n_local_heads = args.n_heads // fs_init.get_model_parallel_world_size()
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()
        self.cache_v = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()

So after all the fuss, this is just multi-head attention with model parallelism and a cache added; K, V, and Q are wearing a disguise, but at heart it is still multi-head self-attention.
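The cache is still worth a second look: cache_k and cache_v hold the keys and values of all tokens processed so far, so each new decoding step only has to attend from the newest token instead of re-encoding the whole prefix. A stripped-down, single-head sketch of the idea (no parallelism, my own simplification):

import torch

def attend_with_cache(q_new, k_new, v_new, cache_k, cache_v, pos):
    """One decoding step: write this token's K/V into the cache at position `pos`,
    then attend over every position generated so far.
    q_new/k_new/v_new: (batch, 1, head_dim); caches: (batch, max_seq_len, head_dim)."""
    cache_k[:, pos] = k_new[:, 0]
    cache_v[:, pos] = v_new[:, 0]
    k = cache_k[:, : pos + 1]                        # keys of all tokens up to now
    v = cache_v[:, : pos + 1]
    scores = q_new @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v         # (batch, 1, head_dim)

batch, head_dim, max_len = 1, 8, 16
cache_k = torch.zeros(batch, max_len, head_dim)
cache_v = torch.zeros(batch, max_len, head_dim)
for pos in range(4):                                 # pretend to decode four tokens one by one
    q = k = v = torch.randn(batch, 1, head_dim)
    out = attend_with_cache(q, k, v, cache_k, cache_v, pos)
print(out.shape)                                     # torch.Size([1, 1, 8])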

Other interesting projects

LMFlow

LMFlow is another project that has become very popular recently. It comes from the Hong Kong University of Science and Technology, is open source across the whole pipeline, builds on LLaMA, and can be trained on a single 3090 GPU.

Its repository is at: https://github.com/OptimalScale/LMFlow

LMFlow's distinctive value at the moment is that the pipeline it provides is fairly complete.

For example, among current open-source projects, LMFlow is one of the few that provides instruction tuning.

Let's look at an instruction tuning example:

{"id": 0, "instruction": "The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words.", "input": "If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.", "infer30b_before_item": " Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n Output: Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated,", "infer30b_after_item": " \n Output: If you have any questions about my rate or need to adjust the scope for this project, please let me know. \n\n", "infer13b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n", "infer13b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n", "infer7b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. 
If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nOutput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by", "infer7b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n"}

This gives us a glimpse of how this kind of rewriting correction is actually done, which is exactly what the base LLaMA lacks.
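During training, each record like this is typically flattened into one prompt-plus-answer string. A sketch of one common template (the exact wording is my assumption, not necessarily what LMFlow uses):

def build_example(record):
    """Turn an {instruction, input, output} record into a prompt/completion pair
    for supervised fine-tuning. The template below is only illustrative."""
    prompt = (
        "Below is an instruction that describes a task, paired with an input.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        "### Response:\n"
    )
    return prompt, record["output"]

record = {
    "instruction": "Rewrite the sentence and make your writing clearer by keeping it concise.",
    "input": "If you have any questions about my rate or if you find it necessary to "
             "increase or decrease the scope for this project, please let me know.",
    "output": "If you have any questions about my rate or need to adjust the scope, please let me know.",
}
prompt, completion = build_example(record)
print(prompt + completion)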

HuggingGPT

Recently, a team from Zhejiang University and Microsoft released the JARVIS project (HuggingGPT), which makes full use of Hugging Face's position as the central hub.

(figure: overview.jpg)

Unfortunately, the two projects above, along with the more advanced uses of the earlier ones, are hard to pull off on Windows. We will cover these Linux-only experiments together in a later article.

Summary

  1. With pruning, rank reduction, quantization, and similar techniques, we can run inference on resource-constrained computers. Performance does suffer, of course; we have to balance that against the business scenario, and if prompt engineering alone solves the problem, so much the better.
  2. Hugging Face is both the programming interface to pretrained large models and the hub where they are collected and distributed.
  3. The basic principle behind large models is still the self-attention model we studied in the previous article.
