BERT Testing in Practice

Runtime environment: NF5288M5

Runtime environment: NF5488M5

Google BERT pre-training task

http://www.lxweimin.com/p/22e462f01d8c

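Sample generation

The BERT project ships a script for generating pre-training data. Its inputs are the text to train on (a plain-text file) and a vocabulary file; its output is a TFRecord file. The input text has one sentence per line, with a blank line between paragraphs or documents.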

# Input file format:

(1) One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).

(2) Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.

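The data-generation script loads the entire input file into memory before processing it. If the file is too large, split it and run the script several times, producing a separate TFRecord file each time.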

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
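Since the script holds the whole input in memory, a large corpus has to be cut into pieces first. Below is a minimal sketch of one way to do that while keeping documents intact. The function names (`split_documents`, `shard_documents`, `render_shard`) and the `max_sentences` limit are my own illustration and are not part of the BERT repository.

```python
from typing import Iterable, Iterator, List


def split_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield documents as lists of sentences; blank lines mark document boundaries."""
    doc: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            doc.append(sentence)
        elif doc:
            yield doc
            doc = []
    if doc:
        yield doc


def shard_documents(docs: Iterable[List[str]], max_sentences: int) -> Iterator[List[List[str]]]:
    """Group whole documents into shards of at most max_sentences sentences.

    A document longer than max_sentences becomes its own shard, so a document
    is never cut in the middle.
    """
    shard: List[List[str]] = []
    count = 0
    for doc in docs:
        if shard and count + len(doc) > max_sentences:
            yield shard
            shard, count = [], 0
        shard.append(doc)
        count += len(doc)
    if shard:
        yield shard


def render_shard(shard: List[List[str]]) -> str:
    """Serialize a shard back to one sentence per line, blank line between documents."""
    return "\n\n".join("\n".join(doc) for doc in shard) + "\n"
```

Each rendered shard can then be written to its own text file and passed to the script as a separate `--input_file`, giving one TFRecord file per shard.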

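While generating the data I kept getting a warning that the generated data was empty. Stepping through the script line by line showed that the input file could not be read, even though the path itself looked fine. The cause turned out to be a stray space at the end of the file name.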
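A small pre-flight check would have caught this earlier. The sketch below is only an illustration; `check_input_path` is a hypothetical helper of my own, not something the BERT scripts provide.

```python
import os


def check_input_path(path: str) -> str:
    """Return the path with surrounding whitespace removed, failing early if it is missing."""
    cleaned = path.strip()
    if cleaned != path:
        # This is the situation described above: an invisible trailing space.
        print(f"warning: path had surrounding whitespace: {path!r}")
    if not os.path.isfile(cleaned):
        raise FileNotFoundError(f"input file not found: {cleaned!r}")
    return cleaned
```

Printing the path with `!r` makes a trailing space visible, which a plain print does not.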

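Training steps

To train from scratch, leave out the init_checkpoint parameter. The parameters below are: input_file, the pre-training dataset, i.e. the TFRecord file produced in the previous step; output_dir, the directory for logs and model checkpoints; do_train and do_eval, whether to run training and evaluation (at least one of them must be True); bert_config_file, the parameters needed to build the BERT model (this JSON file is included in the downloaded model files); init_checkpoint, the checkpoint that training starts from. The remaining parameters are the batch size, the maximum sequence length, the maximum number of predicted (masked) tokens per sequence, the number of training steps, the number of learning-rate warmup steps, and the initial learning rate.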

python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
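To see how num_train_steps, num_warmup_steps and learning_rate interact, here is a minimal sketch of the schedule as I understand it from the repository's optimizer code: a linear decay to zero over all training steps, overridden by a linear ramp during warmup. `learning_rate_at` is my own name, and the exact behavior should be checked against the version of the code you run.

```python
def learning_rate_at(step: int,
                     init_lr: float = 2e-5,
                     num_train_steps: int = 20,
                     num_warmup_steps: int = 10) -> float:
    """Learning rate at a given global step (assumed schedule, see lead-in)."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up towards init_lr.
        return init_lr * step / num_warmup_steps
    # After warmup: linear decay from init_lr at step 0 to 0 at num_train_steps.
    clipped = min(step, num_train_steps)
    return init_lr * (1.0 - clipped / num_train_steps)


if __name__ == "__main__":
    for s in range(0, 21, 5):
        print(s, learning_rate_at(s))
```

With the toy values from the command above (20 steps, 10 of them warmup), half of the run is warmup, so these numbers are only good for a smoke test.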

1) The input is TFRecord data, which can be generated with the script from the sample generation step.

Each TFRecord example contains the following fields:

input_ids: [101, 1131, 3090, 1106, 9416, 1103, 18127, 103, 117, 1115, 103, 1821, 170, 14798, 103, 4267, 20394, 1785, 2111, 103, 102, 170, 4984, 2851, 117, 178, 1821, 117, 6442, 106, 112, 1598, 1119, 103, 8228, 8788, 103, 117, 15992, 103, 8290, 3472, 118, 118, 112, 103, 4984, 2851, 106, 103, 118, 4984, 117, 1191, 1103, 103, 1104, 103, 103, 2621, 1104, 1103, 27466, 17893, 117, 1621, 1103, 16358, 5700, 1104, 1103, 2211, 1362, 118, 118, 5750, 117, 1256, 1154, 103, 16358, 1403, 118, 15398, 2111, 119, 1218, 117, 1170, 1155, 117, 178, 6111, 1437, 1128, 1103, 1236, 1106, 1103, 19026, 112, 188, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Input_mask:: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Segement id:: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

masked_lm_positions: [4, 7, 10, 14, 16, 19, 33, 36, 39, 45, 49, 55, 57, 58, 79, 92, 0, 0, 0, 0]

masked_lm_ids: [1143, 1864, 178, 1104, 6871, 119, 117, 1193, 1117, 170, 118, 14931, 5027, 1209, 1103, 1209, 0, 0, 0, 0]

masked_lm_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

next_sentence_labels: 0
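The relationship between the three sequence-length fields in the sample above can be sketched in a few lines. This is an illustration and not the repository's code: `pad_features` is my own name, and the special ids 101 ([CLS]) and 102 ([SEP]) are assumptions read off the sample (103 appears there at the masked positions).

```python
from typing import List, Tuple


def pad_features(tokens_a: List[int], tokens_b: List[int], max_seq_length: int,
                 cls_id: int = 101, sep_id: int = 102) -> Tuple[List[int], List[int], List[int]]:
    """Build input_ids, input_mask and segment_ids, zero-padded to max_seq_length."""
    input_ids = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
    # Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers sentence B + [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    if len(input_ids) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(input_ids)
    pad = max_seq_length - len(input_ids)
    return (input_ids + [0] * pad,
            input_mask + [0] * pad,
            segment_ids + [0] * pad)


if __name__ == "__main__":
    ids, mask, seg = pad_features([7, 8], [9], max_seq_length=8)
    print(ids, mask, seg)
```

In the sample, the first [SEP] sits at index 20, which is why segment_ids has 21 zeros before the ones start, and the 24 trailing zeros in all three fields are padding up to max_seq_length=128.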

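2) Batch the data with tf.data; the function is input_fn_builder.

3) Build the model from the inputs input_ids, input_mask and segment_ids.

4) Embedding lookup and embedding post-processing, giving a tensor of shape batch * seq_len * emb_size.

5) Pass the result through transformer_model to get sequence_out of shape batch * seq_len * hidden_size. For sentence-level tasks such as classification, pool_out of shape batch * hidden_size can be used instead.

6) Compute mask_lm_loss: the input is sequence_out, the per-example loss has shape batch_size * mask_lm_ids_length, and the total loss is a scalar.

7) Compute next_seq_loss: the input is pool_out, which gives a batch_size * 2 classification; the per-example loss has shape batch_size * 1, and the total loss is a scalar.

8) The optimizer backpropagates the loss to compute gradients and updates the weights according to the learning rate.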
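Step 6 is where masked_lm_weights matters: padded prediction slots carry weight 0 and must not contribute to the loss. Below is a minimal pure-Python sketch of a weighted cross-entropy of that kind. It is my own illustration of the idea (the function names and the small 1e-5 stabilizer are assumptions), not a copy of the repository's TensorFlow code.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_masked_lm_loss(logits_per_slot: Sequence[Sequence[float]],
                            label_ids: Sequence[int],
                            label_weights: Sequence[float]) -> float:
    """Average cross-entropy over the prediction slots whose weight is non-zero."""
    numerator = 0.0
    denominator = 0.0
    for logits, label, weight in zip(logits_per_slot, label_ids, label_weights):
        per_slot_loss = -log_softmax(logits)[label]
        numerator += weight * per_slot_loss
        denominator += weight
    # The small constant avoids division by zero when every slot is padding.
    return numerator / (denominator + 1e-5)
```

With the sample record above, 16 of the 20 slots have weight 1.0, so only those 16 predictions would be averaged.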

Pre-training speed 1

NF5288M5: 96-100 examples/s

Pre-training speed 2

NF5488M5: 107-112 examples/s

BERT model construction process

http://www.lxweimin.com/p/d7ce41b58801

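(1) Model configuration

The model configuration is fairly simple. In order, the fields are: vocabulary size, hidden layer size, number of transformer layers, number of attention heads, activation function, intermediate layer size, hidden dropout ratio, attention dropout ratio, maximum sequence length, vocabulary size of token_type_ids, and the stddev of truncated_normal_initializer.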
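For reference, the fields listed above can be written out as a Python dict in the same order. The values are the ones I recall from the published BERT-Base, Uncased bert_config.json; treat them as an assumption and compare against the file in your own download. `attention_head_size` is a helper of my own.

```python
import json

# Field order follows the description above; values assumed from BERT-Base, Uncased.
BERT_BASE_CONFIG = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_act": "gelu",
    "intermediate_size": 3072,
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "initializer_range": 0.02,
}


def attention_head_size(config: dict) -> int:
    """Per-head width; hidden_size has to divide evenly across the heads."""
    hidden, heads = config["hidden_size"], config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return hidden // heads


if __name__ == "__main__":
    print(json.dumps(BERT_BASE_CONFIG, indent=2))
    print("head size:", attention_head_size(BERT_BASE_CONFIG))
```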

(2) Word embedding

(3) Post-processing of the word embeddings (adding position information and token-type information)
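The post-processing step boils down to adding three vectors per token. The sketch below shows only that sum with plain Python lists. The layer normalization and dropout that the real model applies afterwards are left out, and the function name and table layout are my own assumptions.

```python
from typing import List, Sequence


def embed_tokens(input_ids: Sequence[int],
                 segment_ids: Sequence[int],
                 word_table: Sequence[Sequence[float]],
                 type_table: Sequence[Sequence[float]],
                 position_table: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return word + token-type + position embeddings for each token of one sequence."""
    output = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, segment_ids)):
        word = word_table[token_id]           # looked up by token id
        token_type = type_table[segment_id]   # looked up by segment id (0 or 1)
        pos = position_table[position]        # looked up by index in the sequence
        output.append([w + t + p for w, t, p in zip(word, token_type, pos)])
    return output
```

The result has shape seq_len * emb_size for one sequence, which matches the batch * seq_len * emb_size tensor once a batch dimension is added.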

(4) Building the attention mask
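The attention mask expands the 2D input_mask (batch * seq_len) into a 3D mask (batch * from_seq_len * to_seq_len), so that every query position sees the same set of non-padding key positions. A minimal sketch with nested lists, assuming that behavior; `create_attention_mask` is my own name for it.

```python
from typing import List, Sequence


def create_attention_mask(input_mask: Sequence[Sequence[int]]) -> List[List[List[int]]]:
    """Expand [batch][seq_len] into [batch][from_seq_len][to_seq_len].

    mask[b][i][j] is 1 when position j of example b is a real token, regardless of i.
    """
    return [[list(row) for _ in row] for row in input_mask]


if __name__ == "__main__":
    print(create_attention_mask([[1, 1, 0]]))
```

Note that rows belonging to padding positions are not zeroed out here; only the key (to) side is masked.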

(5) Attention layer (multi-head attention)
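To make the role of the mask concrete, here is a single-head scaled dot-product attention in plain Python. It is a simplified sketch, not the multi-head layer itself: multi-head attention runs this same computation once per head on a slice of the hidden dimension and concatenates the results. The large negative additive penalty for masked keys (-10000.0 here) and the function name are my assumptions.

```python
import math
from typing import List, Sequence


def masked_attention(query: Sequence[Sequence[float]],
                     key: Sequence[Sequence[float]],
                     value: Sequence[Sequence[float]],
                     to_mask: Sequence[int]) -> List[List[float]]:
    """Single-head attention; to_mask[j] == 0 hides key/value position j."""
    depth = len(query[0])
    output = []
    for q in query:
        # Scaled dot product, plus a large negative number on masked keys.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(depth) + (1.0 - m) * -10000.0
            for k, m in zip(key, to_mask)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Weighted sum of the value vectors.
        output.append([
            sum(p * v[j] for p, v in zip(probs, value))
            for j in range(len(value[0]))
        ])
    return output
```

After the softmax, the masked positions end up with a probability of effectively zero, so padding tokens do not influence the output.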

(6) Transformer

(7) BERT model construction

Results returned by the BERT model

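BERT's main flow is to compute the embeddings first (including the position and token_type embeddings) and then call the transformer to get the output. The embedding output, the embedding table, the outputs of all transformer layers, the output of the last transformer layer and pooled_output are all available, for use in transfer-learning fine-tuning and in prediction tasks.

BERT uses only the encoder side of the transformer; there is no decoder. This is because the model exists purely to serve pre-training, and pre-training is done through language modeling, which differs from other task-specific NLP setups. A decoder can be added on your own when doing transfer learning.

Because there is no decoder, the attention function also drops a lot of functionality that it would otherwise need.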

BERT pre-training tips

1) Masked LM and next sentence prediction loss

```
***** Eval results *****
  global_step = 20
  loss = 0.0979674
  masked_lm_accuracy = 0.985479
  masked_lm_loss = 0.0979328
  next_sentence_accuracy = 1.0
  next_sentence_loss = 3.45724e-05
```
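The numbers in this log are consistent with the total loss being the sum of the two task losses. A quick check, using only the values printed above:

```python
# Values copied from the eval output above.
masked_lm_loss = 0.0979328
next_sentence_loss = 3.45724e-05

total = masked_lm_loss + next_sentence_loss

# Formatted to six significant digits, like the log line for `loss`.
print(f"loss = {total:.6g}")
```

So nearly all of the reported loss comes from the masked LM task; the next sentence prediction loss is negligible on this tiny sample.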

2) When switching to your own vocabulary, make sure vocab_size in the model configuration is updated to match it.
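One way to guard against a mismatch is to count the vocabulary entries and compare the count with the configured vocab_size before training. A minimal sketch, assuming one token per line in the vocabulary file; both function names are my own and nothing here comes from the BERT repository.

```python
def count_vocab_entries(vocab_path: str) -> int:
    """Count the lines of a vocabulary file (one token per line)."""
    with open(vocab_path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_vocab_size(vocab_path: str, vocab_size: int) -> int:
    """Raise if the vocabulary has more entries than the embedding table can hold."""
    entries = count_vocab_entries(vocab_path)
    if entries > vocab_size:
        raise ValueError(
            f"vocab file has {entries} entries but vocab_size is {vocab_size}"
        )
    return entries
```

If training starts from an existing checkpoint, vocab_size also has to agree with the embedding table stored in that checkpoint, which this check does not cover.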

3) Start training from an existing checkpoint when working with a domain-specific corpus, for example movie-review analysis.

4) The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).

5) Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of `max_seq_length`.
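The arithmetic behind this comparison is easy to check. The sketch below counts tokens per batch (a stand-in for the fully-connected cost) and query-key pairs per batch (a stand-in for the attention cost); it ignores constant factors such as the number of layers and heads, and the function names are my own.

```python
def tokens_per_batch(batch_size: int, seq_len: int) -> int:
    """Proxy for the per-token (fully-connected) cost."""
    return batch_size * seq_len


def attention_pairs_per_batch(batch_size: int, seq_len: int) -> int:
    """Proxy for the attention cost: every token attends to every token."""
    return batch_size * seq_len * seq_len


if __name__ == "__main__":
    for batch_size, seq_len in [(64, 512), (256, 128)]:
        print(batch_size, seq_len,
              tokens_per_batch(batch_size, seq_len),
              attention_pairs_per_batch(batch_size, seq_len))
```

Both batches contain 32,768 tokens, but the 512-length batch has 16,777,216 attention pairs against 4,194,304 for the 128-length batch, a factor of 4.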

6) Is this code compatible with Cloud TPUs? What about GPUs?

Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU. However, GPU training is single-GPU only.

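7) Why choose the BERT-Base, Uncased model? There are three reasons: 1. The training corpus is English, so the Chinese and multilingual models are ruled out. 2. Hardware is limited: if your GPU has less than 16 GB of memory, stick with Base and do not bother with Large. 3. Cased means the model distinguishes upper and lower case, and uncased means it does not. Unless you know that your task is case-sensitive (for example named entity recognition or part-of-speech tagging), uncased usually works better.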