Previously we made do with the dataset utilities under torch.utils.data for data processing, but its DataLoader requires samples of aligned length, and each data source needs its own ad-hoc handling.
PyTorch provides the torchtext.data module for text processing; combined with a Chinese word segmentation tool, it covers most day-to-day text processing needs.
This topic is an introduction to torchtext. It mainly covers the use of Field, Example, Dataset, and Vectors, and builds a text classification example with an LSTM network. torchtext turns out to be quite a powerful toolkit.
torchtext module structure
- The torchtext module consists of text data processing utilities and text datasets:
- Text data processing
  - torchtext.data
  - torchtext.data.utils
  - torchtext.data.functional
  - torchtext.data.metrics
  - torchtext.vocab
  - torchtext.utils
- Text datasets
  - torchtext.datasets
  - torchtext.experimental.datasets
  - examples
Text data processing
- Note:
  - We start using TorchText from torchtext.data.
torchtext.data structure
- torchtext.data contains the following components:
- Dataset, Batch, and Example
- Fields
- Iterators
- Pipeline
- Functions
- The core pattern of text processing:
  - Dataset specifies the text data source;
  - Field specifies how each field is processed;
  - Iterator traverses the dataset in batches;
- The diagram below shows the TorchText usage pattern.
  - Reference: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
(Figure: TorchText usage pattern diagram)
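- A minimal code sketch of this Field/Dataset/Iterator pattern (a hedged outline only; the field settings and real data rows are filled in properly in the example below):
from torchtext.data import Field, Example, Dataset, Iterator
text_field = Field(sequential=True)                      # 1. Field: how to process each column
label_field = Field(sequential=False, use_vocab=False)
fields = [("text", text_field), ("label", label_field)]
examples = [Example.fromlist(["some raw text", 1], fields)]  # hypothetical single row
dataset = Dataset(examples, fields)                      # 2. Dataset: the data source
text_field.build_vocab(dataset)                          # a vocabulary is needed before iterating
it = Iterator(dataset, batch_size=32)                    # 3. Iterator: batch traversal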
A TorchText usage example
- Below we walk through one example to illustrate the TorchText usage pattern:
  - Environment setup
  - Data source
  - Defining the fields (Field)
  - Building the dataset
  - Building batches
  - Word vectors and building the vocabulary
  - Using the dataset
Environment setup
- Install torchtext
pip install torchtext
- Note:
  - Because of a bug, it is recommended to install the fixed version directly from GitHub:
pip install https://github.com/pytorch/text/archive/master.zip
- Optional install 1 - tokenizer
pip install spacy
python -m spacy download en
  - spacy website: https://spacy.io/models/
- Note:
  - Downloading the trained model can fail because of network issues; you can install it like this instead:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
  - You can also download the file manually and then install it (which is what is done here).
- Optional install 2 - tokenizer
pip install sacremoses
- Install - tokenizer
  - jieba (Chinese word segmentation)
pip install jieba
Data source
- Download address: https://github.com/bigboNed3/chinese_text_cnn
- Downloaded files:
  - Training set: train.tsv
  - Test set: test.tsv
  - Validation set: dev.tsv
- Note:
  - The data could also be stored in other formats, such as plain text or JSON files.
- Data format:
  - index (a redundant column)
  - label
  - text
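- For illustration only, a few hypothetical rows of such a tab-separated file (the values are invented; the actual contents of train.tsv will differ):
    label   text
0   1       這家餐廳的菜味道很不錯
1   0       等了一個小時才上菜,體驗很差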
Defining the fields (Field)
Field class help
- There are two ways to set the parameters of a Field object:
  - pass them to the constructor
  - set them as attributes (we use attribute setting below)
from torchtext.data import Field
Field?
Init signature:
Field(
    sequential=True,
    use_vocab=True,
    init_token=None,
    eos_token=None,
    fix_length=None,
    dtype=torch.int64,
    preprocessing=None,
    postprocessing=None,
    lower=False,
    tokenize=None,
    tokenizer_language='en',
    include_lengths=False,
    batch_first=False,
    pad_token='<pad>',
    unk_token='<unk>',
    pad_first=False,
    truncate_first=False,
    stop_words=None,
    is_target=False,
)
Docstring:
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented
by tensors. It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.
Attributes:
sequential: Whether the datatype represents sequential data. If False,
no tokenization is applied. Default: True.
use_vocab: Whether to use a Vocab object. If False, the data in this
field should already be numerical. Default: True.
init_token: A token that will be prepended to every example using this
field, or None for no initial token. Default: None.
eos_token: A token that will be appended to every example using this
field, or None for no end-of-sentence token. Default: None.
fix_length: A fixed length that all examples using this field will be
padded to, or None for flexible sequence lengths. Default: None.
dtype: The torch.dtype class that represents a batch of examples
of this kind of data. Default: torch.long.
preprocessing: The Pipeline that will be applied to examples
using this field after tokenizing but before numericalizing. Many
Datasets replace this attribute with a custom preprocessor.
Default: None.
postprocessing: A Pipeline that will be applied to examples using
this field after numericalizing but before the numbers are turned
into a Tensor. The pipeline function takes the batch as a list, and
the field's Vocab.
Default: None.
lower: Whether to lowercase the text in this field. Default: False.
tokenize: The function used to tokenize strings using this field into
sequential examples. If "spacy", the SpaCy tokenizer is
used. If a non-serializable function is passed as an argument,
the field will not be able to be serialized. Default: string.split.
tokenizer_language: The language of the tokenizer to be constructed.
Various languages currently supported only in SpaCy.
include_lengths: Whether to return a tuple of a padded minibatch and
a list containing the lengths of each examples, or just a padded
minibatch. Default: False.
batch_first: Whether to produce tensors with the batch dimension first.
Default: False.
pad_token: The string token used as padding. Default: "<pad>".
unk_token: The string token used to represent OOV words. Default: "<unk>".
pad_first: Do the padding of the sequence at the beginning. Default: False.
truncate_first: Do the truncating of the sequence at the beginning. Default: False
stop_words: Tokens to discard during the preprocessing step. Default: None
is_target: Whether this field is a target variable.
Affects iteration over batches. Default: False
File: c:\program files\python36\lib\site-packages\torchtext\data\field.py
Type: type
Subclasses: ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field
Field class explained
CLASS torchtext.data.Field(
    sequential=True,         # Whether the data is sequential. Default True; if False, no tokenization is applied.
    use_vocab=True,          # Whether to use a Vocab object. Default True; if False, the data in this field must already be numerical.
    init_token=None,         # A token prepended to every example of this field, or None for no initial token.
    eos_token=None,          # A token appended to every example of this field, or None for no end-of-sentence token.
    fix_length=None,         # Fixed length to pad every example of this field to, or None for flexible lengths.
    dtype=torch.int64,       # The torch dtype of the resulting batches.
    preprocessing=None,      # Pipeline applied after tokenizing but before numericalizing.
    postprocessing=None,     # Pipeline applied after numericalizing but before the numbers are turned into a Tensor.
    lower=False,             # Whether to lowercase the text.
    tokenize=None,           # Function used to split the text into a list of tokens; defaults to string.split. If "spacy" is given, the SpaCy tokenizer is used.
    tokenizer_language='en', # Language of the tokenizer; languages other than 'en' are currently only supported via SpaCy.
    include_lengths=False,   # Whether to return a tuple of (padded minibatch, list of lengths) instead of just the padded minibatch.
    batch_first=False,       # Whether to produce tensors with the batch dimension first (as required by modules such as LSTM with batch_first=True).
    pad_token='<pad>',       # Token string used for padding.
    unk_token='<unk>',       # Token string used for out-of-vocabulary (OOV) words.
    pad_first=False,         # Pad at the beginning of the sequence (True) or at the end (False).
    truncate_first=False,    # When the text exceeds the length, truncate from the beginning (True) or from the end (False).
    stop_words=None,         # Tokens (stop words) to discard during preprocessing.
    is_target=False)         # Whether this field is the target/label; affects iteration over batches.
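- As an aside, the same settings can also be passed directly to the constructor instead of being assigned as attributes; a minimal sketch, equivalent to the attribute-based setup used below:
from torchtext.data import Field
# label: not sequential, values are already numeric
fld_label = Field(sequential=False, use_vocab=False)
# text: sequential, build a vocabulary (a tokenize function is attached later)
fld_text = Field(sequential=True, use_vocab=True)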
Example of building the Field objects
- Based on the data source above, there are three columns:
  - index (no field name)
  - label
  - text (features)
- Construct the default objects
  - The index is not a column we need, so no Field is defined for it.
from torchtext.data import Field
fld_label = Field()
fld_text = Field()
- Set the basic attributes
# The label field is simple
fld_label.sequential = False  # defaults to True
fld_label.use_vocab = False   # defaults to True
# The text (feature) field
fld_text.sequential = True    # defaults to True
fld_text.use_vocab = True     # defaults to True
# Since sequential is True, a tokenize function must be specified
- Set the tokenize attribute to the word segmentation function
  - Requirements for this function:
    - Parameter: the feature of one sample (the text field) as a string.
    - Return: a list of tokens, so the field's data becomes a list of words instead of a raw string.
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut
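- As a quick sanity check, word_cut can be called directly on a short sentence; the exact segmentation depends on jieba's dictionary, so the result in the comment is only indicative:
print(word_cut("今天的天氣真不錯"))
# e.g. ['今天', '的', '天氣', '真', '不錯']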
Building the dataset
Dataset help
from torchtext.data import Dataset
Dataset?
Init signature: Dataset(examples, fields, filter_pred=None)
Docstring:
Defines a dataset composed of Examples along with its Fields.
Attributes:
sort_key (callable): A key to use for sorting dataset examples for batching
together examples with similar lengths to minimize padding.
examples (list(Example)): The examples in this dataset.
fields (dict[str, Field]): Contains the name of each column or field, together
with the corresponding Field object. Two fields with the same Field object
will have a shared vocabulary.
Init docstring:
Create a dataset from a list of Examples and Fields.
Arguments:
examples: List of Examples.
fields (List(tuple(str, Field))): The Fields to use in this tuple. The
string is a field name, and the Field is the associated field.
filter_pred (callable or None): Use only examples for which
filter_pred(example) is True, or use all examples if None.
Default is None.
File: c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
Type: type
Subclasses: TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20
Dataset attributes explained
- Constructor:
Dataset(examples, fields, filter_pred=None)
# examples: the data, a list of Example objects.
# fields: the field descriptions, a list of tuple(str, Field).
# filter_pred: a predicate for filtering the dataset; a callable that decides per example whether it is kept (True means use it). If None, all examples are used.
- Attributes:
  - sort_key: a callable
  - examples: list(Example)
  - fields: dict[str, Field]
Building the fields for the dataset
Dataset expects the fields as a list: fields (List(tuple(str, Field)))
The example below builds the fields from start to finish
from torchtext.data import Field
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. Field definitions needed by the dataset: fields (List(tuple(str, Field)))
fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False  # defaults to True
fld_label.use_vocab = False   # defaults to True
# The text (feature) field
fld_text.sequential = True    # defaults to True
fld_text.use_vocab = True     # defaults to True
# Since sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut
# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]  # two fields
fields
[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
('label', <torchtext.data.field.Field at 0x2a9d2336780>)]
Example help
from torchtext.data import Example
help(Example)
Help on class Example in module torchtext.data.example:
class Example(builtins.object)
| Defines a single training or test example.
|
| Stores each column of the example as an attribute.
|
| Class methods defined here:
|
| fromCSV(data, fields, field_to_index=None) from builtins.type
|
| fromJSON(data, fields) from builtins.type
|
| fromdict(data, fields) from builtins.type
|
| fromlist(data, fields) from builtins.type
|
| fromtree(data, fields, subtrees=False) from builtins.type
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
Building an Example object
- Example provides a set of class methods for constructing Example objects; this is the so-called factory pattern.
- The arguments are the data and the field descriptions.
- The data and the fields must correspond in length.
from torchtext.data import Field
from torchtext.data import Example
# Use the fields built above; this also serves as a check that those fields are correct
one_example = Example.fromlist(["我是數據,很長的數據", 1], fields)  # 1 is the label
one_example
<torchtext.data.example.Example at 0x2a9d22e2ba8>
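- Each column is stored on the Example as an attribute named after its field, so the tokenized text and the label can be inspected directly (the token list depends on jieba's segmentation):
print(one_example.text)   # the word list produced by word_cut
print(one_example.label)  # 1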
Building the list of Examples
- Building the Example list requires the data from the data source.
- It could be written out literally as [..., ..., ...], but since there is a lot of data we build it in a loop.
- The data is in csv format; the delimiter is reflected in the file extension:
  - csv: Comma-Separated Values
  - tsv: Tab-Separated Values
import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. The examples list needed by the dataset (list(Example)):
# Read the tsv file with pandas; other approaches, such as the csv module, also work.
data = pd.read_csv("datasets/train.tsv", sep='\t')  # csv: Comma-Separated Values, tsv: Tab-Separated Values
examples = []
for txt, lab in zip(data["text"], data["label"]):
    one_example = Example.fromlist([txt, lab], fields)
    examples.append(one_example)
examples[0:5]  # show 5
[<torchtext.data.example.Example at 0x2a9d233c4a8>,
<torchtext.data.example.Example at 0x2a9b5fee9e8>,
<torchtext.data.example.Example at 0x2a9d8fae9e8>,
<torchtext.data.example.Example at 0x2a9d8faea90>,
<torchtext.data.example.Example at 0x2a9d8fae9b0>]
Building the dataset
- Build the dataset with the Dataset constructor
Dataset(examples, fields, filter_pred=None)
from torchtext.data import Dataset
# This Dataset differs from the one in torch.utils.data: torch.utils.data's DataLoader requires aligned data, i.e. every record must have the same length.
dataset = Dataset(examples, fields)
dataset
<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>
A closer look at the dataset
- Dataset provides functions for working with the data; the help documentation below shows them.
- In particular, it provides element access:
__getitem__(self, i)
__len__(self)
__iter__(self)
- split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
  - splits the dataset into a training set and a test set (see the sketch after this list).
- filter_examples(self, field_names)
  - removes unknown words from the examples for the given field names.
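- A minimal sketch of split on the dataset built above (the 0.7 ratio is just an example):
# Split into a training part and a test part (70% / 30%)
train_set, test_set = dataset.split(split_ratio=0.7)
print(len(train_set), len(test_set))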
help(dataset)
Help on Dataset in module torchtext.data.dataset object:
class Dataset(torch.utils.data.dataset.Dataset)
| Defines a dataset composed of Examples along with its Fields.
|
| Attributes:
| sort_key (callable): A key to use for sorting dataset examples for batching
| together examples with similar lengths to minimize padding.
| examples (list(Example)): The examples in this dataset.
| fields (dict[str, Field]): Contains the name of each column or field, together
| with the corresponding Field object. Two fields with the same Field object
| will have a shared vocabulary.
|
| Method resolution order:
| Dataset
| torch.utils.data.dataset.Dataset
| builtins.object
|
| Methods defined here:
|
| __getattr__(self, attr)
|
| __getitem__(self, i)
|
| __init__(self, examples, fields, filter_pred=None)
| Create a dataset from a list of Examples and Fields.
|
| Arguments:
| examples: List of Examples.
| fields (List(tuple(str, Field))): The Fields to use in this tuple. The
| string is a field name, and the Field is the associated field.
| filter_pred (callable or None): Use only examples for which
| filter_pred(example) is True, or use all examples if None.
| Default is None.
|
| __iter__(self)
|
| __len__(self)
|
| filter_examples(self, field_names)
| Remove unknown words from dataset examples with respect to given field.
|
| Arguments:
| field_names (list(str)): Within example only the parts with field names in
| field_names will have their unknown words deleted.
|
| split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
| Create train-test(-valid?) splits from the instance's examples.
|
| Arguments:
| split_ratio (float or List of floats): a number [0, 1] denoting the amount
| of data to be used for the training split (rest is used for test),
| or a list of numbers denoting the relative sizes of train, test and valid
| splits respectively. If the relative size for valid is missing, only the
| train-test split is returned. Default is 0.7 (for the train set).
| stratified (bool): whether the sampling should be stratified.
| Default is False.
| strata_field (str): name of the examples Field stratified over.
| Default is 'label' for the conventional label field.
| random_state (tuple): the random seed used for shuffling.
| A return value of `random.getstate()`.
|
| Returns:
| Tuple[Dataset]: Datasets for train, validation, and
| test splits in that order, if the splits are provided.
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| download(root, check=None) from builtins.type
| Download and unzip an online archive (.zip, .gz, or .tgz).
|
| Arguments:
| root (str): Folder to download data to.
| check (str or None): Folder whose existence indicates
| that the dataset has already been downloaded, or
| None to check the existence of root/{cls.name}.
|
| Returns:
| str: Path to extracted dataset.
|
| splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
| Create Dataset objects for multiple splits of a dataset.
|
| Arguments:
| path (str): Common prefix of the splits' file paths, or None to use
| the result of cls.download(root).
| root (str): Root dataset storage directory. Default is '.data'.
| train (str): Suffix to add to path for the train set, or None for no
| train set. Default is None.
| validation (str): Suffix to add to path for the validation set, or None
| for no validation set. Default is None.
| test (str): Suffix to add to path for the test set, or None for no test
| set. Default is None.
| Remaining keyword arguments: Passed to the constructor of the
| Dataset (sub)class being used.
|
| Returns:
| Tuple[Dataset]: Datasets for train, validation, and
| test splits in that order, if provided.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| sort_key = None
|
| ----------------------------------------------------------------------
| Methods inherited from torch.utils.data.dataset.Dataset:
|
| __add__(self, other)
|
| ----------------------------------------------------------------------
| Data descriptors inherited from torch.utils.data.dataset.Dataset:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
- Dataset traversal, method 1: indexing
# Dataset access and traversal
for i in range(5):  # len(dataset)
    print(dataset[i])
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>
- Dataset traversal, method 2: iteration
# Dataset access and traversal
for one_ex in dataset:
    print(one_ex)
    break
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
Building batches
- The data in the dataset is accessed through an iterator. As the examples above show, the dataset itself does not expose the concrete data values; it provides no standard interface for that.
Iterator help
- The Iterator class can likewise be used in two ways:
  - via the constructor:
__init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
  - via the class method:
splits(datasets, batch_sizes=None, **kwargs)
- Iterator also supports traversal, but what it yields are batches.
__iter__(self)
__len__(self)
- Returning the data directly:
data(self)
from torchtext.data import Iterator
help(Iterator)
Help on class Iterator in module torchtext.data.iterator:
class Iterator(builtins.object)
| Defines an iterator that loads batches of data from a Dataset.
|
| Attributes:
| dataset: The Dataset object to load Examples from.
| batch_size: Batch size.
| batch_size_fn: Function of three arguments (new example to add, current
| count of examples in the batch, and current effective batch size)
| that returns the new effective batch size resulting from adding
| that example to a batch. This is useful for dynamic batching, where
| this function would add to the current effective batch size the
| number of tokens in the new example.
| sort_key: A key to use for sorting examples in order to batch together
| examples with similar lengths and minimize padding. The sort_key
| provided to the Iterator constructor overrides the sort_key
| attribute of the Dataset, or defers to it if None.
| train: Whether the iterator represents a train set.
| repeat: Whether to repeat the iterator for multiple epochs. Default: False.
| shuffle: Whether to shuffle examples between epochs.
| sort: Whether to sort examples according to self.sort_key.
| Note that shuffle and sort default to train and (not train).
| sort_within_batch: Whether to sort (in descending order according to
| self.sort_key) within each batch. If None, defaults to self.sort.
| If self.sort is True and this is False, the batch is left in the
| original (ascending) sorted order.
| device (str or `torch.device`): A string or instance of `torch.device`
| specifying which device the Variables are going to be created on.
| If left as default, the tensors will be created on cpu. Default: None.
|
| Methods defined here:
|
| __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
| Initialize self. See help(type(self)) for accurate signature.
|
| __iter__(self)
|
| __len__(self)
|
| create_batches(self)
|
| data(self)
| Return the examples in the dataset in order, sorted, or shuffled.
|
| init_epoch(self)
| Set up the batch generator for a new epoch.
|
| load_state_dict(self, state_dict)
|
| state_dict(self)
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| splits(datasets, batch_sizes=None, **kwargs) from builtins.type
| Create Iterator objects for multiple splits of a dataset.
|
| Arguments:
| datasets: Tuple of Dataset objects corresponding to the splits. The
| first such object should be the train set.
| batch_sizes: Tuple of batch sizes to use for the different splits,
| or None to use the same batch_size for all splits.
| Remaining keyword arguments: Passed to the constructor of the
| iterator class being used.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| epoch
Building Iterator objects with the splits class method
- The key parameters of splits are datasets and batch_sizes:
  - datasets: a tuple/list of datasets;
  - batch_sizes: must match datasets in length;
from torchtext.data import Iterator
print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, ))
it_dataset, len(it_dataset)
6300
(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)
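- Equivalently, a single Iterator can also be created directly through the constructor; a minimal sketch:
# One iterator over the whole dataset, 100 examples per batch
it_dataset = Iterator(dataset, batch_size=100)
print(len(it_dataset))  # number of batches (63 for the 6300 examples above)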
Word vectors and building the vocabulary
The Iterator built above cannot work yet, because it needs a vocabulary; only through the vocabulary can text be converted into numbers (the vocabulary is built from token frequency counts).
- There are two ways to build the vocabulary:
  - use pretrained word vectors, specified with the vectors parameter
  - use the default behaviour, setting vectors = None
Pretrained word vectors
- Here we only care about Chinese; for English, spacy and sacremoses can be used.
  - Download address: https://github.com/Embedding/Chinese-Word-Vectors
- The downloaded word vector file is 700+ MB, a fairly hefty file.
- Loading the word vector file
from torchtext.vocab import Vectors
# Loading takes a while.
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors
0%| | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]
<torchtext.vocab.Vectors at 0x2a9d9b11ac8>
Building the vocabulary with the word vectors
# The text field uses the pretrained word vectors
fld_text.build_vocab(dataset, vectors=vectors)  # see the vectors loaded above
# Labels are integers, so no word vectors are needed.
fld_label.build_vocab(dataset)
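- A quick way to inspect the resulting vocabulary (the exact numbers depend on the data and on the pretrained vector file):
print(len(fld_text.vocab))           # vocabulary size
print(fld_text.vocab.vectors.shape)  # (vocabulary size, embedding dimension)
print(fld_text.vocab.stoi['<pad>'])  # index of the padding token (1 by default)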
Using the dataset
Traversal
- Now it_dataset (the Iterator) can be used to iterate over the dataset.
__iter__(self)
__len__(self)
- Note: there is no __getitem__ method, so the data can only be accessed by iteration.
for item in it_dataset:
    print(item)
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 54x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 48x100]
[.label]:[torch.LongTensor of size 100]
.......
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 54x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 53x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 48x100]
[.label]:[torch.LongTensor of size 100]
Fetching the data
- Fetching the text
for item in it_dataset:
    print(item.text)  # item.label
tensor([[ 284, 2568, 115, ..., 66, 62, 14],
[1041, 2, 990, ..., 848, 92, 158],
[ 445, 369, 17, ..., 19, 585, 1103],
...,
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1]])
......
tensor([[ 96, 548, 197, ..., 45, 12, 47],
[ 635, 1167, 62, ..., 1036, 1306, 10],
[9668, 14, 14, ..., 357, 1329, 36],
...,
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1]])
- Fetching the labels
for item in it_dataset:
    print(item.label)
tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0])
......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1])
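- For debugging it can be useful to map the index tensors back to tokens through the vocabulary; a minimal sketch that looks at the first sample of one batch (output depends on the data and shuffling):
for item in it_dataset:
    first_sample = item.text[:, 0]  # text is (sequence length, batch size) since batch_first is False here
    words = [fld_text.vocab.itos[int(idx)] for idx in first_sample]
    print(words)
    break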
Applying TorchText to text classification
Dataset handling
Wrapping it in a function
import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False  # defaults to True
fld_label.use_vocab = False   # defaults to True
# The text (feature) field
fld_text.sequential = True    # defaults to True
fld_text.use_vocab = True     # defaults to True
fld_text.batch_first = True
# Since sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut
# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]  # two fields
def load_data(data_file):
    data = pd.read_csv(data_file, sep='\t')  # csv: Comma-Separated Values, tsv: Tab-Separated Values
    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)
    dataset = Dataset(examples, fields)
    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, ))  # if a batch is too large, the GPU can run out of memory
    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors=vectors)  # see the word vectors above
    # Labels are integers, so no word vectors are needed.
    fld_label.build_vocab(dataset)
    return it_dataset
Loading the training and validation sets
- Dataset files:
  - training set: train.tsv
  - validation set: valid.tsv
it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_train
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.
(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
<torchtext.data.iterator.Iterator at 0x1c9406f3588>)
The model
- We simply use an LSTM as the model.
import torch
import torch.nn as nn
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        # Bidirectional LSTM, so the classifier input is 2 * hidden_dim
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, text):
        # text: (batch, sequence length) because batch_first=True on the Field
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        # Concatenate the last forward and backward hidden states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden.squeeze(0))
Training
The core training function
- Parameters:
  - training set iterator
  - validation set iterator
  - model
import torch.nn.functional as F
def train(train_iter, valid_iter, model):
    # Training hyperparameters
    EPOCHES = 10
    CUDA = torch.cuda.is_available()  # GPU memory may be insufficient
    # CUDA = False
    if CUDA:
        model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, EPOCHES):
        for batch in train_iter:  # training set
            feature, target = batch.text, batch.label
            if CUDA:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()
        # Evaluate prediction accuracy on the validation set
        corrects = 0.0
        with torch.no_grad():
            # sample_num: number of samples seen
            sample_num = 0
            for item in valid_iter:
                feature, target = item.text, item.label
                if CUDA:
                    feature, target = feature.cuda(), target.cuda()
                logits = model(feature)
                corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
                sample_num += len(feature)
            print(F"Epoch: {epoch:03d},\tAccuracy: {corrects/sample_num}")
Preparing for training
- This requires:
  - the parameters needed to build the network
  - variables from the vectorization step, such as the vocabulary
  - the datasets (already prepared)
# Parameters
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# Build the network model
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)
11361 300 128 4
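- Note that the code above does not copy the pretrained vectors into the embedding layer, so the embeddings start from random initialization. A commonly used, optional extra step (assumed here, not part of the original run):
# Initialize the embedding layer with the pretrained vectors stored in the vocab
net.embedding.weight.data.copy_(fld_text.vocab.vectors)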
Train and validate
print("Start training....")
train(it_train, it_valid, net)
# Save the model
torch.save(net.state_dict(), "rnn.model")
Start training....
Epoch: 001,	Accuracy: 0.9114285707473755
Epoch: 002,	Accuracy: 0.9372857213020325
Epoch: 003,	Accuracy: 0.9451428651809692
Epoch: 004,	Accuracy: 0.9494285583496094
Epoch: 005,	Accuracy: 0.9472857117652893
Epoch: 006,	Accuracy: 0.9490000009536743
Epoch: 007,	Accuracy: 0.951714277267456
Epoch: 008,	Accuracy: 0.953000009059906
Epoch: 009,	Accuracy: 0.9485714435577393
Appendix:
- The prediction code is straightforward, so it is not listed here.