TORCH04-01 TorchText for Text Dataset Processing

  We used to make do with the dataset utilities under torch.utils.data, but their DataLoader requires samples of equal length, and each data source needs its own ad-hoc handling.
  Torch provides the torchtext.data module for text processing; combined with a Chinese word-segmentation tool, it covers most day-to-day text-processing needs.
  This topic is an introduction to torchtext: it mainly covers the use of Field, Example, Dataset, and Vectors, and builds a text-classification example with an LSTM network. torchtext really is a formidable toolkit.


The torchtext module structure

  • The torchtext module contains text-processing utilities and text datasets

    1. Text data processing
      1. torchtext.data
      2. torchtext.data.utils
      3. torchtext.data.functional
      4. torchtext.data.metrics
      5. torchtext.vocab
      6. torchtext.utils
    2. Text datasets
      1. torchtext.datasets
      2. torchtext.experimental.datasets
      3. examples
  • Note:

    • We start with torchtext.data as the entry point to TorchText.

The torchtext.data structure

  • torchtext.data

    1. Dataset, Batch, and Example
    2. Fields
    3. Iterators
    4. Pipeline
    5. Functions
  • The core pattern of text processing:

    1. Dataset specifies the text data source;
    2. Field specifies how a column is processed;
    3. Iterator traverses the dataset in batches;
  • Below is a diagram of the TorchText usage pattern, followed by a code sketch of the same pattern

    • Reference: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
(Figure: TorchText usage-pattern diagram)
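  • Condensed into code, the pattern looks roughly like this (a minimal sketch over toy inline data; the full version is developed step by step below):

from torchtext.data import Field, Example, Dataset, Iterator

TEXT = Field(sequential=True)                         # Field: how a column is processed
LABEL = Field(sequential=False, use_vocab=False)
fields = [("text", TEXT), ("label", LABEL)]

examples = [Example.fromlist(["hello torchtext world", 1], fields)]
dataset = Dataset(examples, fields)                   # Dataset: the data source
TEXT.build_vocab(dataset)                             # vocabulary: token -> index
it, = Iterator.splits((dataset,), batch_sizes=(1,))   # Iterator: batch-wise traversal
for batch in it:
    print(batch.text, batch.label)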

A TorchText usage example

  • Let's walk through the TorchText usage pattern with an example.
    1. Environment setup
    2. Data source
    3. Defining the fields (Field)
    4. Building the dataset
    5. Building batches
    6. Word vectors and building the vocabulary
    7. Using the dataset

Environment setup

  1. Install torchtext
    • pip install torchtext
  • Note:
    • Because of a bug, it is recommended to install the fixed version directly from GitHub:
      • pip install https://github.com/pytorch/text/archive/master.zip
  2. Optional install 1 - tokenizer
    • pip install spacy
    • python -m spacy download en
  • spacy website:
    • https://spacy.io/models/
  • Note:
    • The model download can fail for network reasons; you can install it directly instead:
      • pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
      • You can also download the file manually and then install it (the approach used here).
  3. Optional install 2 - tokenizer
    • pip install sacremoses
  4. Install - Chinese tokenizer
    • jieba word segmentation
    • pip install jieba

Data source

  • Download address

    • https://github.com/bigboNed3/chinese_text_cnn
  • Downloaded files:

    • Training set: train.tsv
    • Test set: test.tsv
    • Validation set: dev.tsv
  • Note:

    • The data could just as well be stored in other formats, e.g. text or json files.
  • Data format:

    • index (a redundant column)
    • label
    • text
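  • For illustration, a .tsv row is tab-separated in that order (the two rows below are made up, not taken from the actual dataset):

	label	text
0	1	這家店很不錯(cuò),推薦!
1	0	質(zhì)量太差,不會(huì)再買了。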

Defining the fields (Field)

Field class help

  • There are two ways to set the parameters of a Field object:
    1. In the constructor
    2. Via attributes (we use attribute assignment below)
from torchtext.data import Field

Field?
Init signature:
Field(
    sequential=True,
    use_vocab=True,
    init_token=None,
    eos_token=None,
    fix_length=None,
    dtype=torch.int64,
    preprocessing=None,
    postprocessing=None,
    lower=False,
    tokenize=None,
    tokenizer_language='en',
    include_lengths=False,
    batch_first=False,
    pad_token='<pad>',
    unk_token='<unk>',
    pad_first=False,
    truncate_first=False,
    stop_words=None,
    is_target=False,
)
Docstring:
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented
by tensors.  It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.

Attributes:
    sequential: Whether the datatype represents sequential data. If False,
        no tokenization is applied. Default: True.
    use_vocab: Whether to use a Vocab object. If False, the data in this
        field should already be numerical. Default: True.
    init_token: A token that will be prepended to every example using this
        field, or None for no initial token. Default: None.
    eos_token: A token that will be appended to every example using this
        field, or None for no end-of-sentence token. Default: None.
    fix_length: A fixed length that all examples using this field will be
        padded to, or None for flexible sequence lengths. Default: None.
    dtype: The torch.dtype class that represents a batch of examples
        of this kind of data. Default: torch.long.
    preprocessing: The Pipeline that will be applied to examples
        using this field after tokenizing but before numericalizing. Many
        Datasets replace this attribute with a custom preprocessor.
        Default: None.
    postprocessing: A Pipeline that will be applied to examples using
        this field after numericalizing but before the numbers are turned
        into a Tensor. The pipeline function takes the batch as a list, and
        the field's Vocab.
        Default: None.
    lower: Whether to lowercase the text in this field. Default: False.
    tokenize: The function used to tokenize strings using this field into
        sequential examples. If "spacy", the SpaCy tokenizer is
        used. If a non-serializable function is passed as an argument,
        the field will not be able to be serialized. Default: string.split.
    tokenizer_language: The language of the tokenizer to be constructed.
        Various languages currently supported only in SpaCy.
    include_lengths: Whether to return a tuple of a padded minibatch and
        a list containing the lengths of each examples, or just a padded
        minibatch. Default: False.
    batch_first: Whether to produce tensors with the batch dimension first.
        Default: False.
    pad_token: The string token used as padding. Default: "<pad>".
    unk_token: The string token used to represent OOV words. Default: "<unk>".
    pad_first: Do the padding of the sequence at the beginning. Default: False.
    truncate_first: Do the truncating of the sequence at the beginning. Default: False
    stop_words: Tokens to discard during the preprocessing step. Default: None
    is_target: Whether this field is a target variable.
        Affects iteration over batches. Default: False
File:           c:\program files\python36\lib\site-packages\torchtext\data\field.py
Type:           type
Subclasses:     ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field

Field class notes

CLASS torchtext.data.Field(
    sequential=True,         # Whether the data is sequential; default True. If False, no tokenization is applied.
    use_vocab=True,          # Whether to use a Vocab object; default True. If False, the data in this field must already be numerical.
    init_token=None,         # Token prepended to every example, or None for no initial token.
    eos_token=None,          # Token appended to every example, or None for no end-of-sentence token.
    fix_length=None,         # Pad every example to this fixed length, or None for flexible lengths.
    dtype=torch.int64,       # The torch dtype of the produced batches.
    preprocessing=None,      # Pipeline applied after tokenizing but before numericalizing.
    postprocessing=None,     # Pipeline applied after numericalizing but before tensor conversion.
    lower=False,             # Whether to lowercase the text.
    tokenize=None,           # Function that splits text into a list of tokens; defaults to string.split. If "spacy", the SpaCy tokenizer is used.
    tokenizer_language='en', # Language of the tokenizer; languages other than 'en' are currently supported only via SpaCy.
    include_lengths=False,   # Whether to return just a padded minibatch, or a tuple of (padded minibatch, list of sample lengths).
    batch_first=False,       # Whether the batch dimension comes first (required by LSTM and other modules configured that way).
    pad_token='<pad>',       # The string token used for padding.
    unk_token='<unk>',       # The string token for OOV words. OOV (out-of-vocabulary) means a word not in the vocabulary.
    pad_first=False,         # Where to pad: at the beginning (True) or at the end (False).
    truncate_first=False,    # How to truncate over-long text: drop from the beginning (True) or from the end (False).
    stop_words=None,         # Tokens to discard during preprocessing (stop words).
    is_target=False)         # Whether this field is a target variable.

An example of constructing the Field objects

  • Based on the data source above, we have three columns:
    1. index (no field name)
    2. label
    3. text (the features)
  1. Construct default objects
    • Since the index column is not data we need, no field is defined for it.
from torchtext.data import Field
fld_label = Field()
fld_text = Field()
  2. Set the basic attributes
# The label field is simple
fld_label.sequential = False     # default is True
fld_label.use_vocab = False      # default is True

# The feature field
fld_text.sequential = True       # default is True
fld_text.use_vocab = True        # default is True
# Since sequential is True, the tokenize attribute must be set

  3. Set the tokenize attribute, i.e. the tokenizer function
    • Requirements for this function:
      1. Parameter: one sample's feature text (the text column)
      2. Return: a list holding the segmentation result, so the field's data is no longer a string but a list of words.
import re
import jieba

regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')   # keep only CJK characters, letters and digits

def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

fld_text.tokenize = word_cut
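  • A quick sanity check of the tokenizer (the sentence is made up, and jieba's exact segmentation may differ):

print(word_cut("今天天氣很好"))   # e.g. ['今天', '天氣', '很', '好']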

Building the dataset

Dataset help

from torchtext.data import Dataset
Dataset?
Init signature: Dataset(examples, fields, filter_pred=None)
Docstring:
Defines a dataset composed of Examples along with its Fields.

Attributes:
    sort_key (callable): A key to use for sorting dataset examples for batching
        together examples with similar lengths to minimize padding.
    examples (list(Example)): The examples in this dataset.
    fields (dict[str, Field]): Contains the name of each column or field, together
        with the corresponding Field object. Two fields with the same Field object
        will have a shared vocabulary.
?[1;31mInit docstring:?[0m
Create a dataset from a list of Examples and Fields.

Arguments:
    examples: List of Examples.
    fields (List(tuple(str, Field))): The Fields to use in this tuple. The
        string is a field name, and the Field is the associated field.
    filter_pred (callable or None): Use only examples for which
        filter_pred(example) is True, or use all examples if None.
        Default is None.
File:           c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
Type:           type
Subclasses:     TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20

Dataset attributes

  1. Constructor:
    Dataset(examples, fields, filter_pred=None)
          # examples: the data, a list of Example objects.
          # fields: the field list, of type list(tuple(str, Field)).
          # filter_pred: a filter condition, a callable; a sample is kept if the call returns True. If None, all samples are used. (A usage sketch follows the Dataset construction below.)
  2. Attributes:
    1. sort_key: type callable
    2. examples: type list(Example)
    3. fields: type dict[str, Field]

Building the fields for the dataset

  • Dataset expects the fields as a list: fields (List(tuple(str, Field)))

  • The example below is a complete construction of the fields

from torchtext.data import Field
import re
import jieba

regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. The Fields the dataset needs: fields (List(tuple(str, Field)))

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # default is True
fld_label.use_vocab = False      # default is True

# The feature field
fld_text.sequential = True       # default is True
fld_text.use_vocab = True        # default is True

# Since sequential is True, the tokenize attribute must be set
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

fields
[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
 ('label', <torchtext.data.field.Field at 0x2a9d2336780>)]

Example help

from torchtext.data import Example
help(Example)
Help on class Example in module torchtext.data.example:

class Example(builtins.object)
 |  Defines a single training or test example.
 |  
 |  Stores each column of the example as an attribute.
 |  
 |  Class methods defined here:
 |  
 |  fromCSV(data, fields, field_to_index=None) from builtins.type
 |  
 |  fromJSON(data, fields) from builtins.type
 |  
 |  fromdict(data, fields) from builtins.type
 |  
 |  fromlist(data, fields) from builtins.type
 |  
 |  fromtree(data, fields, subtrees=False) from builtins.type
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Constructing an Example object

  • Example provides a set of class methods for building Example objects; this is the factory pattern at work.
    • The parameters are the data plus the field descriptions.
    • The data and the fields must correspond element for element.
from torchtext.data import Field
from torchtext.data import Example

# Reuse the fields built above; this also verifies that they were defined correctly
one_example = Example.fromlist(["我是數(shù)據(jù),很長的數(shù)據(jù)", 1], fields)     # 1 is the label
one_example
<torchtext.data.example.Example at 0x2a9d22e2ba8>
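  • Each column of the Example is stored as an attribute named after its field, and the text has already passed through the tokenizer:

print(one_example.text)    # the tokenized word list, e.g. ['我', '是', '數(shù)據(jù)', ...]
print(one_example.label)   # 1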

Building the list of Examples

  • Building the Example list requires data from the data source.
    • It could be written literally as [..., ..., ...]; since there is plenty of data, we build it in a loop below.
    • The data is in csv format, and the delimiter is reflected in the file extension:
      • csv: Comma-Separated Values
      • tsv: Tab-Separated Values
import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. The examples list the dataset needs (list(Example)):
# Read the csv file with pandas; other approaches work too, e.g. the csv library.
data = pd.read_csv("datasets/train.tsv", sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

examples = []
for txt, lab in zip(data["text"], data["label"]):
    one_example = Example.fromlist([txt, lab], fields)
    examples.append(one_example)
examples[0:5]    # show 5
[<torchtext.data.example.Example at 0x2a9d233c4a8>,
 <torchtext.data.example.Example at 0x2a9b5fee9e8>,
 <torchtext.data.example.Example at 0x2a9d8fae9e8>,
 <torchtext.data.example.Example at 0x2a9d8faea90>,
 <torchtext.data.example.Example at 0x2a9d8fae9b0>]

Constructing the Dataset

  • Build the dataset with the Dataset constructor
    • Dataset(examples, fields, filter_pred=None)
from torchtext.data import Dataset

# This Dataset differs from torch.utils.data's Dataset: the torch.utils.data DataLoader requires rectangular data, i.e. every record has the same length.
dataset = Dataset(examples, fields)
dataset
<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>
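  • The filter_pred parameter is a cheap way to drop unwanted samples at construction time; for example (a sketch, the threshold is arbitrary):

filtered = Dataset(examples, fields, filter_pred=lambda ex: len(ex.text) >= 2)   # keep samples with at least 2 tokens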

Understanding the Dataset in depth

  • A Dataset should offer functions for working with its data; the help text below shows what is available.
    • In particular, it provides data access through:
      1. __getitem__(self, i)
      2. __len__(self)
      3. __iter__(self)
      4. split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
        • splits the dataset: training set + test set (see the sketch after the help text)
      5. filter_examples(self, field_names)
        • removes unknown words from the given fields.
help(dataset)
Help on Dataset in module torchtext.data.dataset object:

class Dataset(torch.utils.data.dataset.Dataset)
 |  Defines a dataset composed of Examples along with its Fields.
 |  
 |  Attributes:
 |      sort_key (callable): A key to use for sorting dataset examples for batching
 |          together examples with similar lengths to minimize padding.
 |      examples (list(Example)): The examples in this dataset.
 |      fields (dict[str, Field]): Contains the name of each column or field, together
 |          with the corresponding Field object. Two fields with the same Field object
 |          will have a shared vocabulary.
 |  
 |  Method resolution order:
 |      Dataset
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, attr)
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, examples, fields, filter_pred=None)
 |      Create a dataset from a list of Examples and Fields.
 |      
 |      Arguments:
 |          examples: List of Examples.
 |          fields (List(tuple(str, Field))): The Fields to use in this tuple. The
 |              string is a field name, and the Field is the associated field.
 |          filter_pred (callable or None): Use only examples for which
 |              filter_pred(example) is True, or use all examples if None.
 |              Default is None.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  filter_examples(self, field_names)
 |      Remove unknown words from dataset examples with respect to given field.
 |      
 |      Arguments:
 |          field_names (list(str)): Within example only the parts with field names in
 |              field_names will have their unknown words deleted.
 |  
 |  split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
 |      Create train-test(-valid?) splits from the instance's examples.
 |      
 |      Arguments:
 |          split_ratio (float or List of floats): a number [0, 1] denoting the amount
 |              of data to be used for the training split (rest is used for test),
 |              or a list of numbers denoting the relative sizes of train, test and valid
 |              splits respectively. If the relative size for valid is missing, only the
 |              train-test split is returned. Default is 0.7 (for the train set).
 |          stratified (bool): whether the sampling should be stratified.
 |              Default is False.
 |          strata_field (str): name of the examples Field stratified over.
 |              Default is 'label' for the conventional label field.
 |          random_state (tuple): the random seed used for shuffling.
 |              A return value of `random.getstate()`.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if the splits are provided.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  download(root, check=None) from builtins.type
 |      Download and unzip an online archive (.zip, .gz, or .tgz).
 |      
 |      Arguments:
 |          root (str): Folder to download data to.
 |          check (str or None): Folder whose existence indicates
 |              that the dataset has already been downloaded, or
 |              None to check the existence of root/{cls.name}.
 |      
 |      Returns:
 |          str: Path to extracted dataset.
 |  
 |  splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
 |      Create Dataset objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          path (str): Common prefix of the splits' file paths, or None to use
 |              the result of cls.download(root).
 |          root (str): Root dataset storage directory. Default is '.data'.
 |          train (str): Suffix to add to path for the train set, or None for no
 |              train set. Default is None.
 |          validation (str): Suffix to add to path for the validation set, or None
 |              for no validation set. Default is None.
 |          test (str): Suffix to add to path for the test set, or None for no test
 |              set. Default is None.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              Dataset (sub)class being used.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if provided.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  sort_key = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __add__(self, other)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
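  • The split method carves out train/test subsets (a minimal sketch; the 0.8 ratio is arbitrary):

train_set, test_set = dataset.split(split_ratio=0.8)   # 80% train, 20% test
print(len(train_set), len(test_set))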
  1. Dataset traversal, method 1
# Dataset access and traversal
for i in range(5):  #len(dataset)
    print(dataset[i])
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>
  2. Dataset traversal, method 2
# Dataset access and traversal
for one_ex in dataset: 
    print(one_ex)
    break
<torchtext.data.example.Example object at 0x000002A9D233C4A8>

Building batches

  • The dataset's data is consumed through an iterator. As the examples above show, you cannot reach concrete values from the dataset itself; no standard interface is provided for that.

Iterator help

  • As with Field, there are two ways to create an Iterator:
    1. The constructor:
      • __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
    2. The class method
      • splits(datasets, batch_sizes=None, **kwargs)
  • Iterator supports traversal, only it traverses batches:
    • __iter__(self)
    • __len__(self)
  • Return the data directly:
    • data(self)
from torchtext.data import Iterator
help(Iterator)
Help on class Iterator in module torchtext.data.iterator:

class Iterator(builtins.object)
 |  Defines an iterator that loads batches of data from a Dataset.
 |  
 |  Attributes:
 |      dataset: The Dataset object to load Examples from.
 |      batch_size: Batch size.
 |      batch_size_fn: Function of three arguments (new example to add, current
 |          count of examples in the batch, and current effective batch size)
 |          that returns the new effective batch size resulting from adding
 |          that example to a batch. This is useful for dynamic batching, where
 |          this function would add to the current effective batch size the
 |          number of tokens in the new example.
 |      sort_key: A key to use for sorting examples in order to batch together
 |          examples with similar lengths and minimize padding. The sort_key
 |          provided to the Iterator constructor overrides the sort_key
 |          attribute of the Dataset, or defers to it if None.
 |      train: Whether the iterator represents a train set.
 |      repeat: Whether to repeat the iterator for multiple epochs. Default: False.
 |      shuffle: Whether to shuffle examples between epochs.
 |      sort: Whether to sort examples according to self.sort_key.
 |          Note that shuffle and sort default to train and (not train).
 |      sort_within_batch: Whether to sort (in descending order according to
 |          self.sort_key) within each batch. If None, defaults to self.sort.
 |          If self.sort is True and this is False, the batch is left in the
 |          original (ascending) sorted order.
 |      device (str or `torch.device`): A string or instance of `torch.device`
 |          specifying which device the Variables are going to be created on.
 |          If left as default, the tensors will be created on cpu. Default: None.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  create_batches(self)
 |  
 |  data(self)
 |      Return the examples in the dataset in order, sorted, or shuffled.
 |  
 |  init_epoch(self)
 |      Set up the batch generator for a new epoch.
 |  
 |  load_state_dict(self, state_dict)
 |  
 |  state_dict(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  splits(datasets, batch_sizes=None, **kwargs) from builtins.type
 |      Create Iterator objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          datasets: Tuple of Dataset objects corresponding to the splits. The
 |              first such object should be the train set.
 |          batch_sizes: Tuple of batch sizes to use for the different splits,
 |              or None to use the same batch_size for all splits.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              iterator class being used.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  epoch

Constructing Iterator objects with the splits class method

  • The key parameters of splits are datasets and batch_sizes
    • datasets: a tuple of Dataset objects;
    • batch_sizes: a tuple of batch sizes matching datasets;
    • len(it_dataset) is the number of batches: 6300 examples at 100 per batch gives 63.
from torchtext.data import Iterator
print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, )) 
it_dataset, len(it_dataset)
6300

(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)

Word vectors and building the vocabulary

  • The Iterator we built cannot work yet: it needs a vocabulary, which is what converts text into numbers (it is built from token frequencies)

  • Two ways to build the vocabulary

    • use pre-trained word vectors, passed in via the vectors parameter
    • use the default word vectors, setting vectors = None

Pre-trained word vectors

  • Here we only care about Chinese; for English, spacy and sacremoses can be used
    • Download address: https://github.com/Embedding/Chinese-Word-Vectors
  • The downloaded word-vector file

    • 700+ MB, a fairly hefty file.
  • Loading the word-vector file

from torchtext.vocab import Vectors
# loading takes a while.
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors
  0%|          | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]

<torchtext.vocab.Vectors at 0x2a9d9b11ac8>

Building the vocabulary with the word vectors

# The text field uses the pre-trained word vectors
fld_text.build_vocab(dataset, vectors=vectors)   # the vectors loaded above

# The labels are integers; no word vectors needed.
fld_label.build_vocab(dataset)
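  • A few quick checks on the result (a sketch; the printed values depend on the corpus):

print(len(fld_text.vocab))             # vocabulary size
print(fld_text.vocab.itos[:5])         # index -> token; starts with specials such as '<unk>' and '<pad>'
print(fld_text.vocab.vectors.shape)    # (vocab_size, 300) for these 300-dimensional vectors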

Using the dataset

Iteration

  • Now it_dataset, the Iterator, can be used to iterate over the dataset.
    • __iter__(self)
    • __len__(self)
    • Note: there is no __getitem__ method; iteration is the only way in.
for item  in  it_dataset:
    print(item)
[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

 .......

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 53x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

Extracting data

  1. Get the text
for item in it_dataset:
    print(item.text)    # item.label
tensor([[ 284, 2568,  115,  ...,   66,   62,   14],
        [1041,    2,  990,  ...,  848,   92,  158],
        [ 445,  369,   17,  ...,   19,  585, 1103],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
......
tensor([[  96,  548,  197,  ...,   45,   12,   47],
        [ 635, 1167,   62,  ..., 1036, 1306,   10],
        [9668,   14,   14,  ...,  357, 1329,   36],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
  2. Get the labels
for item in it_dataset:
    print(item.label)
tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        1, 0, 1, 0])

......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1])

Applying TorchText in text classification

Dataset processing

Wrapping it in a function

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # default is True
fld_label.use_vocab = False      # default is True

# The feature field
fld_text.sequential = True       # default is True
fld_text.use_vocab = True        # default is True
fld_text.batch_first = True      # the LSTM below is built with batch_first=True

# Since sequential is True, the tokenize attribute must be set
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

def load_data(data_file):
    data = pd.read_csv(data_file, sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)

    dataset = Dataset(examples, fields)

    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, ))    # too large a batch can overflow GPU memory

    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors=vectors)   # the pre-trained vectors from above
    # The labels are integers; no word vectors needed.
    fld_label.build_vocab(dataset)

    return it_dataset
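  • Note: numericalization happens lazily when batches are produced, so after the two calls below every iterator uses whichever vocabulary was built last (here, the validation set's). Building the vocabulary once on the training set and reusing it would be the more conventional setup.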

Loading the training and validation sets

  • Dataset files:
    • Training set: train.tsv
    • Validation set: valid.tsv
it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_valid
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.


(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
 <torchtext.data.iterator.Iterator at 0x1c9406f3588>)

Model

  • The model is a bidirectional, two-layer LSTM
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)   # *2: forward + backward hidden states
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: (batch, seq_len) -> embedded: (batch, seq_len, embedding_dim)
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        # hidden[-2] / hidden[-1]: the last layer's forward / backward final states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))

        return self.fc(hidden.squeeze(0))
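  • A quick shape check on random token indices (a minimal sketch; all sizes are arbitrary):

toy = RNN(vocab_size=100, embedding_dim=8, hidden_dim=16, output_dim=2)
x = torch.randint(0, 100, (4, 12))   # a batch of 4 sequences of length 12 (batch_first)
print(toy(x).shape)                  # torch.Size([4, 2])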

Training

The core training function

  • Parameters:
    1. the training set iterator
    2. the validation set iterator
    3. the model
import torch.nn.functional as F
def train(train_iter, valid_iter, model):
    # training hyperparameters
    EPOCHES = 10
    CUDA = torch.cuda.is_available()   # mind the GPU memory
    # CUDA = False
    if CUDA:
        model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, EPOCHES):
        for batch in train_iter:  # training set
            feature, target = batch.text, batch.label
            if CUDA:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()

        # measure prediction accuracy on the validation set
        corrects = 0.0
        with torch.no_grad():
            # sample_num: number of samples seen
            sample_num = 0
            for item in valid_iter:
                feature, target = item.text, item.label
                if CUDA:
                    feature, target = feature.cuda(), target.cuda()
                logits = model(feature)
                corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
                sample_num += len(feature)
            print(F"Epoch: {epoch:03d},\taccuracy: {corrects/sample_num}")

Preparing for training

  • This involves:
    • the parameters needed to construct the network
      • e.g. the vocabulary and related variables produced during vectorization
    • the datasets (already prepared)
# parameters
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# construct the network model
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)
11361 300 128 4
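  • Note: len(fld_label.vocab) comes out as 4 rather than 2, most likely because build_vocab adds the '<unk>' and '<pad>' specials alongside the labels 0 and 1; the extra output units are simply never the target, so cross-entropy training still works.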

Train and validate

print("Start training....")
train(it_train, it_valid, net)

# save the model
torch.save(net.state_dict(), "rnn.model")
Start training....
Epoch: 001,	accuracy: 0.9114285707473755
Epoch: 002,	accuracy: 0.9372857213020325
Epoch: 003,	accuracy: 0.9451428651809692
Epoch: 004,	accuracy: 0.9494285583496094
Epoch: 005,	accuracy: 0.9472857117652893
Epoch: 006,	accuracy: 0.9490000009536743
Epoch: 007,	accuracy: 0.951714277267456
Epoch: 008,	accuracy: 0.953000009059906
Epoch: 009,	accuracy: 0.9485714435577393

Appendix:

  • The prediction code is straightforward, so it is not listed here; a possible sketch follows.
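  • For completeness, a minimal sketch of what prediction could look like (not the original author's code; it assumes the fields, word_cut, and net defined above, and a model on the CPU):

def predict(model, sentence):
    model.eval()
    tokens = word_cut(sentence)                            # tokenize exactly like the training data
    indices = [fld_text.vocab.stoi[t] for t in tokens]     # numericalize via the vocabulary
    x = torch.tensor(indices).unsqueeze(0)                 # shape (1, seq_len); the field is batch_first
    with torch.no_grad():
        logits = model(x)
    return torch.max(logits, 1)[1].item()                  # predicted class index

# move the model back to the CPU first if it was trained on the GPU:
# print(predict(net.cpu(), "一個(gè)假設(shè)的例句"))   # hypothetical input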