ACE 2005 Corpus Event Preprocessing (English)

ACE 2005 Corpus

Note: the ACE 2005 corpus is not freely downloadable; it must be purchased.

Events (English)

Event preprocessing mainly depends on:

  1. tokenizer
  2. entity
  3. event

The English event sample preprocessing therefore extracts the data listed above, producing output in the format shown below.

sample.json

[
  {
    "sentence": "He visited all his friends.",
    "tokens": ["He", "visited", "all", "his", "friends", "."],
    "pos-tag": ["PRP", "VBD", "PDT", "PRP$", "NNS", "."],
    "golden-entity-mentions": [
      {
        "text": "He", 
        "entity-type": "PER:Individual",
        "start": 0,
        "end": 0
      },
      {
        "text": "his",
        "entity-type": "PER:Group",
        "start": 3,
        "end": 3
      },
      {
        "text": "all his friends",
        "entity-type": "PER:Group",
        "start": 2,
        "end": 5
      }
    ],
    "golden-event-mentions": [
      {
        "trigger": {
          "text": "visited",
          "start": 1,
          "end": 1
        },
        "arguments": [
          {
            "role": "Entity",
            "entity-type": "PER:Individual",
            "text": "He",
            "start": 0,
            "end": 0
          },
          {
            "role": "Entity",
            "entity-type": "PER:Group",
            "text": "all his friends",
            "start": 2,
            "end": 5
          }
        ],
        "event_type": "Contact:Meet"
      }
    ],
    "parse": "(ROOT\n  (S\n    (NP (PRP He))\n    (VP (VBD visited)\n      (NP (PDT all) (PRP$ his) (NNS friends)))\n    (. .)))"
  }
]
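
Once the preprocessing has produced output in this format, it can be inspected with the standard json module. A minimal sketch, assuming a file named sample.json in the schema shown above:

import json

# Load the preprocessed ACE 2005 samples (same schema as sample.json above).
with open('sample.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    print(sample['sentence'])
    # Entity mentions carry token-level start/end offsets into sample['tokens'].
    for entity in sample['golden-entity-mentions']:
        print('  entity:', entity['text'], entity['entity-type'])
    # Event mentions carry a trigger plus role-labelled arguments.
    for event in sample['golden-event-mentions']:
        print('  trigger:', event['trigger']['text'], '->', event['event_type'])
        for arg in event['arguments']:
            print('    argument:', arg['role'], arg['text'])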

Parsing Code

github: https://github.com/nlpcl-lab/ace2005-preprocessing

For how to run it and the required dependencies, refer to the repository's "README.md"; however, the following issues came up in actual use.

Environment

  1. python3 >= 3.7
  2. nltk
  3. Stanford CoreNLP

nltk

pip install nltk

However, when running the code you may hit the error "Resource punkt not found.".

Automatic installation:

import nltk
nltk.download('punkt')
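
A quick way to verify that punkt is now available is to tokenize a sentence; this is a minimal sketch and the sentence is arbitrary:

from nltk.tokenize import word_tokenize

# Succeeds only if the punkt models can be found; otherwise it raises the
# "Resource punkt not found." LookupError mentioned above.
print(word_tokenize('He visited all his friends.'))
# ['He', 'visited', 'all', 'his', 'friends', '.']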

Manual installation:

The nltk documentation describes it as follows:

Create a folder nltk_data, e.g. C:\nltk_data, or /usr/local/share/nltk_data, and subfolders chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers.

Download individual packages from http://nltk.org/nltk_data/ (see the “download” links). Unzip them to the appropriate subfolder. For example, the Brown Corpus, found at: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip is to be unzipped to nltk_data/corpora/brown.

Concrete steps:

  1. Download punkt from http://www.nltk.org/nltk_data/.
  2. Create a tokenizers folder under C:\nltk_data or /usr/local/share/nltk_data, then unzip the punkt package downloaded in the previous step into it. The final directory layout should be: your_path/nltk_data/tokenizers/punkt (see the check after this list).
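To confirm that nltk can locate the manually installed package, the following check can be used (a minimal sketch):

import nltk

# Prints the directories nltk searches, e.g. C:\nltk_data or
# /usr/local/share/nltk_data.
print(nltk.data.path)

# Raises LookupError if punkt is not found under any of those directories.
nltk.data.find('tokenizers/punkt')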

Stanford CoreNLP Installation

pip install stanfordcorenlp

然后,下載資源包 http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip unzip stanford-corenlp-full-2018-10-05.zip

Unzip the resource package into a suitable directory:

with StanfordCoreNLP('your_path/stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp:

The path to the unzipped package is passed into the code as shown above when it is used.
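
A minimal usage sketch with the stanfordcorenlp wrapper, assuming 'your_path/...' is replaced with the actual unzip location (the sentence is the one from sample.json above):

from stanfordcorenlp import StanfordCoreNLP

sentence = 'He visited all his friends.'

# Start a local CoreNLP server from the unzipped package; it is stopped
# when the with-block exits.
with StanfordCoreNLP('your_path/stanford-corenlp-full-2018-10-05',
                     memory='8g', timeout=60000) as nlp:
    print(nlp.word_tokenize(sentence))  # tokens, cf. the "tokens" field
    print(nlp.pos_tag(sentence))        # POS tags, cf. the "pos-tag" field
    print(nlp.parse(sentence))          # constituency parse, cf. the "parse" field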
