Quickly Importing a Billion Records into the hugegraph Graph Database

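Having worked through "Quick Start with the hugegraph Graph Database" and "hugegraph Graph Database Concepts Explained", you probably want to practice by loading a real dataset of meaningful size into hugegraph. This article uses a public dataset from Stanford to show how to quickly import more than a billion records into the hugegraph graph database.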

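1. Environment Preparation

Before importing data into hugegraph, a few prerequisites must be in place: install the hugegraph-server service and download the import tool hugegraph-loader. Follow the documentation to install hugegraph-server and download hugegraph-loader; running both on the same machine is sufficient.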

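2. Data Preparation

2.1 Downloading the Raw Data

This article uses Friendster, one of Stanford's public datasets, as the example. The dataset is about 31 GB; please download it yourself.

Once the download finishes, let's look at the file contents.

First 10 lines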

$ head -10 com-friendster.ungraph.txt
# Undirected graph: ../../data/output/friendster.txt
# Friendster
# Nodes: 65608366 Edges: 1806067135
# FromNodeId    ToNodeId
101 102
101 104
101 107
101 125
101 165
101 168

Last 10 lines

$ tail -10 com-friendster.ungraph.txt
124802963   124804978
124802963   124814064
124804978   124805174
124804978   124805533
124804978   124814064
124805174   124805533
124806381   124806684
124806596   124809830
124814064   124829667
124820359   124826374

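As you can see, the file structure is simple: each line represents one edge and has two columns, the source vertex Id and the target vertex Id, separated by \t. The first few lines of the file are summary information stating that the file contains 65608366 vertices and 1806067135 edges (lines). Both the first 10 lines and the vertex count show that many vertices appear repeatedly across these 1806067135 lines. This is expected: an edge list cannot describe graph structure directly, so a vertex is repeated once for every edge it belongs to.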
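To make the file layout concrete, here is a minimal Python sketch that reads edge-list lines in this format. The function name iter_edges and the in-memory sample are illustrative assumptions, not part of hugegraph or the dataset tooling.

```python
import io


def iter_edges(lines):
    """Yield (source_id, target_id) integer pairs from edge-list lines."""
    for line in lines:
        # Summary lines such as "# Nodes: ... Edges: ..." carry no edges
        if line.startswith("#"):
            continue
        line = line.strip()
        if not line:
            continue
        source, target = line.split("\t")
        yield int(source), int(target)


# A tiny in-memory stand-in for com-friendster.ungraph.txt
sample = io.StringIO(
    "# Nodes: 65608366 Edges: 1806067135\n"
    "# FromNodeId\tToNodeId\n"
    "101\t102\n"
    "101\t104\n"
)
edges = list(iter_edges(sample))
print(edges)  # [(101, 102), (101, 104)]
```

For the real file, the same generator could be fed an open file handle instead of the sample, so that the 31 GB file is streamed rather than loaded into memory.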

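Readers familiar with hugegraph-loader will know that it does not currently support importing both vertices and edges in a single pass over one file. We therefore need to preprocess the edge file: deduplicate all vertex Ids and write them to a separate vertex file, so that hugegraph-loader can import vertices and edges separately.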

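2.2 Data Processing

The key to the processing is deduplication. Ignoring data volume for the moment, we can deduplicate and write to a new file as follows:

Define an in-memory set container so we can test whether an Id has already been seen
Read the source file line by line, parsing two integer Ids from each line
For each Id, check whether the set contains it; if not, add it to the set and write it to the new file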
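The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name write_unique_ids is made up for this example, and the edges are passed in as already-parsed integer pairs.

```python
import io


def write_unique_ids(edges, out):
    """Write each vertex Id to `out` the first time it is seen; return the count."""
    seen = set()
    for source, target in edges:
        for vertex_id in (source, target):
            if vertex_id not in seen:
                seen.add(vertex_id)
                out.write(f"{vertex_id}\n")
    return len(seen)


out = io.StringIO()
count = write_unique_ids([(101, 102), (101, 104), (102, 104)], out)
print(count)                  # 3
print(out.getvalue().split())  # ['101', '102', '104']
```

Ids are written in order of first appearance, which is why the set is consulted before every write.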

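An in-memory set is all we need to deduplicate; that is the core idea of the data processing. One question remains, though: can the set hold every distinct vertex Id? Let's work it out: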

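// 65608366 vertex Ids
// A raw 32-bit integer is only 4 bytes, but an Id held in an in-memory set costs more; budget roughly 32 bytes per Id
(65608366 * 32) / (1024 * 1024 * 1024) ≈ 1.96G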
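The same arithmetic can be checked with a short Python sketch. The 32-bytes-per-Id figure is the budget used in the estimate above; the 4-byte figure is the size of a raw 32-bit integer and is shown only for comparison. Actual per-element cost depends on the language and container, so treat both numbers as rough bounds rather than measurements.

```python
VERTEX_COUNT = 65608366
GIB = 1024 ** 3


def estimate_gib(count, bytes_per_id):
    """Memory needed to hold `count` Ids at `bytes_per_id` bytes each, in GiB."""
    return count * bytes_per_id / GIB


print(round(estimate_gib(VERTEX_COUNT, 32), 2))  # 1.96
print(round(estimate_gib(VERTEX_COUNT, 4), 2))   # 0.24
```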

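Fortunately, the vast majority of machines today have enough memory for just under 2 GB of data, unless you haven't replaced your computer in more than a decade, so you can write your own script following the logic above to deduplicate quickly.

Still, I'll introduce a more general approach below, in case your next dataset's vertex Ids need 3.9 GB, 5.9 GB, or 7.9 GB of memory, at which point some readers' machines probably won't cope.

The approach is common in large-scale data processing, and its core idea is divide and conquer:

Split all the original vertex Ids into several roughly equal parts such that no Id appears in more than one part, though duplicates are allowed within a part;
Apply the deduplication method above to each part's file.

How do we split all vertex Ids into roughly equal parts? Since the vertex Ids are consecutive numbers, we can hash by remainder and write every Id with the same remainder to the same file. For example, to split into 10 parts, create 10 files numbered 0-9 and take each vertex Id modulo 10: Ids with remainder 0 go to file 0, Ids with remainder 1 go to file 1, and so on.

I've written a script that follows this logic: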

#!/usr/bin/env python3
# coding=utf-8
import os


def ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix, index):
    # Open the shard file for this index on first use and cache the handle
    if index not in shard_file_dict:
        name = shard_file_path + shard_prefix + str(index)
        shard_file_dict[index] = open(name, "w")


if __name__ == '__main__':

    raw_file_path = "path/raw_file.txt"
    output_file_path = "path/de_dup.txt"
    shard_file_path = "path/shard/"
    shard_prefix = "shard_"
    shard_count = 100
    shard_file_dict = {}

    # Make sure the shard directory exists before writing into it
    os.makedirs(shard_file_path, exist_ok=True)

    # Split into many shard files
    with open(raw_file_path, "r") as raw_file:
        # Read next line
        for raw_line in raw_file:
            # Skip comment line
            if raw_line.startswith('#'):
                continue
            parts = raw_line.split('\t')
            assert len(parts) == 2

            source_node_id = int(parts[0])
            target_node_id = int(parts[1])
            # Calculate the residue by shard_count
            source_node_residue = source_node_id % shard_count
            target_node_residue = target_node_id % shard_count

            # Create new file if it doesn't exist
            ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix, source_node_residue)
            ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix, target_node_residue)

            # Append to file with corresponding index
            shard_file_dict[source_node_residue].write(str(source_node_id) + '\n')
            shard_file_dict[target_node_residue].write(str(target_node_id) + '\n')

    print("Split original file into %s shard files" % shard_count)

    # Close all files
    for shard_file in shard_file_dict.values():
        shard_file.close()

    print("Prepare to deduplicate and merge shard files into %s" % output_file_path)
    line_count = 0

    # Deduplicate and merge into another file
    with open(output_file_path, "w") as merge_file:
        # Visit shards in index order so the output order is deterministic
        for index in sorted(shard_file_dict.keys()):
            name = shard_file_path + shard_prefix + str(index)
            with open(name, "r") as shard_file:
                # Ids seen so far in this shard only; replaced for each new shard
                elems = set()
                # Read next line
                for raw_line in shard_file:
                    # Filter duplicate elems
                    if raw_line not in elems:
                        elems.add(raw_line)
                        merge_file.write(raw_line)
                        line_count += 1
            print("Processed shard file %s" % name)

    print("Processed all shard files and merged into %s" % output_file_path)
    print("%s lines after processing the file" % line_count)

    print("Finished")

Before using this script, change raw_file_path, output_file_path, and shard_file_path to your own paths.

After processing, let's look at the deduplicated vertex file:

$ head -10 com-friendster.ungraph.vertex.txt
1007000
310000
1439000
928000
414000
1637000
1275000
129000
2537000
5356000

Check how many lines the file has:

$ wc -l com-friendster.ungraph.vertex.txt
65608366 com-friendster.ungraph.vertex.txt

As you can see, the count matches the file's description.
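If you prefer to automate this check rather than compare the numbers by eye, a small Python sketch along these lines could read the declared counts from the summary line of the edge file. The helper names parse_declared_counts and count_lines are assumptions made for this example.

```python
import re


def parse_declared_counts(header_line):
    """Extract (nodes, edges) from a line like '# Nodes: 1 Edges: 2'."""
    match = re.search(r"Nodes:\s*(\d+)\s+Edges:\s*(\d+)", header_line)
    if match is None:
        raise ValueError("no node/edge counts in header line")
    return int(match.group(1)), int(match.group(2))


def count_lines(lines):
    """Count lines from any iterable, e.g. an open file handle."""
    return sum(1 for _ in lines)


nodes, edges = parse_declared_counts("# Nodes: 65608366 Edges: 1806067135")
print(nodes, edges)                 # 65608366 1806067135
print(count_lines(["1\n", "2\n"]))  # 2
```

Passing the open vertex file to count_lines and comparing the result with nodes gives the same confirmation as wc -l.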

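Besides the method described here, there are certainly other ways to do this, such as the big-data workhorse MapReduce. Choose whichever you like, as long as it extracts the vertex Ids and deduplicates them.

3. Import Preparation

3.1 Building the Graph Model

Since neither vertices nor edges have any properties other than the Id, the graph schema is very simple.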

schema.propertyKey("id").asInt().ifNotExist().create();
// Use id as the primary key
schema.vertexLabel("person").primaryKeys("id").properties("id").ifNotExist().create();
schema.edgeLabel("friend").sourceLabel("person").targetLabel("person").ifNotExist().create();

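3.2 Writing the Input Source Mapping File

There is only one vertex file and one edge file here, and both use \t as the delimiter, so set input.format to TEXT and leave input.delimiter at its default.

A vertex has one property, id, but the vertex file has no header row naming its column, so we must explicitly set input.header to ["id"]. input.header tells hugegraph-loader the name of each column in the file. Note, however, that a column name is not necessarily a vertex or edge property name; the description file has a mapping field that maps column names to property names.

Edges have no properties; the edge file contains only the Ids of the source and target vertices. We first set input.header to ["source_id", "target_id"], giving the two Id columns distinct names. We then set source and target to ["source_id"] and ["target_id"] respectively. source and target tell hugegraph-loader which columns in the file the source and target vertex Ids are related to.

Note what "related to" means here. When the vertex Id strategy is PRIMARY_KEY, the columns given by source and target are the primary-key columns (after mapping is applied), which are concatenated to generate the vertex Id. When the vertex Id strategy is CUSTOMIZE_STRING or CUSTOMIZE_NUMBER, the columns given by source and target are the Id columns themselves (after mapping is applied).

Because the vertex Id strategy here is PRIMARY_KEY, the columns ["source_id"] and ["target_id"] given by source and target are used as primary-key columns. With source_id and target_id both mapped to id in the mapping field, hugegraph-loader knows that after parsing a value from the source_id column it should interpret it as id:value and then use the vertex Id concatenation algorithm to generate the source vertex Id (the target vertex is handled the same way).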

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "path/com-friendster.ungraph.vertex.txt",
        "format": "TEXT",
        "header": ["id"],
        "charset": "UTF-8"
      }
    }
  ],
  "edges": [
    {
      "label": "friend",
      "source": ["source_id"],
      "target": ["target_id"],
      "input": {
        "type": "file",
        "path": "path/com-friendster.ungraph.txt",
        "format": "TEXT",
        "header": ["source_id", "target_id"],
        "comment_symbols": ["#"]
      },
      "mapping": {
        "source_id": "id",
        "target_id": "id"
      }
    }
  ]
}

Because the first few lines of the edge file are comments, "comment_symbols": ["#"] makes hugegraph-loader ignore lines beginning with #.
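To illustrate how header, source, target, mapping, and comment_symbols fit together, here is a small Python sketch that mimics the steps described above for a single edge line. It is not hugegraph-loader's code: the function resolve_edge_line is hypothetical, and the final step of concatenating the primary-key values into a vertex Id is deliberately left out because its exact format is not described here.

```python
def resolve_edge_line(line, header, source, target, mapping, comment_symbols):
    """Return (source_keys, target_keys) for one line, or None for a comment line."""
    if any(line.startswith(symbol) for symbol in comment_symbols):
        return None
    # header assigns a name to each tab-separated column
    row = dict(zip(header, line.rstrip("\n").split("\t")))

    def to_property(column):
        # mapping renames a column to a property key, e.g. source_id -> id
        return mapping.get(column, column)

    source_keys = {to_property(c): int(row[c]) for c in source}
    target_keys = {to_property(c): int(row[c]) for c in target}
    return source_keys, target_keys


# The same values as the "edges" entry of the mapping file
edge_config = {
    "header": ["source_id", "target_id"],
    "source": ["source_id"],
    "target": ["target_id"],
    "mapping": {"source_id": "id", "target_id": "id"},
    "comment_symbols": ["#"],
}
print(resolve_edge_line("101\t102\n", **edge_config))      # ({'id': 101}, {'id': 102})
print(resolve_edge_line("# Friendster\n", **edge_config))  # None
```

The two distinct column names are what allow both columns to map onto the same property key id without colliding.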

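For more about mapping files, see the section on writing input source mapping files in the hugegraph-loader documentation on the official site.

4. Running the Import

Change to the hugegraph-loader directory and run the following command (remember to adjust the paths):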

$ bin/hugegraph-loader -g hugegraph -f ../data/com-friendster/struct.json -s ../data/com-friendster/schema.groovy --check-vertex false

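hugegraph-loader then starts importing the data and prints progress to the console. Once all vertices and edges have been imported, you'll see the following statistics: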

Vertices has been imported: 65608366
Edges has been imported: 1806067135
---------------------------------------------
vertices results:
    parse failure vertices   :  0
    insert failure vertices  :  0
    insert success vertices  :  65608366
---------------------------------------------
edges results:
    parse failure edges      :  0
    insert failure edges     :  0
    insert success edges     :  1806067135
---------------------------------------------
time results:
    vertices loading time    :  200
    edges loading time       :  8089
    total loading time       :  8289

The import rates for vertices and edges are 65608366 / 200 = 328041.83 vertices/second and 1806067135 / 8089 = 223274.46 edges/second, respectively.
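The rates can be reproduced with a few lines of Python. This sketch assumes, as the calculation above does, that the loading times in the statistics are reported in seconds.

```python
def rate(count, seconds):
    """Average number of elements imported per second."""
    return count / seconds


vertex_rate = rate(65608366, 200)
edge_rate = rate(1806067135, 8089)
total_hours = 8289 / 3600

print(f"{vertex_rate:.2f} vertices/s")  # 328041.83 vertices/s
print(f"{edge_rate:.2f} edges/s")       # 223274.46 edges/s
print(f"{total_hours:.2f} hours")       # 2.30 hours
```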
