
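As the ratio of DB size to memory keeps growing, the memory footprint of filter/index blocks becomes non-negligible. Although cache_index_and_filter_blocks allows only a subset of the filter/index blocks to be cached in the block cache, their sheer size still hurts RocksDB performance in two ways:

They occupy block cache space that could otherwise be used to cache data.
After a cache miss they have to be loaded back into memory, which increases the load on disk storage.

The sections below describe these problems in more detail and explain how partitioning the index/filter reduces the overhead.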

How large are the index/filter blocks?

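By default, RocksDB keeps a single index/filter block per SST file. The size of that block depends on the configuration, but for a 256MB SST file the index/filter block is typically 0.5~5MB, far larger than an ordinary data block (4~32KB). This is fine when memory is sufficient: the index/filter of each SST file is read into memory once and never competes with data blocks for cache space. When it has to share the block cache, however, eviction can cause it to be re-read from disk many times.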

What is the big deal with large index/filter blocks?

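When index/filter blocks are stored in the block cache, they compete with data blocks for a very scarce resource. A 5MB index/filter block takes the space of more than 1,000 data blocks (4KB each), which leads to more cache misses for data blocks. The index/filter blocks of different SST files also compete with each other and may evict one another, which increases their own miss rate. The result is that an index/filter block is often used only a few times during its lifetime in the cache before it is evicted.
When an index/filter block misses the cache, it has to be reloaded from disk, and because it is large the IO cost is high. A simple point lookup may need only a couple of data block reads (one for each layer of the LSM), yet it can end up loading several megabytes of index/filter blocks. If this happens often, the disk spends more time serving index/filter blocks than data blocks, which is clearly undesirable.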
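The cost described above is simple arithmetic, and a minimal sketch may make it concrete. The function names below are hypothetical and not part of RocksDB; the second function assumes, purely for illustration, that a lookup misses the cache for one index/filter block at every level it probes.

```cpp
#include <cassert>
#include <cstdint>

// How many data blocks would fit in the cache space taken by one
// index/filter block. Both sizes are in bytes.
std::uint64_t DisplacedDataBlocks(std::uint64_t meta_block_bytes,
                                  std::uint64_t data_block_bytes) {
  assert(data_block_bytes > 0);
  return meta_block_bytes / data_block_bytes;
}

// Upper bound on index/filter bytes read from disk by one point lookup,
// assuming (hypothetically) that the lookup misses the cache for the
// index/filter block of one SST file at each of `levels_probed` levels.
std::uint64_t WorstCaseMetaBytesRead(std::uint64_t meta_block_bytes,
                                     std::uint64_t levels_probed) {
  return meta_block_bytes * levels_probed;
}
```

With a 5MB index/filter block and 4KB data blocks, DisplacedDataBlocks(5 * 1024 * 1024, 4096) evaluates to 1280, which matches the "more than a thousand data blocks" estimate.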

What is partitioned index/filters?

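With partitioning, the index/filter of an SST file is split into multiple small blocks, with a top-level index on top of them. When the index/filter is needed, only the top-level index is loaded into memory. The top-level index is then used to locate the one partition required by the query, and only that partition is loaded into the block cache. The top-level index takes very little memory and can be kept either on the heap or in the block cache, depending on cache_index_and_filter_blocks.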
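The two-step lookup described above can be sketched with ordinary ordered maps. This is a toy model, not RocksDB code: the type names, the use of std::map, and the partition_loads counter (which stands in for loading one partition into the block cache) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One index partition: last key of a data block -> id of that data block.
using IndexPartition = std::map<std::string, int>;

struct ToyPartitionedIndex {
  // Top-level index: last key covered by a partition -> slot in `partitions`.
  std::map<std::string, std::size_t> top_level;
  std::vector<IndexPartition> partitions;
  std::size_t partition_loads = 0;  // counts simulated partition loads

  // Returns the id of the data block that may contain `key`, or -1 if the
  // key is past the last key covered by this index.
  int FindDataBlock(const std::string& key) {
    // Step 1: consult only the small top-level index.
    auto top = top_level.lower_bound(key);
    if (top == top_level.end()) return -1;
    assert(top->second < partitions.size());
    // Step 2: load exactly one partition and search inside it.
    ++partition_loads;
    const IndexPartition& part = partitions[top->second];
    auto entry = part.lower_bound(key);
    return entry == part.end() ? -1 : entry->second;
  }
};
```

Each lookup touches the top-level index plus a single partition, so the amount of index data brought into the cache per miss is bounded by the partition size rather than by the size of the whole index.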

Pros

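Higher cache hit rate: partitioning keeps very large index/filter blocks from taking up scarce cache space. Only the partitions that are needed are loaded into the block cache, as much smaller blocks, so memory is used more effectively.
Less IO: when an index/filter partition misses the cache, only that one partition has to be loaded from disk. Compared with loading the entire index/filter of the SST file, this greatly reduces the load on the disk.
No compromise on index/filters: without partitioning, the alternative ways to shrink the index/filter memory footprint are to use larger blocks or fewer bloom bits. The former causes the cache waste described above, and the latter lowers the accuracy of the Bloom filter by raising its false-positive rate.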

Cons

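Extra space for the top-level index: about 0.1~1% of the index/filter size.
More disk IO: if the top-level index is not already in the cache, one extra IO is required. To avoid this, the top-level index can be kept on the heap or stored in the cache with high priority. (TODO)
Loss of spatial locality: if a workload reads the same SST file frequently and at random, each read may load a different partition, which is clearly less efficient than reading the whole index/filter at once. This rarely shows up in RocksDB benchmarks, but it can happen when accessing the L0/L1 layers of the LSM tree. The index/filter of SST files at these two levels could therefore be left unpartitioned. (TODO)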

Success stories

HDD, 100TB DB

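The DB is 86G and stored on HDD. To emulate the small amount of memory available to a node that holds 100TB of data, the test uses direct IO (bypassing the OS file cache) and a block cache of only 60MB. Partitioning improves throughput by 11x (from 5 op/s to 55 op/s).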

/db_bench --benchmarks="readwhilewriting[X3],stats" \
  --use_direct_reads=1 \
  -compaction_readahead_size 1048576 --use_existing_db --num=2000000000 --duration 600 \
  --cache_size=62914560 -cache_index_and_filter_blocks=false \
  -statistics -histogram -bloom_bits=10 -target_file_size_base=268435456 \
  -block_size=32768 -threads 32 -partition_filters -partition_indexes \
  -index_per_partition 100 -pin_l0_filter_and_index_blocks_in_cache \
  -benchmark_write_rate_limit 204800 \
  -max_bytes_for_level_base 134217728 -cache_high_pri_pool_ratio 0.9

SSD, Linkbench

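The DB is 300G and stored on SSD. To emulate the small amount of memory left when other DBs share the same node, the test uses direct IO (bypassing the OS file cache) with block cache sizes of 6G and 2G. Without partitioning, throughput drops from 38K tps to 23K when the block cache shrinks from 6G to 2G. With partitioning, throughput drops from 38K to only 30K.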

How to use it?

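index_type = IndexType::kTwoLevelIndexSearch
Enables index partitioning.
NewBloomFilterPolicy(BITS, false)
Uses full filters rather than block-based filters.
partition_filters = true
Enables filter partitioning.
metadata_block_size = 4096
Sets the block size of index partitions.
cache_index_and_filter_blocks = false [if you are on <= 5.14]
The partitions are stored in the block cache regardless. This option only controls where the top-level index is kept (heap or block cache); keeping it in the block cache has seen little benchmarking.
cache_index_and_filter_blocks = true and pin_top_level_index_and_filter = true [if you are on >= 5.15]
Stores all index/filter partitions and the top-level index in the block cache, and pins the top-level index there.
cache_index_and_filter_blocks_with_high_priority = true
As the name suggests, index/filter blocks are cached with high priority.
pin_l0_filter_and_index_blocks_in_cache = true
Recommended, because this setting also applies to index/filter partitions.
Use it only when the compaction style is level-based.
Note: pinning blocks in the block cache can push memory use beyond the configured capacity (if strict_capacity_limit is not set).

block cache size: if filter/index blocks used to live on the heap and are now cached in the block cache, remember to increase the block cache size by roughly the amount of memory freed from the heap.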
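Put together, the settings above might look like the following C++ fragment. It is a sketch under assumptions: it targets RocksDB 5.15 or later, the function name MakePartitionedOptions is hypothetical, and the value of 10 bits per key is borrowed from the -bloom_bits=10 flag in the db_bench example rather than recommended by the source. Field names and defaults vary between RocksDB versions, so check the headers of the version in use.

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Hypothetical helper: builds Options with partitioned index/filters enabled.
rocksdb::Options MakePartitionedOptions() {
  rocksdb::BlockBasedTableOptions table_options;

  // Partition the index.
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;

  // Full filters (10 bits per key is an assumed value), partitioned.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  table_options.partition_filters = true;

  // Target size of each index partition, in bytes.
  table_options.metadata_block_size = 4096;

  // Keep partitions and top-level indexes in the block cache (>= 5.15).
  table_options.cache_index_and_filter_blocks = true;
  table_options.pin_top_level_index_and_filter = true;
  table_options.cache_index_and_filter_blocks_with_high_priority = true;

  // Only meaningful with level-based compaction.
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```

The returned Options can then be passed to rocksdb::DB::Open as usual. The block cache itself is left at its default here; sizing it is a separate decision, as the note on block cache size explains.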

Current limitations

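Filters cannot be partitioned unless the index is partitioned as well.
The number of filter partitions and index partitions must be the same. In other words, whenever an index block is cut, the filter block is cut too.
The size of a filter block is determined by when the index block is cut. RocksDB will soon use metadata_block_size to enforce a maximum size for both filter and index blocks. In other words, a filter block will be cut in either of two cases: 1) the index block is cut, so the filter block is cut to keep the same number of partitions; 2) the filter block grows beyond metadata_block_size.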

Under the hood

1 BlockBasedTable Format

After partitioning, the stored index block changes from

[index block]

to:

[index block - partition 1]
[index block - partition 2]
...
[index block - partition N]
[index block - top-level index]

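The top-level index block comes last, and it is itself an index over the partition blocks. Each index partition is stored in kBinarySearch format, and so is the top-level index, so both can be read with the ordinary data block reader.
Filter blocks are partitioned with the same layout. Each filter partition is stored in kFullFilter format. The top-level index is stored in kBinarySearch format, the same as for index blocks.
With partitioning enabled, the SST inspection tool sst_dump no longer reports the total size of the index/filter blocks; it reports the size of the index/filter top-level index instead.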

2 Builder

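Partitioned indexes and partitioned filters are built by PartitionedIndexBuilder and PartitionedFilterBlockBuilder respectively.
PartitionedIndexBuilder holds a pointer (sub_index_builder_) to a ShortenedIndexBuilder, which builds the current index partition. When flush_policy_ decides that the partition should be cut, PartitionedIndexBuilder saves this pointer together with the last key in the index block and creates a new ShortenedIndexBuilder. When ::Finish is called on PartitionedIndexBuilder, it calls ::Finish on the earliest sub index builder and returns the resulting partition block. The next call to PartitionedIndexBuilder::Finish carries the offset of the previously returned partition, which is used as the value in the top-level index. The last call to PartitionedIndexBuilder::Finish completes the top-level index and returns it. The top-level index is then stored in the SST file, and its offset is used as the offset of the index block.
PartitionedFilterBlockBuilder inherits from FullFilterBlockBuilder, which has a FilterBitsBuilder for building bloom filters. PartitionedFilterBlockBuilder also holds a pointer to PartitionedIndexBuilder and calls its ShouldCutFilterBlock function to decide whether the filter block should be cut. To cut a filter block, it first finishes the FilterBitsBuilder, stores the resulting block together with a partition key produced by PartitionedIndexBuilder::GetPartitionKey(), and then resets the FilterBitsBuilder for the next partition. Finally, each call to PartitionedFilterBlockBuilder::Finish returns one partition, and the offset of the previous partition is used to build the top-level index. The last call to ::Finish returns the top-level index block.
PartitionedFilterBlockBuilder depends on PartitionedIndexBuilder in order to enable an optimization of how index/filter partitions are laid out in the SST file. If that optimization is not pursued, a later improvement may remove this dependency.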
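The repeated-Finish protocol described for PartitionedIndexBuilder can be modeled with a small toy class. This is a sketch under assumptions, not the RocksDB implementation: the class name, the string "handles", and the fixed entries-per-partition rule (standing in for flush_policy_) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// A finished block is modeled as a list of (key, value) entries.
using Block = std::vector<std::pair<std::string, std::string>>;

class ToyPartitionedIndexBuilder {
 public:
  explicit ToyPartitionedIndexBuilder(std::size_t entries_per_partition)
      : entries_per_partition_(entries_per_partition) {
    assert(entries_per_partition_ > 0);
  }

  // Adds one entry: last key of a data block and that block's handle.
  void AddIndexEntry(const std::string& last_key,
                     const std::string& block_handle) {
    current_.emplace_back(last_key, block_handle);
    if (current_.size() >= entries_per_partition_) CutPartition();
  }

  // Returns one block per call. `last_partition_handle` says where the
  // caller stored the block returned by the previous call (ignored on the
  // first call). *done becomes true when the returned block is the
  // top-level index.
  Block Finish(const std::string& last_partition_handle, bool* done) {
    if (!current_.empty()) CutPartition();  // seal the open partition
    if (awaiting_handle_) {
      top_level_.emplace_back(pending_key_, last_partition_handle);
      awaiting_handle_ = false;
    }
    if (pending_.empty()) {
      *done = true;
      return top_level_;
    }
    Block next = std::move(pending_.front());
    pending_.pop_front();
    pending_key_ = next.back().first;  // last key covered by this partition
    awaiting_handle_ = true;
    *done = false;
    return next;
  }

 private:
  void CutPartition() {
    pending_.push_back(std::move(current_));
    current_.clear();
  }

  std::size_t entries_per_partition_;
  Block current_;              // partition being built
  std::deque<Block> pending_;  // sealed partitions, earliest first
  Block top_level_;            // last key of partition -> partition handle
  std::string pending_key_;
  bool awaiting_handle_ = false;
};
```

A caller would keep invoking Finish, writing each returned block and passing its location into the next call, until done is set. At that point the returned block is the top-level index, whose entries map the last key of every partition to the location the caller reported for it.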

3 Reader

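PartitionIndexReader reads partitioned indexes by operating on the top-level index block. When NewIterator is called, it returns a TwoLevelIterator over the top-level index. This simple implementation works because every index partition is in kBinarySearch format, the same format as a data block, so it can easily be plugged in as the lower-level iterator. PartitionedFilterBlockReader uses the top-level index to find the offset of the filter partition, calls GetFilter() on the BlockBasedTable object to load the FilterBlockReader object for that partition, and then releases the FilterBlockReader object.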


RocksDB Series, Part 14: Partitioned Index Filters
