This article explains how two-phase commit (2PC) is implemented in RocksDB. It covers the following points:

Modification of the WAL format
Extension of the existing transaction API
Modification of the write path
Modification of the recovery path
Integration with MyRocks

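1. Modification of WAL Format

The WAL consists of one or more log files, and the content of each log is a sequence of serialized WriteBatches. During recovery, the WriteBatches can be rebuilt from the logs. To change the WAL format or extend its functionality, we therefore only need to look at WriteBatch.
A WriteBatch is an ordered collection of records. The records are mainly Put(k,v), Merge(k,v), Delete(k), and SingleDelete(k), each representing one kind of RocksDB write operation. Every record has a binary string representation. When records are added to a WriteBatch, their binary representation is appended to the binary string of the WriteBatch. That string is prefixed with the starting sequence number of the batch and the number of records it contains. Each record may be preceded by a column family modifier record, which can be omitted when the column family is the default one.
A WriteBatch can be iterated over, and its records acted upon, by extending WriteBatch::Handler. MemTableInserter is one such extension of WriteBatch::Handler; it inserts the operations of a WriteBatch into the MemTable of the corresponding column family.
The logical form of a WriteBatch might look like this: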

Sequence(0);NumRecords(3);Put(a,1);Merge(a,1);Delete(a);
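
To make the handler idea concrete, here is a minimal sketch of a batch that replays its records against a visitor. ToyWriteBatch, ToyHandler, PrintingHandler, and LogicalForm are hypothetical stand-ins invented for this illustration; they are not the RocksDB classes and ignore the binary encoding and column families.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; not the RocksDB classes.
enum class RecordType { kPut, kMerge, kDelete };

struct Record {
  RecordType type;
  std::string key;
  std::string value;  // unused for kDelete
};

// Visitor interface in the spirit of WriteBatch::Handler.
struct ToyHandler {
  virtual ~ToyHandler() = default;
  virtual void Put(const std::string& key, const std::string& value) = 0;
  virtual void Merge(const std::string& key, const std::string& value) = 0;
  virtual void Delete(const std::string& key) = 0;
};

struct ToyWriteBatch {
  uint64_t sequence = 0;        // starting sequence number of the batch
  std::vector<Record> records;  // ordered collection of records

  // Replays the records, in insertion order, against the handler.
  void Iterate(ToyHandler* handler) const {
    for (const Record& r : records) {
      switch (r.type) {
        case RecordType::kPut:    handler->Put(r.key, r.value);   break;
        case RecordType::kMerge:  handler->Merge(r.key, r.value); break;
        case RecordType::kDelete: handler->Delete(r.key);         break;
      }
    }
  }
};

// A handler that renders each record in the logical notation used in the text.
struct PrintingHandler : ToyHandler {
  std::string out;
  void Put(const std::string& k, const std::string& v) override {
    out += "Put(" + k + "," + v + ");";
  }
  void Merge(const std::string& k, const std::string& v) override {
    out += "Merge(" + k + "," + v + ");";
  }
  void Delete(const std::string& k) override { out += "Delete(" + k + ");"; }
};

// Builds the logical form: header (sequence, record count) followed by records.
std::string LogicalForm(const ToyWriteBatch& batch) {
  PrintingHandler handler;
  batch.Iterate(&handler);
  return "Sequence(" + std::to_string(batch.sequence) + ");NumRecords(" +
         std::to_string(batch.records.size()) + ");" + handler.out;
}
```

A handler that writes into a per-column-family map instead of a string would play the role that MemTableInserter plays in the text.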

The 2PC WriteBatch format adds four more records:

Prepare(xid)
EndPrepare()
Commit(xid)
Rollback(xid)

A 2PC-capable WriteBatch might logically look like this:

Sequence(0);NumRecords(6);Prepare(foo);Put(a,b);Put(x,y);EndPrepare();Put(j,k);Commit(foo);

The records between Prepare(foo) and EndPrepare() are the operations of the transaction with ID 'foo'. Commit(foo) commits that transaction, and Rollback(foo) rolls it back.
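
As an illustration of how the markers delimit a transaction, the sketch below splits a logical record stream into prepared sections (keyed by xid), plain operations, and commit/rollback markers. The types Kind, Rec, Split and the function SplitBatch are hypothetical names for this example only, and only Put is modeled as a data operation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record model for illustration; not the RocksDB encoding.
enum class Kind { kPrepare, kEndPrepare, kCommit, kRollback, kPut };

struct Rec {
  Kind kind;
  std::string a;  // xid for Prepare/Commit/Rollback, key for Put
  std::string b;  // value for Put, otherwise unused
};

struct Split {
  std::map<std::string, std::vector<Rec>> prepared;  // xid -> ops in its section
  std::vector<Rec> plain;                            // ops outside any section
  std::vector<std::string> committed;                // xids with a Commit marker
  std::vector<std::string> rolled_back;              // xids with a Rollback marker
};

// Walks the records once; ops between Prepare(xid) and EndPrepare() belong
// to that xid, everything else is an ordinary write.
Split SplitBatch(const std::vector<Rec>& records) {
  Split s;
  bool in_section = false;
  std::string current_xid;
  for (const Rec& r : records) {
    switch (r.kind) {
      case Kind::kPrepare:
        in_section = true;
        current_xid = r.a;
        s.prepared[current_xid];  // create an (initially empty) section
        break;
      case Kind::kEndPrepare:
        in_section = false;
        break;
      case Kind::kCommit:
        s.committed.push_back(r.a);
        break;
      case Kind::kRollback:
        s.rolled_back.push_back(r.a);
        break;
      case Kind::kPut:
        if (in_section) {
          s.prepared[current_xid].push_back(r);
        } else {
          s.plain.push_back(r);
        }
        break;
    }
  }
  return s;
}
```

Applied to the example batch above, Put(a,b) and Put(x,y) land in the section for 'foo', while Put(j,k) is treated as an ordinary write.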

Sequence ID Distribution

When a WriteBatch is inserted into the memtable through MemTableInserter, the sequence ID of each operation is the sequence ID of the WriteBatch plus the index of that operation within the batch. A 2PC WriteBatch does not keep this mapping. Operations contained within a Prepare() enclosure consume sequence IDs as if they were inserted starting at the location of their corresponding Commit() marker. This Commit() marker may be in a different WriteBatch or log from the prepared operations to which it applies.

Backwards Compatibility

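The WAL format is not versioned, so backward compatibility needs attention. The current version of RocksDB (one without 2PC support) cannot recover from a WAL file that contains 2PC markers; during recovery it reports a fatal error when it meets a record it does not recognize. This is inconvenient, but developers can patch the current RocksDB version so that it skips prepared sections and unrecognized markers, which lets it recover data from the new WAL format.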

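2. Extension of Transaction API

For now we focus only on 2PC for pessimistic transactions. The client must declare up front whether it intends to use two-phase commit, as in the following code: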

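TransactionDB* db;
TransactionDB::Open(Options(), TransactionDBOptions(), "foodb", &db);

WriteOptions write_options;
TransactionOptions txn_options;
txn_options.two_phase_commit = true;  // declare 2PC before the transaction begins
txn_options.xid = "12345";
Transaction* txn = db->BeginTransaction(write_options, txn_options);

txn->Put(...);
txn->Prepare();  // phase one: the prepared section is written to the WAL
txn->Commit();   // phase two: the commit marker is written and the memtable updated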

A transaction can be in one of the following states:

enum ExecutionStatus {
  STARTED = 0,
  AWAITING_PREPARE = 1,
  PREPARED = 2,
  AWAITING_COMMIT = 3,
  COMMITED = 4,
  AWAITING_ROLLBACK = 5,
  ROLLEDBACK = 6,
  LOCKS_STOLEN = 7,
};

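The transaction API gains a Prepare() function. Prepare() calls WriteImpl with a context through which WriteImpl and the WriteThread can access the ExecutionStatus, the XID, and the WriteBatch. WriteImpl first writes a Prepare(xid) marker, then the content of the WriteBatch, then an EndPrepare() marker. Nothing is inserted into the memtable during this phase. When the transaction commits, WriteImpl is called again. This time a Commit() marker is written to the WAL and the content of the WriteBatch is inserted into the appropriate memtables. When Rollback() is called, the content of the transaction is cleared, and WriteImpl is called to write a Rollback(xid) marker if the transaction is in the prepared state.
These so-called "meta markers" (Prepare(xid), EndPrepare(), Commit(xid), Rollback(xid)) are never inserted directly into the write batch of the transaction. The write path (WriteImpl()) holds the context of the transaction being written and uses it to emit the relevant markers to the WAL, so the markers are added to the aggregated WriteBatch just before it is written to the WAL. During recovery, MemTableInserter uses these markers to rebuild the prepared transactions.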
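
The prepare/commit/rollback flow can be modeled in a few lines. The sketch below is a toy under stated assumptions: ToyTwoPhaseDB is a hypothetical class, the "WAL" is just a list of logical record strings, the "memtable" is a std::map, and there is no locking, write grouping, or durability.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model for illustration; not the RocksDB implementation.
class ToyTwoPhaseDB {
 public:
  using Batch = std::vector<std::pair<std::string, std::string>>;

  // Prepare: log the batch between Prepare(xid) and EndPrepare().
  // The memtable is deliberately left untouched.
  void Prepare(const std::string& xid, const Batch& batch) {
    wal_.push_back("Prepare(" + xid + ")");
    for (const auto& kv : batch) {
      wal_.push_back("Put(" + kv.first + "," + kv.second + ")");
    }
    wal_.push_back("EndPrepare()");
    prepared_[xid] = batch;
  }

  // Commit: log the Commit(xid) marker, then apply the batch to the memtable.
  // Returns false if no prepared transaction with this xid exists.
  bool Commit(const std::string& xid) {
    auto it = prepared_.find(xid);
    if (it == prepared_.end()) return false;
    wal_.push_back("Commit(" + xid + ")");
    for (const auto& kv : it->second) memtable_[kv.first] = kv.second;
    prepared_.erase(it);
    return true;
  }

  // Rollback: drop the batch; a marker is logged only if the transaction
  // had actually been prepared.
  void Rollback(const std::string& xid) {
    if (prepared_.erase(xid) > 0) wal_.push_back("Rollback(" + xid + ")");
  }

  const std::vector<std::string>& wal() const { return wal_; }
  const std::map<std::string, std::string>& memtable() const { return memtable_; }

 private:
  std::vector<std::string> wal_;
  std::map<std::string, std::string> memtable_;
  std::map<std::string, Batch> prepared_;
};
```

After Prepare("foo", ...) the WAL holds the prepared section but the memtable is still empty; only Commit("foo") makes the writes visible.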

Transaction Wallclock Expiration

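When a transaction commits, a callback fails the entire write if the transaction has expired. Once a transaction has expired, its locks can readily be taken over by other transactions. If a transaction has not expired by the time it is prepared, it cannot expire at commit time either.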

TransactionDB Modification

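Before using transactions, the client must open a TransactionDB. That TransactionDB instance is then used to create Transactions. The TransactionDB keeps a mapping from XID to every two-phase Transaction it has created. When a Transaction is deleted or rolled back, it is removed from the mapping. RocksDB provides an API to query all in-flight transactions that are in the prepared state.
The TransactionDB also maintains a min heap of the numbers of all logs that contain a prepared section. When a transaction is prepared, its WriteBatch is written to a log; that log number is stored in the transaction object and then pushed onto the min heap. When the transaction commits, the log number is removed from the min heap, but it is not simply forgotten. From then on, each memtable the commit was inserted into keeps track of the oldest log it still needs, until that memtable has been flushed to L0.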
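
A rough sketch of this bookkeeping follows. PreparedLogTracker and its method names are hypothetical. For simplicity it uses std::multiset in place of a binary min heap, since that gives ordered access to the minimum and direct removal of an arbitrary log number; it is a swapped-in data structure, not what the text describes literally.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical helper for illustration; std::multiset stands in for the
// min heap described in the text.
class PreparedLogTracker {
 public:
  // Called when a transaction's prepared section is written to `log`.
  void AddPrepared(uint64_t log) { logs_.insert(log); }

  // Called when a transaction prepared in `log` commits (or rolls back).
  // Only one occurrence is removed: other transactions may share the log.
  void Remove(uint64_t log) {
    auto it = logs_.find(log);
    if (it != logs_.end()) logs_.erase(it);
  }

  // True when at least one prepared, uncommitted section is outstanding.
  bool HasOutstanding() const { return !logs_.empty(); }

  // Smallest log number holding an outstanding prepared section.
  // Precondition: HasOutstanding() is true.
  uint64_t MinLog() const { return *logs_.begin(); }

 private:
  std::multiset<uint64_t> logs_;
};
```

Two transactions prepared in the same log each contribute an entry, so the log stays pinned until both have been removed.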

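3. Modification of the Write Path

The write path can be broken down into two main parts: DBImpl::WriteImpl(...) and the MemTableInserter. Multiple client threads call WriteImpl. The first thread is assigned the role of leader and the remaining threads become followers. The leader and its followers are grouped into a logical write group. The leader takes all WriteBatches in the write group, concatenates them, and writes the resulting blob to the WAL. Depending on the size of the write group and on whether the current memtable supports concurrent writes, either the leader inserts all WriteBatches into the memtable, or each thread inserts the WriteBatch it is responsible for.
All memtable inserts are handled by MemTableInserter, a WriteBatch iterator handler that implements WriteBatch::Handler. The handler iterates over all elements in a WriteBatch (Put, Delete, Merge) and applies each call to the corresponding MemTable. MemTableInserter also handles in-place merges, deletes, and updates.
The modification to the write path passes one extra parameter to DBImpl::WriteImpl: a pointer to the 2PC transaction instance, through which the current state of the two-phase transaction can be queried. A 2PC transaction calls WriteImpl once each for preparation, commit, and rollback.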

Status DBImpl::WriteImpl(
  const WriteOptions& write_options,
  WriteBatch* my_batch,
  WriteCallback* callback,
  Transaction* txn
) {
  WriteThread::Writer w;
  // ...
  w.txn = txn;  // write threads also carry the txn context for the memtable insert

  // ... (joining the write group is elided)
  // we are now the group leader
  int total_count = 0;
  uint64_t total_byte_size = 0;
  for (auto writer : write_group) {
    if (writer->CheckCallback(this)) {
      // only batches that will be inserted into the memtable are counted
      if (writer->ShouldWriteToMem()) {
        total_count += WriteBatchInternal::Count(writer->batch);
      }
    }
  }
  const SequenceNumber current_sequence = last_sequence + 1;
  last_sequence += total_count;

  // now we produce the WAL entry from our write group
  for (auto writer : write_group) {
    // currently only optimistic transactions use callbacks
    // and optimistic transactions do not support 2pc
    if (writer->CallbackFailed()) {
      continue;
    } else if (writer->IsCommitPhase()) {
      WriteBatchInternal::MarkCommit(merged_batch, writer->txn->XID_);
    } else if (writer->IsRollbackPhase()) {
      WriteBatchInternal::MarkRollback(merged_batch, writer->txn->XID_);
    } else if (writer->IsPreparePhase()) {
      // wrap the batch in Prepare(xid) ... EndPrepare() and remember its log
      WriteBatchInternal::MarkBeginPrepare(merged_batch, writer->txn->XID_);
      WriteBatchInternal::Append(merged_batch, writer->batch);
      WriteBatchInternal::MarkEndPrepare(merged_batch);
      writer->txn->log_number_ = logfile_number_;
    } else {
      // ordinary (non-2PC) write
      assert(writer->ShouldWriteToMem());
      WriteBatchInternal::Append(merged_batch, writer->batch);
    }
  }
  // now do MemTable inserts for the write group
}

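WriteBatchInternal::InsertInto can then be adjusted to iterate only over writers that have no associated Transaction or whose Transaction is in the COMMIT state. As the code above shows, a transaction records the log number at the moment it is prepared. At insert time, each MemTable keeps track of the smallest such log number among the prepared data inserted into it.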

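4. Modification of Recovery Path

The current recovery path is already well suited to two-phase commit. It iterates over all batches in the logs in order, feeding them to MemTableInserter in log-number order. MemTableInserter iterates over each batch and inserts the values into the correct MemTable. Based on its current log number, each MemTable knows which values to ignore.
To handle 2PC operations during recovery, MemTableInserter has to be extended so that it is aware of the four new meta markers.
Keep in mind that a committed 2PC transaction contains insertions into multiple column families (CFs), and the MemTables of those CFs are flushed at different points in time. We can still use the log number of each CF to avoid duplicate inserts for a recovered, two-phase, committed transaction. Consider the following scenario:
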
1. Two-phase transaction TXN inserts into CFA and CFB
2. TXN is prepared to LOG 1
3. TXN is marked as COMMITTED in LOG 2
4. TXN is inserted into the MemTables
5. CFA is flushed to L0
6. CFA log_number is now LOG 3
7. CFB has not been flushed and is still referencing the LOG 1 prep section
8. CRASH RECOVERY
9. LOG 1 is still around because CFB was referencing the LOG 1 prep section
10. Iterate over logs starting at LOG 1
11. CFB has its prepared values reinserted into mem, again referencing the LOG 1 prep section
12. CFA skips the insertion from the commit marker in LOG 2 because it is consistent to LOG 3
13. CFB is flushed to L0 and is now consistent to LOG 3
14. LOG 1 and LOG 2 can now be released
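
The skip decision in step 12 boils down to a comparison of log numbers. The one-function sketch below is a simplification with a hypothetical name (ShouldReplayIntoColumnFamily); it assumes that a column family whose log_number is N has already persisted everything written by logs numbered below N, and that the log in which the commit marker appears decides whether the insert is replayed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration. `cf_log_number` is the column
// family's current log_number; `commit_log_number` is the log in which the
// Commit(xid) marker was found during recovery.
bool ShouldReplayIntoColumnFamily(uint64_t cf_log_number,
                                  uint64_t commit_log_number) {
  // Data committed in a log older than the CF's log_number was already
  // flushed to L0, so replaying it would insert it a second time.
  return commit_log_number >= cf_log_number;
}
```

In the scenario, CFA has log_number 3 and therefore skips the commit found in LOG 2, while CFB, which has not been flushed since LOG 1, replays it.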

Rebuilding Transactions

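As described above, modifying the recovery path only requires changing MemTableInserter so that it can handle the new meta markers. During recovery we have no access to a TransactionDB instance, so we must rebuild hollow 'shill' transactions instead. In effect this is a mapping of XID → (WriteBatch, log_number) for all recovered prepared transactions. When a Commit(xid) marker is encountered, we look up the shill transaction for that xid and insert its batch into the MemTable. When a Rollback(xid) marker is encountered, we delete the shill transaction. At the end of recovery, what remains is the set of all transactions still in the prepared state, in shill form.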
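
The rebuilding step can be sketched as a single pass over the logged records. Everything here (MarkerKind, LogRecord, ShillTxn, RecoveryResult, Recover) is a hypothetical model for illustration: the memtable is a std::map, only Put is modeled, and column families and per-CF log numbers are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record model for illustration; not the RocksDB WAL encoding.
enum class MarkerKind { kPrepare, kEndPrepare, kCommit, kRollback, kPut };

struct LogRecord {
  uint64_t log_number;  // log file the record was read from
  MarkerKind kind;
  std::string a;        // xid for markers, key for Put
  std::string b;        // value for Put, otherwise unused
};

// Hollow stand-in for a prepared transaction: XID -> (WriteBatch, log_number).
struct ShillTxn {
  std::vector<std::pair<std::string, std::string>> batch;
  uint64_t log_number = 0;
};

struct RecoveryResult {
  std::map<std::string, std::string> memtable;
  std::map<std::string, ShillTxn> prepared;  // still prepared after recovery
};

RecoveryResult Recover(const std::vector<LogRecord>& records) {
  RecoveryResult result;
  bool in_section = false;
  std::string xid;
  for (const LogRecord& r : records) {
    switch (r.kind) {
      case MarkerKind::kPrepare:
        in_section = true;
        xid = r.a;
        result.prepared[xid] = ShillTxn();
        result.prepared[xid].log_number = r.log_number;
        break;
      case MarkerKind::kEndPrepare:
        in_section = false;
        break;
      case MarkerKind::kPut:
        if (in_section) {
          result.prepared[xid].batch.emplace_back(r.a, r.b);  // buffer only
        } else {
          result.memtable[r.a] = r.b;  // ordinary write, applied directly
        }
        break;
      case MarkerKind::kCommit: {
        auto it = result.prepared.find(r.a);
        if (it != result.prepared.end()) {
          for (const auto& kv : it->second.batch) {
            result.memtable[kv.first] = kv.second;
          }
          result.prepared.erase(it);
        }
        break;
      }
      case MarkerKind::kRollback:
        result.prepared.erase(r.a);
        break;
    }
  }
  return result;
}
```

Whatever is left in `prepared` when the pass ends corresponds to transactions that were prepared but neither committed nor rolled back before the crash.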

log lifespan

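To find the oldest log that must be kept, we first find the minimum log number of each CF. We must also consider the minimum value in the prepared sections heap of the TransactionDB, which represents the earliest log containing a prepared section that has not yet been committed. Finally, we must consider the minimum prep section log referenced by all MemTables and ImmutableMemTables that have not been flushed. The minimum of these three values is the earliest log that still contains data not yet flushed to L0.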
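
The computation is a minimum over three sets of log numbers, as in the sketch below. MinLogToKeep, MinOf, and kNoLog are hypothetical names; the inputs are assumed to have been collected elsewhere, and an empty set is treated as imposing no constraint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel meaning "this source pins no log". Hypothetical, for illustration.
constexpr uint64_t kNoLog = std::numeric_limits<uint64_t>::max();

// Minimum of a possibly empty list of log numbers.
uint64_t MinOf(const std::vector<uint64_t>& logs) {
  return logs.empty() ? kNoLog : *std::min_element(logs.begin(), logs.end());
}

// Earliest log that still holds data not yet flushed to L0:
//  - cf_log_numbers:        log number of every column family
//  - outstanding_prep_logs: logs with prepared but uncommitted sections
//  - memtable_prep_logs:    prep section logs referenced by unflushed
//                           MemTables and ImmutableMemTables
uint64_t MinLogToKeep(const std::vector<uint64_t>& cf_log_numbers,
                      const std::vector<uint64_t>& outstanding_prep_logs,
                      const std::vector<uint64_t>& memtable_prep_logs) {
  return std::min({MinOf(cf_log_numbers), MinOf(outstanding_prep_logs),
                   MinOf(memtable_prep_logs)});
}
```

Logs numbered below the returned value could then be released, which matches the last step of the recovery scenario once CFB has been flushed.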

RocksDB Series 18: Two-Phase Commit
