Btcd Block Storage and Retrieval: ffldb

Having introduced BoltDB, we return to the source code of btcd/database. With BoltDB's implementation understood, the interface definitions in btcd/database and the way they are called become much easier to follow. Note, however, that the database package does not itself implement a database: it is btcd's storage framework, allowing btcd to support multiple databases, and ffldb is the default database the package provides. After cloning the code, you will find that the database package mainly contains:

  • cmd/dbtool: implements a tool for reading and writing blocks in a db file;
  • ffldb: implements the default database driver, modeled on BoltDB, providing DB, Bucket, Tx, and so on;
  • internal/treap: a treap implementation, used to cache metadata;
  • testdata: db files used for testing;
  • driver.go: defines the Driver type and the methods for registering drivers and opening databases;
  • interface.go: defines the DB, Bucket, Tx, Cursor, and other interfaces, which are almost identical to their BoltDB counterparts;
  • error.go: defines the error codes of the database package and their message strings;
  • doc.go: the package documentation for database;
  • driver_test.go, error_test.go, example_test.go, export_test.go: the corresponding test files.

It should be noted that ffldb is not a database in the strict sense: it uses leveldb to store metadata and flat files to store blocks. For metadata storage, ffldb follows BoltDB's design, supporting Buckets and nested child Buckets; for reading and writing blocks and metadata, it likewise implements a similar Transaction. In particular, when storing metadata through leveldb, ffldb adds a cache layer to improve read and write efficiency.

Let's first look at the definition of the DB interface in the database package:

//btcd/database/interface.go

type DB interface {
    // Type returns the database driver type the current database instance
    // was created with.
    Type() string

    ......
    Begin(writable bool) (Tx, error)

    ......
    View(fn func(tx Tx) error) error

    ......
    Update(fn func(tx Tx) error) error

    ......
    Close() error
}

As you can see, the interface definition is almost identical to BoltDB's. Indeed, the Bucket and Cursor interfaces also mirror their BoltDB counterparts; the Tx interface differs, since it adds operations on metadata and blocks:

//btcd/database/interface.go

// Tx represents a database transaction.  It can either by read-only or
// read-write.  The transaction provides a metadata bucket against which all
// read and writes occur.
//
// As would be expected with a transaction, no changes will be saved to the
// database until it has been committed.  The transaction will only provide a
// view of the database at the time it was created.  Transactions should not be
// long running operations.
type Tx interface {
    // Metadata returns the top-most bucket for all metadata storage.
    Metadata() Bucket

    ......
    StoreBlock(block *btcutil.Block) error

    ......
    HasBlock(hash *chainhash.Hash) (bool, error)

    ......
    HasBlocks(hashes []chainhash.Hash) ([]bool, error)

    ......
    FetchBlockHeader(hash *chainhash.Hash) ([]byte, error)

    ......
    FetchBlockHeaders(hashes []chainhash.Hash) ([][]byte, error)

    ......
    FetchBlock(hash *chainhash.Hash) ([]byte, error)

    ......
    FetchBlocks(hashes []chainhash.Hash) ([][]byte, error)

    ......
    FetchBlockRegion(region *BlockRegion) ([]byte, error)

    ......
    FetchBlockRegions(regions []BlockRegion) ([][]byte, error)

    // ******************************************************************
    // Methods related to both atomic metadata storage and block storage.
    // ******************************************************************

    ......
    Commit() error

    ......
    Rollback() error
}

For reasons of space we have omitted the comment on each method; readers can find them in the source file. The definition of Tx shows that it provides three groups of methods:

  1. Metadata(), which returns the root Bucket. All metadata belongs to some Bucket, and Buckets and the K/V pairs inside them are ultimately stored in leveldb. Within a Transaction, metadata is always manipulated by first obtaining a Bucket from Metadata() and then operating inside that Bucket;
  2. XxxBlockXxx, the block-related methods, which read and write blocks chiefly by reading and writing files;
  3. Commit() and Rollback(): after writing metadata or blocks in a writable Tx, the changes must either be committed, and the Tx closed, with Commit(), or discarded with Rollback(), which is also how a read-only Tx is closed; their roles are the same as in BoltDB. A usage sketch follows this list.
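
To make the calling pattern concrete before diving into the implementation, here is a minimal usage sketch modeled on the database package's documented examples. The import paths assume the classic btcsuite layout, the database path is a placeholder, and error handling is kept minimal:

package main

import (
    "fmt"

    "github.com/btcsuite/btcd/chaincfg"
    "github.com/btcsuite/btcd/database"
    _ "github.com/btcsuite/btcd/database/ffldb" // registers the "ffldb" driver
    "github.com/btcsuite/btcd/wire"
    "github.com/btcsuite/btcutil"
)

func main() {
    // Create a database backed by the ffldb driver (placeholder path;
    // Create fails if it already exists, in which case use database.Open).
    db, err := database.Create("ffldb", "/tmp/exampledb", wire.MainNet)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer db.Close()

    // Metadata and blocks are both written through a managed read-write Tx.
    err = db.Update(func(tx database.Tx) error {
        bucket, err := tx.Metadata().CreateBucketIfNotExists([]byte("mybucket"))
        if err != nil {
            return err
        }
        if err := bucket.Put([]byte("mykey"), []byte("myvalue")); err != nil {
            return err
        }
        genesis := btcutil.NewBlock(chaincfg.MainNetParams.GenesisBlock)
        return tx.StoreBlock(genesis)
    })
    if err != nil {
        fmt.Println(err)
        return
    }

    // Reads go through a managed read-only Tx.
    err = db.View(func(tx database.Tx) error {
        v := tx.Metadata().Bucket([]byte("mybucket")).Get([]byte("mykey"))
        fmt.Printf("mykey = %s\n", v)
        return nil
    })
    if err != nil {
        fmt.Println(err)
    }
}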

ffldb implements each of the interfaces above, and we now focus on its code, starting with the definition of its db type:

//btcd/database/ffldb/db.go

// db represents a collection of namespaces which are persisted and implements
// the database.DB interface.  All database access is performed through
// transactions which are obtained through the specific Namespace.
type db struct {
    writeLock sync.Mutex   // Limit to one write transaction at a time.
    closeLock sync.RWMutex // Make database close block while txns active.
    closed    bool         // Is the database closed?
    store     *blockStore  // Handles read/writing blocks to flat files.
    cache     *dbCache     // Cache layer which wraps underlying leveldb DB.
}

The fields are:

  • writeLock: a mutex guaranteeing that there is only one writable transaction at a time;
  • closeLock: ensures that when the database is closed, all transactions that were opened have finished;
  • closed: indicates whether the database has been closed;
  • store: points to the blockStore, used to read and write blocks;
  • cache: points to the dbCache, used to read and write metadata.

db implements the database.DB interface, and its methods work much as in BoltDB: the callback passed to View() or Update() receives the Tx object (or a reference to it), and the Tx methods carry out the actual database operations. We therefore skip db's methods and concentrate on the Tx implementation. ffldb's transaction type, which implements database.Tx, is defined as follows:

//btcd/database/ffldb/db.go

// transaction represents a database transaction.  It can either be read-only or
// read-write and implements the database.Bucket interface.  The transaction
// provides a root bucket against which all read and writes occur.
type transaction struct {
    managed        bool             // Is the transaction managed?
    closed         bool             // Is the transaction closed?
    writable       bool             // Is the transaction writable?
    db             *db              // DB instance the tx was created from.
    snapshot       *dbCacheSnapshot // Underlying snapshot for txns.
    metaBucket     *bucket          // The root metadata bucket.
    blockIdxBucket *bucket          // The block index bucket.

    // Blocks that need to be stored on commit.  The pendingBlocks map is
    // kept to allow quick lookups of pending data by block hash.
    pendingBlocks    map[chainhash.Hash]int
    pendingBlockData []pendingBlock

    // Keys that need to be stored or deleted on commit.
    pendingKeys   *treap.Mutable
    pendingRemove *treap.Mutable

    // Active iterators that need to be notified when the pending keys have
    // been updated so the cursors can properly handle updates to the
    // transaction state.
    activeIterLock sync.RWMutex
    activeIters    []*treap.Iterator
}

The fields are:

  • managed: whether the transaction is managed by db; a managed transaction must not call Commit() or Rollback() on its own;
  • closed: whether the current transaction has finished;
  • writable: whether the current transaction is writable;
  • db: the db object the current transaction is bound to;
  • snapshot: the snapshot of the cached metadata this transaction reads from, taken from the dbCache when the transaction is opened; it is part of the MVCC mechanism for metadata storage, comparable to reading the meta page in BoltDB;
  • metaBucket: the root Bucket for metadata;
  • blockIdxBucket: the Bucket indexing blocks, mapping each block's hash to its storage record; it is the first child Bucket of metaBucket and is used only inside ffldb;
  • pendingBlocks: maps the hash of each block awaiting commit to its position in pendingBlockData;
  • pendingBlockData: the serialized bytes of all blocks awaiting commit, in order;
  • pendingKeys: the metadata keys to be added or updated; note that it points to a treap;
  • pendingRemove: the metadata keys to be deleted; it also points to a treap, and, like pendingKeys, its contents reach leveldb through the dbCache;
  • activeIterLock: the read-write lock protecting activeIters;
  • activeIters: the iterators currently traversing the dbCache within this transaction; when a key is written into the dbCache, treap rotations change the relationships between nodes, so all active iterators must be reset.

We said that transaction provides three main groups of methods; let's look at its Metadata() method first:

//btcd/database/ffldb/db.go

// Metadata returns the top-most bucket for all metadata storage.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) Metadata() database.Bucket {
    return tx.metaBucket
}

As you can see, it simply returns the root Bucket, through which all the remaining operations are carried out. Here is the definition of bucket, which implements database.Bucket:

//btcd/database/ffldb/db.go

// bucket is an internal type used to represent a collection of key/value pairs
// and implements the database.Bucket interface.
type bucket struct {
    tx *transaction
    id [4]byte
}

Note that although ffldb's bucket and BoltDB's Bucket share the same interface definition, the underlying data structures that actually store the K/V pairs are different, so the bucket definition and lookup methods differ substantially. ffldb stores K/V pairs in leveldb, whose underlying data structure is an LSM tree (log-structured merge-tree), while BoltDB uses a B+Tree. ffldb reads and writes K/V pairs through the interfaces leveldb provides, but leveldb has no notion of a Bucket, nor any mechanism for managing keys hierarchically. So how does ffldb implement buckets? CreateBucket() shows the trick:

//btcd/database/ffldb/db.go

// CreateBucket creates and returns a new nested bucket with the given key.
//
// Returns the following errors as required by the interface contract:
//   - ErrBucketExists if the bucket already exists
//   - ErrBucketNameRequired if the key is empty
//   - ErrIncompatibleValue if the key is otherwise invalid for the particular
//     implementation
//   - ErrTxNotWritable if attempted against a read-only transaction
//   - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) CreateBucket(key []byte) (database.Bucket, error) {
    
    ......

    // Ensure bucket does not already exist.
    bidxKey := bucketIndexKey(b.id, key)
    ......

    // Find the appropriate next bucket ID to use for the new bucket.  In
    // the case of the special internal block index, keep the fixed ID.
    var childID [4]byte
    if b.id == metadataBucketID && bytes.Equal(key, blockIdxBucketName) {
        childID = blockIdxBucketID
    } else {
        var err error
        childID, err = b.tx.nextBucketID()
        if err != nil {
            return nil, err
        }
    }

    // Add the new bucket to the bucket index.
    if err := b.tx.putKey(bidxKey, childID[:]); err != nil {
        str := fmt.Sprintf("failed to create bucket with key %q", key)
        return nil, convertErr(str, err)
    }
    return &bucket{tx: b.tx, id: childID}, nil
}

The code above mainly does the following:

  1. Builds the child Bucket's key via bucketIndexKey();
  2. Assigns or picks an id for the child Bucket;
  3. Stores the child Bucket's key and id as a K/V record in the parent Bucket, which is similar to what BoltDB does.

Unlike BoltDB, which marks a Bucket with a flag on the K/V pair, ffldb marks a Bucket through the format of its Key:

//btcd/database/ffldb/db.go

// bucketIndexKey returns the actual key to use for storing and retrieving a
// child bucket in the bucket index.  This is required because additional
// information is needed to distinguish nested buckets with the same name.
func bucketIndexKey(parentID [4]byte, key []byte) []byte {
    // The serialized bucket index key format is:
    //   <bucketindexprefix><parentbucketid><bucketname>
    indexKey := make([]byte, len(bucketIndexPrefix)+4+len(key))
    copy(indexKey, bucketIndexPrefix)
    copy(indexKey[len(bucketIndexPrefix):], parentID[:])
    copy(indexKey[len(bucketIndexPrefix)+4:], key)
    return indexKey
}

As you can see, a child Bucket's key always takes the form "<bucketindexprefix><parentbucketid><bucketname>"; conversely, if a key inside a Bucket has this form, it denotes a child Bucket, and its value records the child Bucket's id. In other words, ffldb marks the parent-child relationship through the hierarchical form of Bucket keys. In BoltDB, however, a child Bucket corresponds to an independent B+Tree, and adding a K/V pair to a child Bucket means inserting a record into that B+Tree. So how does ffldb add a K/V pair to a child Bucket, or, put differently, how does it determine which Bucket a K/V pair belongs to? Let's look at bucket's Put() method:

//btcd/database/ffldb/db.go

// Put saves the specified key/value pair to the bucket.  Keys that do not
// already exist are added and keys that already exist are overwritten.
//
// Returns the following errors as required by the interface contract:
//   - ErrKeyRequired if the key is empty
//   - ErrIncompatibleValue if the key is the same as an existing bucket
//   - ErrTxNotWritable if attempted against a read-only transaction
//   - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) Put(key, value []byte) error {

    ......

    return b.tx.putKey(bucketizedKey(b.id, key), value)
}

Once again the trick lies in the key: when a record is added to a Bucket, the key is first processed by bucketizedKey():

//btcd/database/ffldb/db.go

// bucketizedKey returns the actual key to use for storing and retrieving a key
// for the provided bucket ID.  This is required because bucketizing is handled
// through the use of a unique prefix per bucket.
func bucketizedKey(bucketID [4]byte, key []byte) []byte {
    // The serialized block index key format is:
    //   <bucketid><key>
    bKey := make([]byte, 4+len(key))
    copy(bKey, bucketID[:])
    copy(bKey[4:], key)
    return bKey
}

In other words, when a K/V pair is added to a bucket, its key is converted to the form "<bucketid><key>", marking the record as belonging to the bucket whose id is "<bucketid>". These two hierarchical key formats are how ffldb marks child Buckets and the K/V pairs inside them; by the time the K/V pairs are written to leveldb there is no bucket concept left, and all keys live in one flat namespace. The cursor bound to a bucket is likewise implemented on top of leveldb's Iterator; we do not analyze it separately, and interested readers can study it on their own. A small sketch of the resulting key layouts follows.
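
The following runnable sketch reproduces the two helpers quoted above to show the resulting byte layouts. The concrete value of bucketIndexPrefix and the bucket ids are assumptions for illustration (the real constants live in ffldb's db.go):

package main

import "fmt"

// bucketIndexPrefix's concrete value is an assumption for illustration; the
// real constant lives in btcd/database/ffldb/db.go.
var bucketIndexPrefix = []byte("bidx")

// bucketIndexKey mirrors the function quoted above.
func bucketIndexKey(parentID [4]byte, key []byte) []byte {
    indexKey := make([]byte, len(bucketIndexPrefix)+4+len(key))
    copy(indexKey, bucketIndexPrefix)
    copy(indexKey[len(bucketIndexPrefix):], parentID[:])
    copy(indexKey[len(bucketIndexPrefix)+4:], key)
    return indexKey
}

// bucketizedKey mirrors the function quoted above.
func bucketizedKey(bucketID [4]byte, key []byte) []byte {
    bKey := make([]byte, 4+len(key))
    copy(bKey, bucketID[:])
    copy(bKey[4:], key)
    return bKey
}

func main() {
    parentID := [4]byte{0, 0, 0, 0} // hypothetical parent bucket id
    childID := [4]byte{0, 0, 0, 9}  // hypothetical child bucket id

    // The record declaring the child bucket, stored in the parent; its
    // value would be childID.
    fmt.Printf("bucket index key: %q\n", bucketIndexKey(parentID, []byte("mybucket")))

    // A K/V pair stored inside the child bucket.
    fmt.Printf("bucketized key:   %q\n", bucketizedKey(childID, []byte("mykey")))
}

Also visible in bucket's Put() above: an added K/V pair first enters pendingKeys through the transaction's putKey() method: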

//btcd/database/ffldb/db.go

// putKey adds the provided key to the list of keys to be updated in the
// database when the transaction is committed.
//
// NOTE: This function must only be called on a writable transaction.  Since it
// is an internal helper function, it does not check.
func (tx *transaction) putKey(key, value []byte) error {
    // Prevent the key from being deleted if it was previously scheduled
    // to be deleted on transaction commit.
    tx.pendingRemove.Delete(key)

    // Add the key/value pair to the list to be written on transaction
    // commit.
    tx.pendingKeys.Put(key, value)
    tx.notifyActiveIters()
    return nil
}

Similarly, bucket's Delete() is implemented through the transaction's deleteKey() method, which adds the key to be deleted to pendingRemove; when the transaction commits, the keys in pendingKeys are finally written to leveldb and the keys in pendingRemove are deleted from it. bucket's Get() ultimately calls the transaction's fetchKey(), which searches pendingRemove and pendingKeys first and, on a miss, falls back to a snapshot of the dbCache, as the sketch below illustrates.
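
The lookup order just described can be captured in a toy reconstruction; here plain maps stand in for the treaps and the dbCache snapshot, so this is a sketch of the logic rather than ffldb's actual fetchKey():

package main

import "fmt"

// tx is a toy stand-in for ffldb's transaction, just enough to show the
// lookup order of fetchKey().
type tx struct {
    writable      bool
    pendingRemove map[string]bool
    pendingKeys   map[string][]byte
    snapshot      map[string][]byte
}

func (t *tx) fetchKey(key []byte) []byte {
    if t.writable {
        // Keys scheduled for deletion in this transaction are already gone.
        if t.pendingRemove[string(key)] {
            return nil
        }
        // Keys written in this transaction win over the snapshot.
        if v, ok := t.pendingKeys[string(key)]; ok {
            return v
        }
    }
    // Fall back to the dbCache snapshot taken when the tx was opened.
    return t.snapshot[string(key)]
}

func main() {
    t := &tx{
        writable:      true,
        pendingRemove: map[string]bool{"gone": true},
        pendingKeys:   map[string][]byte{"fresh": []byte("new")},
        snapshot: map[string][]byte{
            "gone": []byte("old"), "fresh": []byte("old"), "other": []byte("old"),
        },
    }
    fmt.Println(t.fetchKey([]byte("gone")))         // []: deleted in this tx
    fmt.Printf("%s\n", t.fetchKey([]byte("fresh"))) // new
    fmt.Printf("%s\n", t.fetchKey([]byte("other"))) // old
}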

The second group of transaction methods reads and writes blocks. We mainly analyze StoreBlock() and FetchBlock(), starting with StoreBlock():

//btcd/database/ffldb/db.go

// StoreBlock stores the provided block into the database.  There are no checks
// to ensure the block connects to a previous block, contains double spends, or
// any additional functionality such as transaction indexing.  It simply stores
// the block in the database.
//
// Returns the following errors as required by the interface contract:
//   - ErrBlockExists when the block hash already exists
//   - ErrTxNotWritable if attempted against a read-only transaction
//   - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) StoreBlock(block *btcutil.Block) error {

    ......

    // Reject the block if it already exists.
    blockHash := block.Hash()
    ......

    blockBytes, err := block.Bytes()
    ......

    // Add the block to be stored to the list of pending blocks to store
    // when the transaction is committed.  Also, add it to pending blocks
    // map so it is easy to determine the block is pending based on the
    // block hash.
    if tx.pendingBlocks == nil {
        tx.pendingBlocks = make(map[chainhash.Hash]int)
    }
    tx.pendingBlocks[*blockHash] = len(tx.pendingBlockData)
    tx.pendingBlockData = append(tx.pendingBlockData, pendingBlock{
        hash:  blockHash,
        bytes: blockBytes,
    })
    log.Tracef("Added block %s to pending blocks", blockHash)

    return nil
}

As you can see, StoreBlock() essentially appends the block to pendingBlockData, where it waits to be written to a file at commit time. Now let's look at FetchBlock():

//btcd/database/ffldb/db.go

// FetchBlock returns the raw serialized bytes for the block identified by the
// given hash.  The raw bytes are in the format returned by Serialize on a
// wire.MsgBlock.
//
// Returns the following errors as required by the interface contract:
//   - ErrBlockNotFound if the requested block hash does not exist
//   - ErrTxClosed if the transaction has already been closed
//   - ErrCorruption if the database has somehow become corrupted
//
// In addition, returns ErrDriverSpecific if any failures occur when reading the
// block files.
//
// NOTE: The data returned by this function is only valid during a database
// transaction.  Attempting to access it after a transaction has ended results
// in undefined behavior.  This constraint prevents additional data copies and
// allows support for memory-mapped database implementations.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) FetchBlock(hash *chainhash.Hash) ([]byte, error) {

    ......

    // When the block is pending to be written on commit return the bytes
    // from there.
    if idx, exists := tx.pendingBlocks[*hash]; exists {
        return tx.pendingBlockData[idx].bytes, nil
    }

    // Lookup the location of the block in the files from the block index.
    blockRow, err := tx.fetchBlockRow(hash)
    if err != nil {
        return nil, err
    }
    location := deserializeBlockLoc(blockRow)

    // Read the block from the appropriate location.  The function also
    // performs a checksum over the data to detect data corruption.
    blockBytes, err := tx.db.store.readBlock(hash, location)
    if err != nil {
        return nil, err
    }

    return blockBytes, nil
}

When reading a block, pendingBlocks is searched first; on a hit, the bytes are returned directly from pendingBlockData. Otherwise the block is read through db's blockStore. We will not dive into blockStore yet, returning to it after covering the transaction's Commit. The crucial observation is that whenever a transaction reads or writes metadata or blocks, it goes through pendingBlocks, or pendingKeys and pendingRemove, first: they can be viewed as the transaction's buffers, synced to the files or to leveldb at commit time. Commit() ultimately calls writePendingAndCommit() to do the actual work:

//btcd/database/ffldb/db.go

// writePendingAndCommit writes pending block data to the flat block files,
// updates the metadata with their locations as well as the new current write
// location, and commits the metadata to the memory database cache.  It also
// properly handles rollback in the case of failures.
//
// This function MUST only be called when there is pending data to be written.
func (tx *transaction) writePendingAndCommit() error {

    ......

    // Loop through all of the pending blocks to store and write them.
    for _, blockData := range tx.pendingBlockData {
        log.Tracef("Storing block %s", blockData.hash)
        location, err := tx.db.store.writeBlock(blockData.bytes)
        if err != nil {
            rollback()
            return err
        }

        // Add a record in the block index for the block.  The record
        // includes the location information needed to locate the block
        // on the filesystem as well as the block header since they are
        // so commonly needed.
        blockHdr := blockData.bytes[0:blockHdrSize]
        blockRow := serializeBlockRow(location, blockHdr)
        err = tx.blockIdxBucket.Put(blockData.hash[:], blockRow)
        if err != nil {
            rollback()
            return err
        }
    }

    // Update the metadata for the current write file and offset.
    writeRow := serializeWriteRow(wc.curFileNum, wc.curOffset)
    if err := tx.metaBucket.Put(writeLocKeyName, writeRow); err != nil {
        rollback()
        return convertErr("failed to store write cursor", err)
    }

    // Atomically update the database cache.  The cache automatically
    // handles flushing to the underlying persistent storage database.
    return tx.db.cache.commitTx(tx)
}

writePendingAndCommit() mainly does the following:

  1. Writes the blocks in pendingBlockData to files through the blockStore, and stores each block's hash together with its location in the files in blockIdxBucket for later lookups (see the location sketch after this list);
  2. Updates the K/V pair in metaBucket that records the current write file and offset;
  3. Calls dbCache's commitTx() to write the pending K/V pairs into the treap cache and, when necessary, into leveldb.
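
As an aside, the location row stored in blockIdxBucket boils down to a few fixed-width integers. The sketch below shows one plausible encoding of blockLocation, assuming three consecutive 32-bit little-endian words (file number, file offset, block length); this layout is an assumption for illustration, and the real serializeBlockLoc()/deserializeBlockLoc(), plus the block row that also appends the header, live in btcd/database/ffldb/blockio.go:

package main

import (
    "encoding/binary"
    "fmt"
)

// blockLocation identifies where a block lives on disk, mirroring the three
// fields used by FetchBlock() and writeBlock() above.
type blockLocation struct {
    blockFileNum uint32
    fileOffset   uint32
    blockLen     uint32
}

// serializeBlockLoc packs the location into 12 bytes (assumed layout).
func serializeBlockLoc(loc blockLocation) []byte {
    var buf [12]byte
    binary.LittleEndian.PutUint32(buf[0:4], loc.blockFileNum)
    binary.LittleEndian.PutUint32(buf[4:8], loc.fileOffset)
    binary.LittleEndian.PutUint32(buf[8:12], loc.blockLen)
    return buf[:]
}

// deserializeBlockLoc is the inverse of serializeBlockLoc.
func deserializeBlockLoc(row []byte) blockLocation {
    return blockLocation{
        blockFileNum: binary.LittleEndian.Uint32(row[0:4]),
        fileOffset:   binary.LittleEndian.Uint32(row[4:8]),
        blockLen:     binary.LittleEndian.Uint32(row[8:12]),
    }
}

func main() {
    loc := blockLocation{blockFileNum: 2, fileOffset: 4096, blockLen: 285}
    row := serializeBlockLoc(loc)
    fmt.Printf("row:  %x\n", row)
    fmt.Printf("back: %+v\n", deserializeBlockLoc(row))
}
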
blockStore

When a transaction reads or writes metadata or blocks, the work ultimately flows through the blockStore, which reads and writes the flat files, or the dbCache, which reads and writes the treaps and leveldb. So next we focus on blockStore and dbCache, starting with blockStore's definition:

//btcd/database/ffldb/blockio.go

// blockStore houses information used to handle reading and writing blocks (and
// part of blocks) into flat files with support for multiple concurrent readers.
type blockStore struct {
    // network is the specific network to use in the flat files for each
    // block.
    network wire.BitcoinNet

    // basePath is the base path used for the flat block files and metadata.
    basePath string

    // maxBlockFileSize is the maximum size for each file used to store
    // blocks.  It is defined on the store so the whitebox tests can
    // override the value.
    maxBlockFileSize uint32

    // The following fields are related to the flat files which hold the
    // actual blocks.   The number of open files is limited by maxOpenFiles.
    //
    // obfMutex protects concurrent access to the openBlockFiles map.  It is
    // a RWMutex so multiple readers can simultaneously access open files.
    //
    // openBlockFiles houses the open file handles for existing block files
    // which have been opened read-only along with an individual RWMutex.
    // This scheme allows multiple concurrent readers to the same file while
    // preventing the file from being closed out from under them.
    //
    // lruMutex protects concurrent access to the least recently used list
    // and lookup map.
    //
    // openBlocksLRU tracks how the open files are refenced by pushing the
    // most recently used files to the front of the list thereby trickling
    // the least recently used files to end of the list.  When a file needs
    // to be closed due to exceeding the the max number of allowed open
    // files, the one at the end of the list is closed.
    //
    // fileNumToLRUElem is a mapping between a specific block file number
    // and the associated list element on the least recently used list.
    //
    // Thus, with the combination of these fields, the database supports
    // concurrent non-blocking reads across multiple and individual files
    // along with intelligently limiting the number of open file handles by
    // closing the least recently used files as needed.
    //
    // NOTE: The locking order used throughout is well-defined and MUST be
    // followed.  Failure to do so could lead to deadlocks.  In particular,
    // the locking order is as follows:
    //   1) obfMutex
    //   2) lruMutex
    //   3) writeCursor mutex
    //   4) specific file mutexes
    //
    // None of the mutexes are required to be locked at the same time, and
    // often aren't.  However, if they are to be locked simultaneously, they
    // MUST be locked in the order previously specified.
    //
    // Due to the high performance and multi-read concurrency requirements,
    // write locks should only be held for the minimum time necessary.
    obfMutex         sync.RWMutex
    lruMutex         sync.Mutex
    openBlocksLRU    *list.List // Contains uint32 block file numbers.
    fileNumToLRUElem map[uint32]*list.Element
    openBlockFiles   map[uint32]*lockableFile

    // writeCursor houses the state for the current file and location that
    // new blocks are written to.
    writeCursor *writeCursor

    // These functions are set to openFile, openWriteFile, and deleteFile by
    // default, but are exposed here to allow the whitebox tests to replace
    // them when working with mock files.
    openFileFunc      func(fileNum uint32) (*lockableFile, error)
    openWriteFileFunc func(fileNum uint32) (filer, error)
    deleteFileFunc    func(fileNum uint32) error
}

Its fields are as follows:

  • network: the network the blocks belong to, such as MainNet, TestNet, or SimNet; each block written to a file is tagged with its network;
  • basePath: the directory on disk where the block files are stored;
  • maxBlockFileSize: the maximum size of each block file;
  • obfMutex: the read-write lock protecting openBlockFiles;
  • lruMutex: the mutex protecting openBlocksLRU and fileNumToLRUElem;
  • openBlocksLRU: an LRU list of the numbers of the open files; the default maximum number of open files is 25;
  • fileNumToLRUElem: maps file numbers to their elements in openBlocksLRU;
  • openBlockFiles: maps the numbers of all open read-only files to their file handles;
  • writeCursor: points to the file currently being written, recording its file number and write offset;
  • openFileFunc, openWriteFileFunc, and deleteFileFunc: hooks for openFile, openWriteFile, and deleteFile, mainly used by tests; by default they are blockStore's corresponding methods.

Let's examine how blockStore works through its readBlock() and writeBlock() methods, starting with readBlock():

//btcd/database/ffldb/blockio.go

// readBlock reads the specified block record and returns the serialized block.
// It ensures the integrity of the block data by checking that the serialized
// network matches the current network associated with the block store and
// comparing the calculated checksum against the one stored in the flat file.
// This function also automatically handles all file management such as opening
// and closing files as necessary to stay within the maximum allowed open files
// limit.
//
// Returns ErrDriverSpecific if the data fails to read for any reason and
// ErrCorruption if the checksum of the read data doesn't match the checksum
// read from the file.
//
// Format: <network><block length><serialized block><checksum>
func (s *blockStore) readBlock(hash *chainhash.Hash, loc blockLocation) ([]byte, error) {
    // Get the referenced block file handle opening the file as needed.  The
    // function also handles closing files as needed to avoid going over the
    // max allowed open files.
    blockFile, err := s.blockFile(loc.blockFileNum)
    if err != nil {
        return nil, err
    }

    serializedData := make([]byte, loc.blockLen)
    n, err := blockFile.file.ReadAt(serializedData, int64(loc.fileOffset))
    blockFile.RUnlock()
    if err != nil {
        str := fmt.Sprintf("failed to read block %s from file %d, "+
            "offset %d: %v", hash, loc.blockFileNum, loc.fileOffset,
            err)
        return nil, makeDbErr(database.ErrDriverSpecific, str, err)
    }

    // Calculate the checksum of the read data and ensure it matches the
    // serialized checksum.  This will detect any data corruption in the
    // flat file without having to do much more expensive merkle root
    // calculations on the loaded block.
    serializedChecksum := binary.BigEndian.Uint32(serializedData[n-4:])
    calculatedChecksum := crc32.Checksum(serializedData[:n-4], castagnoli)
    if serializedChecksum != calculatedChecksum {
        str := fmt.Sprintf("block data for block %s checksum "+
            "does not match - got %x, want %x", hash,
            calculatedChecksum, serializedChecksum)
        return nil, makeDbErr(database.ErrCorruption, str, nil)
    }

    // The network associated with the block must match the current active
    // network, otherwise somebody probably put the block files for the
    // wrong network in the directory.
    serializedNet := byteOrder.Uint32(serializedData[:4])
    if serializedNet != uint32(s.network) {
        str := fmt.Sprintf("block data for block %s is for the "+
            "wrong network - got %d, want %d", hash, serializedNet,
            uint32(s.network))
        return nil, makeDbErr(database.ErrDriverSpecific, str, nil)
    }

    // The raw block excludes the network, length of the block, and
    // checksum.
    return serializedData[8 : n-4], nil
}

Its main steps are:

  1. Obtains a file handle via blockFile(), which returns an already open file or opens a new one;
  2. Reads the block record from offset loc.fileOffset of the file via file.ReadAt(); the record format is "<network><block length><serialized block><checksum>";
  3. Extracts the raw block bytes from the record.

The most interesting step is obtaining a file handle through blockFile(); here is its implementation:

//btcd/database/ffldb/blockio.go

// blockFile attempts to return an existing file handle for the passed flat file
// number if it is already open as well as marking it as most recently used.  It
// will also open the file when it's not already open subject to the rules
// described in openFile.
//
// NOTE: The returned block file will already have the read lock acquired and
// the caller MUST call .RUnlock() to release it once it has finished all read
// operations.  This is necessary because otherwise it would be possible for a
// separate goroutine to close the file after it is returned from here, but
// before the caller has acquired a read lock.
func (s *blockStore) blockFile(fileNum uint32) (*lockableFile, error) {
    // When the requested block file is open for writes, return it.
    wc := s.writeCursor
    wc.RLock()
    if fileNum == wc.curFileNum && wc.curFile.file != nil {
        obf := wc.curFile
        obf.RLock()
        wc.RUnlock()
        return obf, nil
    }
    wc.RUnlock()

    // Try to return an open file under the overall files read lock.
    s.obfMutex.RLock()
    if obf, ok := s.openBlockFiles[fileNum]; ok {
        s.lruMutex.Lock()
        s.openBlocksLRU.MoveToFront(s.fileNumToLRUElem[fileNum])
        s.lruMutex.Unlock()

        obf.RLock()
        s.obfMutex.RUnlock()
        return obf, nil
    }
    s.obfMutex.RUnlock()

    // Since the file isn't open already, need to check the open block files
    // map again under write lock in case multiple readers got here and a
    // separate one is already opening the file.
    s.obfMutex.Lock()                                                               (1)
    if obf, ok := s.openBlockFiles[fileNum]; ok {
        obf.RLock()
        s.obfMutex.Unlock()
        return obf, nil
    }

    // The file isn't open, so open it while potentially closing the least
    // recently used one as needed.
    obf, err := s.openFileFunc(fileNum)
    if err != nil {
        s.obfMutex.Unlock()
        return nil, err
    }
    obf.RLock()
    s.obfMutex.Unlock()
    return obf, nil
}

Its main steps are:

  1. Checks whether the requested file is the one writeCursor points to, and if so returns it directly. Note that writeCursor is accessed under its read lock, and the lockableFile returned by blockFile() is already read-locked; the caller is responsible for releasing that lock. When the returned file is the one writeCursor points to, blocks are currently being written into it and it will be closed once full; the read lock guarantees that closing the file must wait for the reads to finish;
  2. Next, looks the file up in blockStore's openBlockFiles; on a hit, moves the file to the front of the LRU list, takes the file's read lock, and returns it;
  3. At (1) in the code, takes the s.obfMutex write lock and searches openBlockFiles again. This guards against the window right after the first lookup in which another thread may have opened the target file and added it to openBlockFiles; without this re-check, missing in openBlockFiles and then opening the file could open the same file more than once. One might ask: why not hold the s.obfMutex write lock for the first lookup as well? This, too, is about read/write concurrency on openBlockFiles: it holds recently opened files, so the first lookup has a good chance of hitting, and protecting it with only the read lock of s.obfMutex allows many more concurrent lookups;
  4. If the target file is not in openBlockFiles, calls openFile() to open it; note that the entire openFile() call runs under the s.obfMutex write lock.

//btcd/database/ffldb/blockio.go

// openFile returns a read-only file handle for the passed flat file number.
// The function also keeps track of the open files, performs least recently
// used tracking, and limits the number of open files to maxOpenFiles by closing
// the least recently used file as needed.
//
// This function MUST be called with the overall files mutex (s.obfMutex) locked
// for WRITES.
func (s *blockStore) openFile(fileNum uint32) (*lockableFile, error) {
    // Open the appropriate file as read-only.
    filePath := blockFilePath(s.basePath, fileNum)
    file, err := os.Open(filePath)
    if err != nil {
        return nil, makeDbErr(database.ErrDriverSpecific, err.Error(),
            err)
    }
    blockFile := &lockableFile{file: file}

    // Close the least recently used file if the file exceeds the max
    // allowed open files.  This is not done until after the file open in
    // case the file fails to open, there is no need to close any files.
    //
    // A write lock is required on the LRU list here to protect against
    // modifications happening as already open files are read from and
    // shuffled to the front of the list.
    //
    // Also, add the file that was just opened to the front of the least
    // recently used list to indicate it is the most recently used file and
    // therefore should be closed last.
    s.lruMutex.Lock()
    lruList := s.openBlocksLRU
    if lruList.Len() >= maxOpenFiles {
        lruFileNum := lruList.Remove(lruList.Back()).(uint32)
        oldBlockFile := s.openBlockFiles[lruFileNum]

        // Close the old file under the write lock for the file in case
        // any readers are currently reading from it so it's not closed
        // out from under them.
        oldBlockFile.Lock()
        _ = oldBlockFile.file.Close()
        oldBlockFile.Unlock()

        delete(s.openBlockFiles, lruFileNum)
        delete(s.fileNumToLRUElem, lruFileNum)
    }
    s.fileNumToLRUElem[fileNum] = lruList.PushFront(fileNum)
    s.lruMutex.Unlock()

    // Store a reference to it in the open block files map.
    s.openBlockFiles[fileNum] = blockFile

    return blockFile, nil
}

openFile() mainly does the following:

  1. Opens the target file read-only with a plain os.Open() call;
  2. Checks whether openBlocksLRU is full; if so, removes the element at the tail of the list, closes the corresponding file and deletes it from openBlockFiles, then pushes the newly opened file to the front of the list; all access to openBlocksLRU and fileNumToLRUElem happens under s.lruMutex;
  3. Puts the newly opened file into openBlockFiles.

As openFile() shows, blockStore maintains an LRU cache of the open read-only files via openBlockFiles, openBlocksLRU, and fileNumToLRUElem, which speeds up reading blocks from files; a generic sketch of the pattern follows.
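
The bookkeeping pattern is worth isolating. Below is a generic, self-contained sketch of an LRU list plus lookup map, the same combination blockStore uses (locking and the actual file handles are omitted; maxOpen is set to 3 here, whereas blockStore's maxOpenFiles defaults to 25):

package main

import (
    "container/list"
    "fmt"
)

const maxOpen = 3 // blockStore's real limit defaults to 25

// lruCache pairs an LRU list with a number->element map, as blockStore does
// with openBlocksLRU and fileNumToLRUElem.
type lruCache struct {
    lru   *list.List               // front = most recently used file number
    elems map[uint32]*list.Element // file number -> list element
}

func newLRUCache() *lruCache {
    return &lruCache{lru: list.New(), elems: make(map[uint32]*list.Element)}
}

// touch marks fileNum as most recently used, evicting the least recently
// used entry when the cache is full; blockStore would close the evicted file.
func (c *lruCache) touch(fileNum uint32) (evicted uint32, didEvict bool) {
    if elem, ok := c.elems[fileNum]; ok {
        c.lru.MoveToFront(elem)
        return 0, false
    }
    if c.lru.Len() >= maxOpen {
        evicted = c.lru.Remove(c.lru.Back()).(uint32)
        delete(c.elems, evicted)
        didEvict = true
    }
    c.elems[fileNum] = c.lru.PushFront(fileNum)
    return evicted, didEvict
}

func main() {
    c := newLRUCache()
    for _, n := range []uint32{1, 2, 3, 1, 4} {
        if ev, ok := c.touch(n); ok {
            fmt.Printf("open %d -> close %d\n", n, ev)
        } else {
            fmt.Printf("open/use %d\n", n)
        }
    }
    // Ends with: open 4 -> close 2 (2 was the least recently used).
}

Next, let's look at writeBlock():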

//btcd/database/ffldb/blockio.go

// writeBlock appends the specified raw block bytes to the store's write cursor
// location and increments it accordingly.  When the block would exceed the max
// file size for the current flat file, this function will close the current
// file, create the next file, update the write cursor, and write the block to
// the new file.
//
// The write cursor will also be advanced the number of bytes actually written
// in the event of failure.
//
// Format: <network><block length><serialized block><checksum>
func (s *blockStore) writeBlock(rawBlock []byte) (blockLocation, error) {
    // Compute how many bytes will be written.
    // 4 bytes each for block network + 4 bytes for block length +
    // length of raw block + 4 bytes for checksum.
    blockLen := uint32(len(rawBlock))
    fullLen := blockLen + 12

    // Move to the next block file if adding the new block would exceed the
    // max allowed size for the current block file.  Also detect overflow
    // to be paranoid, even though it isn't possible currently, numbers
    // might change in the future to make it possible.
    //
    // NOTE: The writeCursor.offset field isn't protected by the mutex
    // since it's only read/changed during this function which can only be
    // called during a write transaction, of which there can be only one at
    // a time.
    wc := s.writeCursor
    finalOffset := wc.curOffset + fullLen
    if finalOffset < wc.curOffset || finalOffset > s.maxBlockFileSize {
        // This is done under the write cursor lock since the curFileNum
        // field is accessed elsewhere by readers.
        //
        // Close the current write file to force a read-only reopen
        // with LRU tracking.  The close is done under the write lock
        // for the file to prevent it from being closed out from under
        // any readers currently reading from it.
        wc.Lock()
        wc.curFile.Lock()                                                    (1)
        if wc.curFile.file != nil {
            _ = wc.curFile.file.Close()
            wc.curFile.file = nil
        }
        wc.curFile.Unlock()

        // Start writes into next file.
        wc.curFileNum++                                                      (2)
        wc.curOffset = 0                                                     (3)
        wc.Unlock()
    }

    // All writes are done under the write lock for the file to ensure any
    // readers are finished and blocked first.
    wc.curFile.Lock()
    defer wc.curFile.Unlock()

    // Open the current file if needed.  This will typically only be the
    // case when moving to the next file to write to or on initial database
    // load.  However, it might also be the case if rollbacks happened after
    // file writes started during a transaction commit.
    if wc.curFile.file == nil {
        file, err := s.openWriteFileFunc(wc.curFileNum)                      (4)
        if err != nil {
            return blockLocation{}, err
        }
        wc.curFile.file = file
    }

    // Bitcoin network.
    origOffset := wc.curOffset                                               (5)
    hasher := crc32.New(castagnoli)
    var scratch [4]byte
    byteOrder.PutUint32(scratch[:], uint32(s.network))
    if err := s.writeData(scratch[:], "network"); err != nil {
        return blockLocation{}, err
    }
    _, _ = hasher.Write(scratch[:])

    // Block length.
    byteOrder.PutUint32(scratch[:], blockLen)
    if err := s.writeData(scratch[:], "block length"); err != nil {
        return blockLocation{}, err
    }
    _, _ = hasher.Write(scratch[:])

    // Serialized block.
    if err := s.writeData(rawBlock[:], "block"); err != nil {
        return blockLocation{}, err
    }
    _, _ = hasher.Write(rawBlock)

    // Castagnoli CRC-32 as a checksum of all the previous.
    if err := s.writeData(hasher.Sum(nil), "checksum"); err != nil {
        return blockLocation{}, err
    }

    loc := blockLocation{                                                    (6)
        blockFileNum: wc.curFileNum,
        fileOffset:   origOffset,
        blockLen:     fullLen,
    }
    return loc, nil
}

Its main steps are:

  1. Checks whether writing the block would exceed the maximum file size; if so, closes the current file and creates a new one; otherwise the block is simply written at offset wc.curOffset of the current file;
  2. At (1), the file writeCursor points to is closed; before Close() is called, the lockableFile's write lock is taken, in case other threads are still reading the file;
  3. At (2), writeCursor is advanced to the next file, and at (3) the in-file offset is reset;
  4. At (4), openWriteFile() opens or creates a file in read-write mode, and writeCursor is pointed at it;
  5. At (5), the starting offset of the block within the file is recorded, and the block data is then written out;
  6. The network, the block length, the block data, and the crc32 checksum of the previous three items are written in order, so the on-disk record format is "<network><block length><serialized block><checksum>" (a serialization sketch follows this list);
  7. At (6), the blockLocation for the written block is built from the number of the file holding the block, the starting offset of the block within that file, and the length of the wrapped record, and is returned.
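
To pin down the record format, here is a runnable sketch of the wrapping performed above. The 4-byte network and length words are assumed little-endian here (ffldb's byteOrder); the Castagnoli crc32 is appended by hash.Hash32's Sum(), whose big-endian byte order matches the binary.BigEndian read in readBlock():

package main

import (
    "encoding/binary"
    "fmt"
    "hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// wrapBlock produces "<network><block length><serialized block><checksum>".
func wrapBlock(network uint32, rawBlock []byte) []byte {
    record := make([]byte, 0, len(rawBlock)+12)

    var scratch [4]byte
    binary.LittleEndian.PutUint32(scratch[:], network)
    record = append(record, scratch[:]...) // <network>

    binary.LittleEndian.PutUint32(scratch[:], uint32(len(rawBlock)))
    record = append(record, scratch[:]...) // <block length>

    record = append(record, rawBlock...) // <serialized block>

    // The checksum covers everything written so far; Sum appends its four
    // bytes to the record: <checksum>.
    hasher := crc32.New(castagnoli)
    hasher.Write(record)
    return hasher.Sum(record)
}

func main() {
    record := wrapBlock(0xd9b4bef9, []byte("raw block bytes")) // mainnet magic
    fmt.Printf("%d bytes: %x\n", len(record), record)
}
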
dbCache

readBlock() and writeBlock() give us essentially the whole picture of blockStore: it manages the already open read-only files through an LRU list and tracks the file currently being written, together with the offset inside it, through writeCursor; when writing a block would exceed the configured maximum file size, writing continues in a fresh file. With that understood, the rest of blockStore's code poses no difficulty. Next we analyze the dbCache code, starting with its definition:

//btcd/database/ffldb/dbcache.go

// dbCache provides a database cache layer backed by an underlying database.  It
// allows a maximum cache size and flush interval to be specified such that the
// cache is flushed to the database when the cache size exceeds the maximum
// configured value or it has been longer than the configured interval since the
// last flush.  This effectively provides transaction batching so that callers
// can commit transactions at will without incurring large performance hits due
// to frequent disk syncs.
type dbCache struct {
    // ldb is the underlying leveldb DB for metadata.
    ldb *leveldb.DB

    // store is used to sync blocks to flat files.
    store *blockStore

    // The following fields are related to flushing the cache to persistent
    // storage.  Note that all flushing is performed in an opportunistic
    // fashion.  This means that it is only flushed during a transaction or
    // when the database cache is closed.
    //
    // maxSize is the maximum size threshold the cache can grow to before
    // it is flushed.
    //
    // flushInterval is the threshold interval of time that is allowed to
    // pass before the cache is flushed.
    //
    // lastFlush is the time the cache was last flushed.  It is used in
    // conjunction with the current time and the flush interval.
    //
    // NOTE: These flush related fields are protected by the database write
    // lock.
    maxSize       uint64
    flushInterval time.Duration
    lastFlush     time.Time

    // The following fields hold the keys that need to be stored or deleted
    // from the underlying database once the cache is full, enough time has
    // passed, or when the database is shutting down.  Note that these are
    // stored using immutable treaps to support O(1) MVCC snapshots against
    // the cached data.  The cacheLock is used to protect concurrent access
    // for cache updates and snapshots.
    cacheLock    sync.RWMutex
    cachedKeys   *treap.Immutable
    cachedRemove *treap.Immutable
}

Its fields are as follows:

  • ldb: the leveldb DB object, used to store and fetch K/V pairs in leveldb;
  • store: the blockStore of the current db; before metadata is written to leveldb, the block buffers are forced to disk through it;
  • maxSize: put simply, the limit on the total size of the cached metadata to add and delete; the default is 100 MB;
  • flushInterval: the interval between writes to leveldb;
  • lastFlush: the timestamp of the last write to leveldb;
  • cacheLock: protects reads and writes of cachedKeys and cachedRemove, which are replaced when the dbCache writes to leveldb and read when the dbCache is snapshotted;
  • cachedKeys: caches the keys to be added; it points to a treap;
  • cachedRemove: caches the keys to be deleted; it also points to a treap. Note the difference from the transaction's pendingKeys and pendingRemove: those are mutable treaps (*treap.Mutable), while cachedKeys and cachedRemove are immutable treaps (*treap.Immutable), and normally (when needsFlush() is false) pendingKeys and pendingRemove are first merged into cachedKeys and cachedRemove and only later written to leveldb. dbCache's commitTx() will make this clearer. treap.Mutable and treap.Immutable are covered at the end of this article.

We saw in the transaction's writePendingAndCommit() method that the last step of a transaction commit is calling dbCache's commitTx() to commit the metadata updates, so we look at commitTx() first:

//btcd/database/ffldb/dbcache.go

// commitTx atomically adds all of the pending keys to add and remove into the
// database cache.  When adding the pending keys would cause the size of the
// cache to exceed the max cache size, or the time since the last flush exceeds
// the configured flush interval, the cache will be flushed to the underlying
// persistent database.
//
// This is an atomic operation with respect to the cache in that either all of
// the pending keys to add and remove in the transaction will be applied or none
// of them will.
//
// The database cache itself might be flushed to the underlying persistent
// database even if the transaction fails to apply, but it will only be the
// state of the cache without the transaction applied.
//
// This function MUST be called during a database write transaction which in
// turn implies the database write lock will be held.
func (c *dbCache) commitTx(tx *transaction) error {
    // Flush the cache and write the current transaction directly to the
    // database if a flush is needed.
    if c.needsFlush(tx) {                                                     (1)
        if err := c.flush(); err != nil {                                     (2)
            return err
        }

        // Perform all leveldb updates using an atomic transaction.
        err := c.commitTreaps(tx.pendingKeys, tx.pendingRemove)               (3)
        if err != nil {
            return err
        }

        // Clear the transaction entries since they have been committed.
        tx.pendingKeys = nil
        tx.pendingRemove = nil
        return nil
    }

    // At this point a database flush is not needed, so atomically commit
    // the transaction to the cache.

    // Since the cached keys to be added and removed use an immutable treap,
    // a snapshot is simply obtaining the root of the tree under the lock
    // which is used to atomically swap the root.
    c.cacheLock.RLock()
    newCachedKeys := c.cachedKeys
    newCachedRemove := c.cachedRemove
    c.cacheLock.RUnlock()

    // Apply every key to add in the database transaction to the cache.
    tx.pendingKeys.ForEach(func(k, v []byte) bool {                           (5)
        newCachedRemove = newCachedRemove.Delete(k)
        newCachedKeys = newCachedKeys.Put(k, v)
        return true
    })
    tx.pendingKeys = nil

    // Apply every key to remove in the database transaction to the cache.
    tx.pendingRemove.ForEach(func(k, v []byte) bool {                         (6)
        newCachedKeys = newCachedKeys.Delete(k)
        newCachedRemove = newCachedRemove.Put(k, nil)
        return true
    })
    tx.pendingRemove = nil

    // Atomically replace the immutable treaps which hold the cached keys to
    // add and delete.
    c.cacheLock.Lock()
    c.cachedKeys = newCachedKeys                                              (7)
    c.cachedRemove = newCachedRemove
    c.cacheLock.Unlock()
    return nil
}

Its main steps are:

  1. If the time since the last flush exceeds the flush interval, or the dbCache has grown past its maximum size, flush() is called to write the treap caches to leveldb, and the transaction's keys to add and remove are then written directly to leveldb via commitTreaps(); pendingKeys and pendingRemove are cleared once written;
  2. If no flush is needed, at (5) and (6) the transaction's pendingKeys are applied to newCachedKeys and its pendingRemove to newCachedRemove, i.e. the keys the tx adds and deletes are written into the dbCache. Two points deserve attention: 1) when adding a key from pendingKeys to newCachedKeys, the same key must first be removed from newCachedRemove, lest the key be deleted when the cache is written to leveldb; likewise, when adding a key to newCachedRemove, the same key must be removed from newCachedKeys, lest a key meant for deletion be written to leveldb; 2) cachedKeys and cachedRemove are treap.Immutable pointers, and so are newCachedKeys and newCachedRemove. A treap.Immutable implements a copy-on-write (COW) scheme to improve read/write concurrency: when Put() or Delete() updates nodes, the nodes that change are copied, and together with the unchanged old nodes the copies form a new treap, which is returned. At (5) and (6), newCachedKeys and newCachedRemove are re-pointed at the return values of Delete() or Put(), that is, at new treaps, while c.cachedKeys and c.cachedRemove still point at the treaps as they were before the changes. A snapshot taken via Snapshot() at this moment therefore does not yet contain the transaction's pendingKeys and pendingRemove. This can be seen as dbCache's MVCC implementation.
  3. Finally, at (7), c.cachedKeys and c.cachedRemove are updated, under the write lock of c.cacheLock. After this update, a dbCache snapshot obtained through Snapshot() does contain the pendingKeys and pendingRemove the transaction committed.

Next, let's look at the implementation of flush():

//btcd/database/ffldb/dbcache.go

// flush flushes the database cache to persistent storage.  This involes syncing
// the block store and replaying all transactions that have been applied to the
// cache to the underlying database.
//
// This function MUST be called with the database write lock held.
func (c *dbCache) flush() error {
    c.lastFlush = time.Now()

    // Sync the current write file associated with the block store.  This is
    // necessary before writing the metadata to prevent the case where the
    // metadata contains information about a block which actually hasn't
    // been written yet in unexpected shutdown scenarios.
    if err := c.store.syncBlocks(); err != nil {                              (1)
        return err
    }

    // Since the cached keys to be added and removed use an immutable treap,
    // a snapshot is simply obtaining the root of the tree under the lock
    // which is used to atomically swap the root.
    c.cacheLock.RLock()
    cachedKeys := c.cachedKeys
    cachedRemove := c.cachedRemove
    c.cacheLock.RUnlock()

    // Nothing to do if there is no data to flush.
    if cachedKeys.Len() == 0 && cachedRemove.Len() == 0 {
        return nil
    }

    // Perform all leveldb updates using an atomic transaction.
    if err := c.commitTreaps(cachedKeys, cachedRemove); err != nil {         (2)
        return err
    }

    // Clear the cache since it has been flushed.
    c.cacheLock.Lock()
    c.cachedKeys = treap.NewImmutable()                                      (3)
    c.cachedRemove = treap.NewImmutable()
    c.cacheLock.Unlock()

    return nil
}

Its main steps are:

  1. Calls blockStore's syncBlocks() to force the file buffers to disk, preventing the metadata from describing blocks that are not actually in the block files yet;
  2. Writes the dbCache's cached treaps to leveldb via commitTreaps();
  3. Resets cachedKeys and cachedRemove to empty treaps, which effectively empties the dbCache.

dbCache's commitTreaps() is straightforward: it walks cachedKeys and cachedRemove in turn and applies them to leveldb with Put and Delete calls, so we will not analyze it separately; a sketch of the idea follows, and readers can consult the source for the details.
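
The idea can be sketched in a few lines. The following standalone example replays an "add" treap and a "remove" treap into leveldb, with a leveldb.Batch standing in for the atomic update ffldb performs; the treapForEacher interface and mapTreap type are local stand-ins, not ffldb types:

package main

import (
    "fmt"

    "github.com/syndtr/goleveldb/leveldb"
    "github.com/syndtr/goleveldb/leveldb/storage"
)

// treapForEacher is the minimal behavior the sketch needs; both treap.Mutable
// and treap.Immutable provide such a ForEach.
type treapForEacher interface {
    ForEach(fn func(k, v []byte) bool)
}

// commitTreapsSketch applies the keys to add and the keys to remove to
// leveldb in one atomic batch.
func commitTreapsSketch(ldb *leveldb.DB, keys, remove treapForEacher) error {
    batch := new(leveldb.Batch)
    keys.ForEach(func(k, v []byte) bool {
        batch.Put(k, v)
        return true
    })
    remove.ForEach(func(k, v []byte) bool {
        batch.Delete(k)
        return true
    })
    return ldb.Write(batch, nil)
}

// mapTreap is a toy ForEach-able stand-in for a treap.
type mapTreap map[string][]byte

func (m mapTreap) ForEach(fn func(k, v []byte) bool) {
    for k, v := range m {
        if !fn([]byte(k), v) {
            return
        }
    }
}

func main() {
    ldb, err := leveldb.Open(storage.NewMemStorage(), nil)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer ldb.Close()

    adds := mapTreap{"a": []byte("1")}
    removes := mapTreap{"b": nil}
    if err := commitTreapsSketch(ldb, adds, removes); err != nil {
        fmt.Println(err)
        return
    }
    v, _ := ldb.Get([]byte("a"), nil)
    fmt.Printf("a = %s\n", v)
}

Now let's look at dbCache's Snapshot():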

//btcd/database/ffldb/dbcache.go

// Snapshot returns a snapshot of the database cache and underlying database at
// a particular point in time.
//
// The snapshot must be released after use by calling Release.
func (c *dbCache) Snapshot() (*dbCacheSnapshot, error) {
    dbSnapshot, err := c.ldb.GetSnapshot()
    if err != nil {
        str := "failed to open transaction"
        return nil, convertErr(str, err)
    }

    // Since the cached keys to be added and removed use an immutable treap,
    // a snapshot is simply obtaining the root of the tree under the lock
    // which is used to atomically swap the root.
    c.cacheLock.RLock()
    cacheSnapshot := &dbCacheSnapshot{
        dbSnapshot:    dbSnapshot,
        pendingKeys:   c.cachedKeys,
        pendingRemove: c.cachedRemove,
    }
    c.cacheLock.RUnlock()
    return cacheSnapshot, nil
}

As you can see, it simply builds a dbCacheSnapshot object out of a leveldb Snapshot plus c.cachedKeys and c.cachedRemove. When a key is looked up in a dbCacheSnapshot, cachedKeys and cachedRemove are consulted first, and then the leveldb Snapshot, as sketched below. The transaction's snapshot field points at exactly such an object.
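
Here is a sketch of that lookup order (a simplified reconstruction of dbCacheSnapshot's Get(); the real method lives in btcd/database/ffldb/dbcache.go, and the field names follow the constructor above):

// Get sketches how a dbCacheSnapshot resolves a key: cached deletions hide
// older values, cached additions win over leveldb, and only then is the
// point-in-time leveldb snapshot consulted.
func (snap *dbCacheSnapshot) Get(key []byte) []byte {
    if snap.pendingRemove.Has(key) {
        return nil
    }
    if value := snap.pendingKeys.Get(key); value != nil {
        return value
    }
    value, err := snap.dbSnapshot.Get(key, nil)
    if err != nil {
        return nil
    }
    return value
}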

treap

The analysis of the methods above makes clear how dbCache caches keys, flushes its cache, and serves reads from it. The data structure that actually does the caching in dbCache is treap.Immutable, and it is the core of dbCache. Btcd's treap comes in both an Immutable and a Mutable version. Next, we analyze the treap implementation. Readers unfamiliar with treaps can start with BYVoid's article 《隨機平衡二叉查找樹Treap的分析與應用》 (an analysis and applications of the randomized balanced binary search tree, the treap). Briefly, a treap combines a binary search tree with a heap: to keep itself balanced dynamically, each BST node carries a random value used to order the nodes as a heap, so the binary search tree simultaneously forms a max-heap or min-heap, which guarantees balance. Treap lookups take O(log N) time. For reasons of space we do not walk through all of the treap code; we mainly analyze the Put() methods of Mutable and Immutable to understand how a treap is built, how it rotates after inserting a node, and how Immutable implements copy-on-write.

Let's first look at the definitions of Immutable and Mutable:

//btcd/database/internal/treap/mutable.go

// Mutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance.  Search, insert, and delete
// operations are all O(log n).
type Mutable struct {
    root  *treapNode
    count int

    // totalSize is the best estimate of the total size of of all data in
    // the treap including the keys, values, and node sizes.
    totalSize uint64
}


//btcd/database/internal/treap/immutable.go

// Immutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance.  Search, insert, and delete
// operations are all O(log n).  In addition, it provides O(1) snapshots for
// multi-version concurrency control (MVCC).
//
// All operations which result in modifying the treap return a new version of
// the treap with only the modified nodes updated.  All unmodified nodes are
// shared with the previous version.  This is extremely useful in concurrent
// applications since the caller only has to atomically replace the treap
// pointer with the newly returned version after performing any mutations.  All
// readers can simply use their existing pointer as a snapshot since the treap
// it points to is immutable.  This effectively provides O(1) snapshot
// capability with efficient memory usage characteristics since the old nodes
// only remain allocated until there are no longer any references to them.
type Immutable struct {
    root  *treapNode
    count int

    // totalSize is the best estimate of the total size of of all data in
    // the treap including the keys, values, and node sizes.
    totalSize uint64
}

Immutable and Mutable are defined identically; the difference is that Immutable provides copy-on-write, which we will see in their Put() methods. The root field points to the treap's root node, which is defined as:

//btcd/database/internal/treap/common.go

// treapNode represents a node in the treap.
type treapNode struct {
    key      []byte
    value    []byte
    priority int
    left     *treapNode
    right    *treapNode
}

A treapNode's key and value hold the node's payload; priority is the random correction value used to maintain the heap, also called the node's priority; left and right point to the roots of the left and right subtrees. We start with Mutable's Put() method to understand how a treap is built and how it rotates after an insertion:

//btcd/database/internal/treap/mutable.go

// Put inserts the passed key/value pair.
func (t *Mutable) Put(key, value []byte) {
    // Use an empty byte slice for the value when none was provided.  This
    // ultimately allows key existence to be determined from the value since
    // an empty byte slice is distinguishable from nil.
    if value == nil {
        value = emptySlice
    }

    // The node is the root of the tree if there isn't already one.
    if t.root == nil {                                                   (1)
        node := newTreapNode(key, value, rand.Int())
        t.count = 1
        t.totalSize = nodeSize(node)
        t.root = node
        return
    }

    // Find the binary tree insertion point and construct a list of parents
    // while doing so.  When the key matches an entry already in the treap,
    // just update its value and return.
    var parents parentStack
    var compareResult int
    for node := t.root; node != nil; {
        parents.Push(node)
        compareResult = bytes.Compare(key, node.key)
        if compareResult < 0 {
            node = node.left                                            (2)
            continue
        }
        if compareResult > 0 {
            node = node.right                                           (3)
            continue
        }

        // The key already exists, so update its value.
        t.totalSize -= uint64(len(node.value))
        t.totalSize += uint64(len(value))
        node.value = value                                              (4)
        return
    }

    // Link the new node into the binary tree in the correct position.
    node := newTreapNode(key, value, rand.Int())                        (5)
    t.count++
    t.totalSize += nodeSize(node)
    parent := parents.At(0)
    if compareResult < 0 {
        parent.left = node                                              (6)
    } else {
        parent.right = node                                             (7)
    }

    // Perform any rotations needed to maintain the min-heap.
    for parents.Len() > 0 {
        // There is nothing left to do when the node's priority is
        // greater than or equal to its parent's priority.
        parent = parents.Pop()
        if node.priority >= parent.priority {                           (8)
            break
        }

        // Perform a right rotation if the node is on the left side or
        // a left rotation if the node is on the right side.
        if parent.left == node {
            node.right, parent.left = parent, node.right                (9)
        } else {
            node.left, parent.right = parent, node.left                 (10)  
        }
        t.relinkGrandparent(node, parent, parents.At(0))
    }
}

......

// relinkGrandparent relinks the node into the treap after it has been rotated
// by changing the passed grandparent's left or right pointer, depending on
// where the old parent was, to point at the passed node.  Otherwise, when there
// is no grandparent, it means the node is now the root of the tree, so update
// it accordingly.
func (t *Mutable) relinkGrandparent(node, parent, grandparent *treapNode) {
    // The node is now the root of the tree when there is no grandparent.
    if grandparent == nil {
        t.root = node                                                   (11)
        return
    }

    // Relink the grandparent's left or right pointer based on which side
    // the old parent was.
    if grandparent.left == parent {
        grandparent.left = node                                         (12)
    } else {
        grandparent.right = node                                        (13)
    }
}

Its main steps are:

  1. In an empty tree, the first node inserted directly becomes the root, as at (1); note that a node's priority is a random integer produced by rand.Int();
  2. In a non-empty tree, the insertion point is located by key, and the search path is recorded in a parentStack. Starting from the root, if the key to insert is less than the node's key, the search continues in the left subtree, as at (2); if it is greater, the search continues in the right subtree, as at (3); and if it equals the current node's key, the node's value is simply updated, as at (4);
  3. If the key is not found, a new node must be inserted, and the last node pushed onto parents is its parent; note that parents.At(0) is the last node on the search path. If the new key is less than the parent's key, the new node becomes the parent's left child, as at (6); otherwise it becomes the right child, as at (7);
  4. Because the new node's priority is random, the tree may no longer satisfy the min-heap property after the insertion, so rotations follow. Rotation recurses upward until the whole tree is back in min-heap order. At (8), if the new node's priority is greater than or equal to its parent's, no rotation is needed and the tree already satisfies min-heap order; if it is less, a rotation must turn the parent into a child of the new node. If the new node is the parent's left child, a right rotation is performed, as at (9); if it is the right child, a left rotation, as at (10);
  5. After a left or right rotation, the old parent has become a child of the new node, but the grandparent (the old parent's parent) still points at the old parent; relinkGrandparent() completes the rotation. If the grandparent is nil, the old parent was the root of the tree, and the new node simply becomes the new root, as at (11); (12) and (13) put the new node in the old parent's place as the grandparent's left or right child;
  6. Once the new node, the old parent, and the grandparent are relinked, the new node has become the parent and the old parent a child, while the grandparent is unchanged; but the new node's priority may still be less than the grandparent's, in which case the new parent, the grandparent, and the grandparent's parent must rotate again. This recurses up toward the root until every node on the search path satisfies min-heap order, which completes the rotation and the insertion. A small usage sketch follows this list.
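
As a quick illustration of the Mutable API, here is a usage sketch. Note that internal/treap is an internal package, so the import compiles only from within the btcd module, and the constructor name is assumed from how ffldb builds its pending treaps; Put and ForEach appear in the code quoted earlier:

package main

import (
    "fmt"

    // internal package: importable only from within the btcd module; it is
    // shown here purely for illustration.
    "github.com/btcsuite/btcd/database/internal/treap"
)

func main() {
    // NewMutable is assumed from ffldb's usage of the package.
    t := treap.NewMutable()

    // Priorities are assigned randomly inside Put, so the tree stays
    // balanced in expectation regardless of the insertion order.
    for _, k := range []string{"banana", "apple", "cherry"} {
        t.Put([]byte(k), []byte("v-"+k))
    }

    fmt.Printf("%s\n", t.Get([]byte("apple"))) // v-apple

    // ForEach visits the keys in sorted (in-order) order.
    t.ForEach(func(k, v []byte) bool {
        fmt.Printf("%s=%s\n", k, v)
        return true
    })
}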

Mutable's Put() gives us the complete picture of building a treap, inserting into it, and rotating the affected subtrees. Immutable's Put() follows roughly the same steps, but instead of modifying nodes or rotating the original tree in place, it copies every node on the search path and combines the copies with the remaining nodes of the original tree into a new tree; the update or rotations are performed on the new tree, which is then returned. Its implementation:

//btcd/database/internal/treap/immutable.go

// Put inserts the passed key/value pair.
func (t *Immutable) Put(key, value []byte) *Immutable {
    // Use an empty byte slice for the value when none was provided.  This
    // ultimately allows key existence to be determined from the value since
    // an empty byte slice is distinguishable from nil.
    if value == nil {
        value = emptySlice
    }

    // The node is the root of the tree if there isn't already one.
    if t.root == nil {
        root := newTreapNode(key, value, rand.Int())
        return newImmutable(root, 1, nodeSize(root))                     (1)
    }

    // Find the binary tree insertion point and construct a replaced list of
    // parents while doing so.  This is done because this is an immutable
    // data structure so regardless of where in the treap the new key/value
    // pair ends up, all ancestors up to and including the root need to be
    // replaced.
    //
    // When the key matches an entry already in the treap, replace the node
    // with a new one that has the new value set and return.
    var parents parentStack
    var compareResult int
    for node := t.root; node != nil; {
        // Clone the node and link its parent to it if needed.
        nodeCopy := cloneTreapNode(node)
        if oldParent := parents.At(0); oldParent != nil {
            if oldParent.left == node {
                oldParent.left = nodeCopy                               (2)
            } else {
                oldParent.right = nodeCopy                              (3)
            }
        }
        parents.Push(nodeCopy)                                          (4)

        // Traverse left or right depending on the result of comparing
        // the keys.
        compareResult = bytes.Compare(key, node.key)
        if compareResult < 0 {
            node = node.left
            continue
        }
        if compareResult > 0 {
            node = node.right
            continue
        }

        // The key already exists, so update its value.
        nodeCopy.value = value                                          (5)

        // Return new immutable treap with the replaced node and
        // ancestors up to and including the root of the tree.
        newRoot := parents.At(parents.Len() - 1)                        (6)
        newTotalSize := t.totalSize - uint64(len(node.value)) +         (7)
            uint64(len(value))
        return newImmutable(newRoot, t.count, newTotalSize)             (8)
    }

    // Link the new node into the binary tree in the correct position.
    node := newTreapNode(key, value, rand.Int())
    parent := parents.At(0)
    if compareResult < 0 {
        parent.left = node
    } else {
        parent.right = node
    }

    // Perform any rotations needed to maintain the min-heap and replace
    // the ancestors up to and including the tree root.
    newRoot := parents.At(parents.Len() - 1)
    for parents.Len() > 0 {
        // There is nothing left to do when the node's priority is
        // greater than or equal to its parent's priority.
        parent = parents.Pop()
        if node.priority >= parent.priority {
            break
        }

        // Perform a right rotation if the node is on the left side or
        // a left rotation if the node is on the right side.
        if parent.left == node {
            node.right, parent.left = parent, node.right
        } else {
            node.left, parent.right = parent, node.left
        }

        // Either set the new root of the tree when there is no
        // grandparent or relink the grandparent to the node based on
        // which side the old parent the node is replacing was on.
        grandparent := parents.At(0)
        if grandparent == nil {
            newRoot = node
        } else if grandparent.left == parent {
            grandparent.left = node
        } else {
            grandparent.right = node
        }
    }

    return newImmutable(newRoot, t.count+1, t.totalSize+nodeSize(node))  (9)
}

The main differences from Mutable's Put() are:

  1. When a node is inserted into an empty tree, rather than making the new node the root of the original tree, a new treap is created with the new node as its root and returned, as at (1);
  2. While searching for the insertion point, every node on the search path is copied, as at (2), (3), and (4). If the key is found, the value is updated on the copied node rather than on the original, as at (5); after the update, a new tree is created from the copied root and returned, as at (6), (7), and (8);
  3. Otherwise, if the key is not in the tree, a new node is added as a child of the copied parent node, the rotations are performed on the copied tree, and the new tree is returned, as at (9). Note that none of the original tree's nodes are modified: the original tree and the new one share all nodes off the search path. A short demonstration follows.
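
The copy-on-write semantics are easy to observe from the caller's side: every Put() returns a new version while older pointers keep seeing their old contents. As before, the internal import is illustrative only, and Has/Get are the accessors a dbCacheSnapshot relies on:

package main

import (
    "fmt"

    // internal package: shown for illustration, importable only inside btcd.
    "github.com/btcsuite/btcd/database/internal/treap"
)

func main() {
    v1 := treap.NewImmutable()

    // Every Put returns a new version of the treap; older versions are
    // never modified, exactly the property dbCache's snapshots rely on.
    v2 := v1.Put([]byte("k"), []byte("old"))
    v3 := v2.Put([]byte("k"), []byte("new"))

    fmt.Println(v1.Has([]byte("k")))        // false: v1 is still empty
    fmt.Printf("%s\n", v2.Get([]byte("k"))) // old
    fmt.Printf("%s\n", v3.Get([]byte("k"))) // new
}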

Immutable's Put() thus implements copy-on-write by copying the nodes along the search path and returning a new tree root, which in turn underpins dbCache's MVCC. This concludes our walk through ffldb's machinery; having analyzed blockStore, dbCache, and the treap structure dbCache relies on in detail, you should now have a complete and clear picture of how a Bitcoin node looks blocks up and stores them on disk. In the next article we will turn to Btcd's implementation of the network protocol and show how blocks propagate through the P2P network.
