Having covered BoltDB, we now return to the btcd/database source code. With BoltDB's implementation understood, the interface definitions in btcd/database and the way they are called become easy to follow. Note, however, that the database package does not implement a database itself: it is btcd's storage framework, which lets btcd support multiple database backends, and ffldb is the default database the package provides. After cloning the code, you will find that the database package mainly contains:
- cmd/dbtool: implements a tool for reading and writing blocks in db files;
- ffldb: implements the default database driver, modeling its DB, Bucket, Tx, and so on after BoltDB;
- internal/treap: a treap implementation, used to cache metadata;
- testdata: db files used for testing;
- driver.go: defines the Driver type and the methods for registering a driver and opening a database;
- interface.go: defines the DB, Bucket, Tx, Cursor, and other interfaces, nearly identical to their definitions in BoltDB;
- error.go: defines the error codes in the database package and their corresponding message strings;
- doc.go: the documentation for the database package;
- driver_test.go, error_test.go, example_test.go, export_test.go: the corresponding test files.
It should be noted that ffldb is not a database in the strict sense: it uses leveldb to store metadata and flat files to store blocks. For metadata storage, ffldb follows BoltDB's design, supporting Buckets and nested child Buckets; for reading and writing blocks and metadata, it also implements a similar Transaction. Notably, when storing metadata through leveldb, ffldb adds a caching layer to improve read/write efficiency. Its basic architecture is shown in the figure below:
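Before the interfaces, here is a minimal sketch of creating and using an ffldb-backed database through the database package; the import paths follow btcd's layout and the on-disk path is illustrative:

//example: opening an ffldb-backed database (illustrative, not part of btcd)
package main

import (
    "fmt"

    "github.com/btcsuite/btcd/database"
    _ "github.com/btcsuite/btcd/database/ffldb" // registers the "ffldb" driver
    "github.com/btcsuite/btcd/wire"
)

func main() {
    // ffldb takes the directory for its flat files and metadata, plus the
    // Bitcoin network the blocks belong to.
    db, err := database.Create("ffldb", "/tmp/exampledb", wire.MainNet)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer db.Close()

    fmt.Println("database type:", db.Type()) // prints "ffldb"
}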
Let's first look at the definition of the DB interface in the database package:
//btcd/database/interface.go
type DB interface {
// Type returns the database driver type the current database instance
// was created with.
Type() string
......
Begin(writable bool) (Tx, error)
......
View(fn func(tx Tx) error) error
......
Update(fn func(tx Tx) error) error
......
Close() error
}
As you can see, these definitions are almost identical to BoltDB's. In fact, the Bucket and Cursor interfaces are likewise similar to BoltDB's; the Tx interface differs somewhat because it adds operations on metadata and blocks:
//btcd/database/interface.go
// Tx represents a database transaction. It can either be read-only or
// read-write. The transaction provides a metadata bucket against which all
// read and writes occur.
//
// As would be expected with a transaction, no changes will be saved to the
// database until it has been committed. The transaction will only provide a
// view of the database at the time it was created. Transactions should not be
// long running operations.
type Tx interface {
// Metadata returns the top-most bucket for all metadata storage.
Metadata() Bucket
......
StoreBlock(block *btcutil.Block) error
......
HasBlock(hash *chainhash.Hash) (bool, error)
......
HasBlocks(hashes []chainhash.Hash) ([]bool, error)
......
FetchBlockHeader(hash *chainhash.Hash) ([]byte, error)
......
FetchBlockHeaders(hashes []chainhash.Hash) ([][]byte, error)
......
FetchBlock(hash *chainhash.Hash) ([]byte, error)
......
FetchBlocks(hashes []chainhash.Hash) ([][]byte, error)
......
FetchBlockRegion(region *BlockRegion) ([]byte, error)
......
FetchBlockRegions(regions []BlockRegion) ([][]byte, error)
// ******************************************************************
// Methods related to both atomic metadata storage and block storage.
// ******************************************************************
......
Commit() error
......
Rollback() error
}
For reasons of space we have omitted the comments on each method; readers can consult the source file. From its definition, Tx provides three groups of methods:
- Metadata(), which returns the root Bucket. All metadata belongs to some Bucket, and Buckets and the K/V pairs inside them are ultimately stored in leveldb. Within a Transaction, metadata is always manipulated by first obtaining a Bucket via Metadata() and then operating on that Bucket;
- the XxxBlockXxx methods, which relate to block operations and read or write blocks mainly through flat files;
- Commit() and Rollback(). After writing metadata or blocks in a writable Tx, you must either call Commit() to commit the changes and close the Tx, or call Rollback() to discard them; Rollback() also closes a read-only Tx. Their roles are the same as in BoltDB (see the sketch after this list).
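The following sketch shows the three groups of methods in use; db is a database.DB and block a *btcutil.Block obtained elsewhere, and the bucket name is illustrative:

//example: using a managed read-write Tx (illustrative, not part of btcd)
func storeExample(db database.DB, block *btcutil.Block) error {
    return db.Update(func(tx database.Tx) error {
        // 1) Metadata: create a bucket under the root and write a key into it.
        bkt, err := tx.Metadata().CreateBucketIfNotExists([]byte("mybucket"))
        if err != nil {
            return err
        }
        if err := bkt.Put([]byte("k"), []byte("v")); err != nil {
            return err
        }
        // 2) Block storage: queue the block to be written on commit.
        return tx.StoreBlock(block)
    })
    // 3) Update() commits when the callback returns nil and rolls back on
    // error; Commit()/Rollback() are only called explicitly with Begin().
}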
ffldb implements all of the interfaces above, and we now focus on its code, starting with its db type definition:
//btcd/database/ffldb/db.go
// db represents a collection of namespaces which are persisted and implements
// the database.DB interface. All database access is performed through
// transactions which are obtained through the specific Namespace.
type db struct {
writeLock sync.Mutex // Limit to one write transaction at a time.
closeLock sync.RWMutex // Make database close block while txns active.
closed bool // Is the database closed?
store *blockStore // Handles read/writing blocks to flat files.
cache *dbCache // Cache layer which wraps underlying leveldb DB.
}
The fields are:
- writeLock: a mutex ensuring there is at most one writable transaction at a time;
- closeLock: ensures that all open transactions have finished before the database closes;
- closed: indicates whether the database has been closed;
- store: the blockStore, used for reading and writing blocks;
- cache: the dbCache, used for reading and writing metadata.
db implements the database.DB interface, and its methods work much as in BoltDB: the callback of View() or Update() receives the Tx object (or a reference to it), through whose interface the database operations are then performed. We therefore skip the implementations of db's methods and concentrate on Tx. ffldb's transaction type, which implements database.Tx, is defined as follows:
//btcd/database/ffldb/db.go
// transaction represents a database transaction. It can either be read-only or
// read-write and implements the database.Tx interface. The transaction
// provides a root bucket against which all read and writes occur.
type transaction struct {
managed bool // Is the transaction managed?
closed bool // Is the transaction closed?
writable bool // Is the transaction writable?
db *db // DB instance the tx was created from.
snapshot *dbCacheSnapshot // Underlying snapshot for txns.
metaBucket *bucket // The root metadata bucket.
blockIdxBucket *bucket // The block index bucket.
// Blocks that need to be stored on commit. The pendingBlocks map is
// kept to allow quick lookups of pending data by block hash.
pendingBlocks map[chainhash.Hash]int
pendingBlockData []pendingBlock
// Keys that need to be stored or deleted on commit.
pendingKeys *treap.Mutable
pendingRemove *treap.Mutable
// Active iterators that need to be notified when the pending keys have
// been updated so the cursors can properly handle updates to the
// transaction state.
activeIterLock sync.RWMutex
activeIters []*treap.Iterator
}
The fields are:
- managed: whether the transaction is managed by db; a managed transaction must not call Commit() or Rollback() itself;
- closed: indicates whether the transaction has finished;
- writable: indicates whether the transaction is writable;
- db: the db object the transaction is bound to;
- snapshot: the snapshot of the metadata cache seen by this transaction, taken from the dbCache when the transaction is opened; it is part of the MVCC mechanism for metadata storage, comparable to reading the meta page in BoltDB;
- metaBucket: the root Bucket for metadata;
- blockIdxBucket: the Bucket that maps block hashes to block numbers; it is the first child Bucket of metaBucket and is used only inside ffldb;
- pendingBlocks: maps each pending block's hash to its position in pendingBlockData;
- pendingBlockData: the serialized bytes of all pending blocks, in order;
- pendingKeys: the set of metadata keys to add or update; note that it points to a treap;
- pendingRemove: the set of metadata keys to delete; it also points to a treap, and like pendingKeys it is pushed to leveldb through the dbCache;
- activeIterLock: protects activeIters;
- activeIters: the iterators currently traversing the dbCache in this transaction; when keys are updated in the dbCache, treap rotations change the relationships between nodes, so all active iterators must be reset.
We said the transaction offers three main groups of methods; let's start with Metadata():
//btcd/database/ffldb/db.go
// Metadata returns the top-most bucket for all metadata storage.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) Metadata() database.Bucket {
return tx.metaBucket
}
As you can see, it simply returns the root Bucket, through which all further operations proceed. Let's look at the bucket type, which implements database.Bucket:
//btcd/database/ffldb/db.go
// bucket is an internal type used to represent a collection of key/value pairs
// and implements the database.Bucket interface.
type bucket struct {
tx *transaction
id [4]byte
}
Note that although ffldb's bucket and BoltDB's Bucket share the same interface, the underlying data structures that actually store the K/V pairs differ, and so do the bucket's definition and lookup methods. ffldb stores K/V pairs in leveldb, whose underlying data structure is an LSM tree (log-structured merge-tree), whereas BoltDB uses a B+Tree. ffldb reads and writes K/V pairs through the interfaces leveldb provides, yet leveldb has no notion of a Bucket, nor any way to organize keys hierarchically. How, then, does ffldb implement buckets? CreateBucket() shows the approach:
//btcd/database/ffldb/db.go
// CreateBucket creates and returns a new nested bucket with the given key.
//
// Returns the following errors as required by the interface contract:
// - ErrBucketExists if the bucket already exists
// - ErrBucketNameRequired if the key is empty
// - ErrIncompatibleValue if the key is otherwise invalid for the particular
// implementation
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) CreateBucket(key []byte) (database.Bucket, error) {
......
// Ensure bucket does not already exist.
bidxKey := bucketIndexKey(b.id, key)
......
// Find the appropriate next bucket ID to use for the new bucket. In
// the case of the special internal block index, keep the fixed ID.
var childID [4]byte
if b.id == metadataBucketID && bytes.Equal(key, blockIdxBucketName) {
childID = blockIdxBucketID
} else {
var err error
childID, err = b.tx.nextBucketID()
if err != nil {
return nil, err
}
}
// Add the new bucket to the bucket index.
if err := b.tx.putKey(bidxKey, childID[:]); err != nil {
str := fmt.Sprintf("failed to create bucket with key %q", key)
return nil, convertErr(str, err)
}
return &bucket{tx: b.tx, id: childID}, nil
}
The code above mainly does three things:
- builds the child Bucket's key via bucketIndexKey();
- assigns or selects an id for the child Bucket;
- stores the child Bucket's key and id as a K/V record in the parent Bucket, much as BoltDB does.
Where BoltDB marks a Bucket with a flag on the K/V pair, ffldb marks a Bucket through the format of the key itself:
//btcd/database/ffldb/db.go
// bucketIndexKey returns the actual key to use for storing and retrieving a
// child bucket in the bucket index. This is required because additional
// information is needed to distinguish nested buckets with the same name.
func bucketIndexKey(parentID [4]byte, key []byte) []byte {
// The serialized bucket index key format is:
// <bucketindexprefix><parentbucketid><bucketname>
indexKey := make([]byte, len(bucketIndexPrefix)+4+len(key))
copy(indexKey, bucketIndexPrefix)
copy(indexKey[len(bucketIndexPrefix):], parentID[:])
copy(indexKey[len(bucketIndexPrefix)+4:], key)
return indexKey
}
As you can see, a child Bucket's key always has the form "<bucketindexprefix><parentbucketid><bucketname>"; conversely, any key of this form corresponds to a child Bucket, and its value records the child Bucket's id. In other words, ffldb encodes the parent-child relationship in the hierarchical form of Bucket keys. In BoltDB, by contrast, a child Bucket corresponds to its own B+Tree, and adding a K/V pair to a child Bucket means inserting a record into that tree. How does ffldb add K/V pairs to a child Bucket, or, put the other way, how does it determine which Bucket a K/V pair belongs to? Consider bucket's Put() method:
//btcd/database/ffldb/db.go
// Put saves the specified key/value pair to the bucket. Keys that do not
// already exist are added and keys that already exist are overwritten.
//
// Returns the following errors as required by the interface contract:
// - ErrKeyRequired if the key is empty
// - ErrIncompatibleValue if the key is the same as an existing bucket
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) Put(key, value []byte) error {
......
return b.tx.putKey(bucketizedKey(b.id, key), value)
}
The key is, again, the key itself: when a record is added to a Bucket, the key is first transformed by bucketizedKey():
//btcd/database/ffldb/db.go
// bucketizedKey returns the actual key to use for storing and retrieving a key
// for the provided bucket ID. This is required because bucketizing is handled
// through the use of a unique prefix per bucket.
func bucketizedKey(bucketID [4]byte, key []byte) []byte {
// The serialized block index key format is:
// <bucketid><key>
bKey := make([]byte, 4+len(key))
copy(bKey, bucketID[:])
copy(bKey[4:], key)
return bKey
}
That is, when a K/V pair is added to a bucket, its key is converted to the form "<bucketid><key>", marking the record as belonging to the bucket whose id is "<bucketid>". These two hierarchical key formats are how ffldb marks child Buckets and the K/V pairs inside them; by the time K/V pairs are written to leveldb there is no bucket concept at all, and all keys live in one flat keyspace. The cursor bound to a bucket is likewise implemented on top of leveldb's Iterator; we will not analyze it separately, and interested readers can do so on their own. The following standalone example prints both key forms for a hypothetical bucket hierarchy; the bucketIndexPrefix value and the child bucket id are illustrative assumptions:
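//example: the two key layouts for a hypothetical hierarchy (illustrative)
package main

import "fmt"

var bucketIndexPrefix = []byte("bidx") // assumed prefix of bucket index keys

func bucketIndexKey(parentID [4]byte, key []byte) []byte {
    indexKey := make([]byte, len(bucketIndexPrefix)+4+len(key))
    copy(indexKey, bucketIndexPrefix)
    copy(indexKey[len(bucketIndexPrefix):], parentID[:])
    copy(indexKey[len(bucketIndexPrefix)+4:], key)
    return indexKey
}

func bucketizedKey(bucketID [4]byte, key []byte) []byte {
    bKey := make([]byte, 4+len(key))
    copy(bKey, bucketID[:])
    copy(bKey[4:], key)
    return bKey
}

func main() {
    rootID := [4]byte{}            // the root metadata bucket id
    childID := [4]byte{0, 0, 0, 9} // an illustrative child bucket id

    // Creating child bucket "mybucket" under the root stores this record:
    fmt.Printf("%q -> child id\n", bucketIndexKey(rootID, []byte("mybucket")))
    // "bidx\x00\x00\x00\x00mybucket" -> child id

    // Put("k", "v") inside that child bucket stores this flat record:
    fmt.Printf("%q -> %q\n", bucketizedKey(childID, []byte("k")), "v")
    // "\x00\x00\x00\tk" -> "v"
}

Also, as bucket's Put() showed, an added K/V pair is handed to the transaction's putKey() method, which first places it into pendingKeys: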
//btcd/database/ffldb/db.go
// putKey adds the provided key to the list of keys to be updated in the
// database when the transaction is committed.
//
// NOTE: This function must only be called on a writable transaction. Since it
// is an internal helper function, it does not check.
func (tx *transaction) putKey(key, value []byte) error {
// Prevent the key from being deleted if it was previously scheduled
// to be deleted on transaction commit.
tx.pendingRemove.Delete(key)
// Add the key/value pair to the list to be written on transaction
// commit.
tx.pendingKeys.Put(key, value)
tx.notifyActiveIters()
return nil
}
Similarly, bucket's Delete() is implemented by calling the transaction's deleteKey() method, which adds the key to pendingRemove; when the transaction commits, the keys in pendingKeys are written to leveldb and the keys in pendingRemove are deleted from it. bucket's Get() ultimately calls the transaction's fetchKey(), which searches pendingRemove and pendingKeys first and, failing that, a snapshot of the dbCache.
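A minimal sketch of this lookup order (the actual fetchKey() also skips the pending treaps for read-only transactions, and details may differ):

//sketch: transaction.fetchKey()'s lookup order (simplified)
func (tx *transaction) fetchKeySketch(key []byte) []byte {
    if tx.writable {
        // A key deleted in this transaction no longer exists.
        if tx.pendingRemove.Has(key) {
            return nil
        }
        // A key written in this transaction shadows the snapshot.
        if value := tx.pendingKeys.Get(key); value != nil {
            return value
        }
    }
    // Fall back to the dbCache snapshot taken when the tx was opened.
    return tx.snapshot.Get(key)
}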
The second group of transaction methods deals with reading and writing blocks. We mainly analyze StoreBlock() and FetchBlock(), starting with StoreBlock():
//btcd/database/ffldb/db.go
// StoreBlock stores the provided block into the database. There are no checks
// to ensure the block connects to a previous block, contains double spends, or
// any additional functionality such as transaction indexing. It simply stores
// the block in the database.
//
// Returns the following errors as required by the interface contract:
// - ErrBlockExists when the block hash already exists
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) StoreBlock(block *btcutil.Block) error {
......
// Reject the block if it already exists.
blockHash := block.Hash()
......
blockBytes, err := block.Bytes()
......
// Add the block to be stored to the list of pending blocks to store
// when the transaction is committed. Also, add it to pending blocks
// map so it is easy to determine the block is pending based on the
// block hash.
if tx.pendingBlocks == nil {
tx.pendingBlocks = make(map[chainhash.Hash]int)
}
tx.pendingBlocks[*blockHash] = len(tx.pendingBlockData)
tx.pendingBlockData = append(tx.pendingBlockData, pendingBlock{
hash: blockHash,
bytes: blockBytes,
})
log.Tracef("Added block %s to pending blocks", blockHash)
return nil
}
As you can see, StoreBlock() essentially appends the block to pendingBlockData, to be written to a file on Commit. Now FetchBlock():
//btcd/database/ffldb/db.go
// FetchBlock returns the raw serialized bytes for the block identified by the
// given hash. The raw bytes are in the format returned by Serialize on a
// wire.MsgBlock.
//
// Returns the following errors as required by the interface contract:
// - ErrBlockNotFound if the requested block hash does not exist
// - ErrTxClosed if the transaction has already been closed
// - ErrCorruption if the database has somehow become corrupted
//
// In addition, returns ErrDriverSpecific if any failures occur when reading the
// block files.
//
// NOTE: The data returned by this function is only valid during a database
// transaction. Attempting to access it after a transaction has ended results
// in undefined behavior. This constraint prevents additional data copies and
// allows support for memory-mapped database implementations.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) FetchBlock(hash *chainhash.Hash) ([]byte, error) {
......
// When the block is pending to be written on commit return the bytes
// from there.
if idx, exists := tx.pendingBlocks[*hash]; exists {
return tx.pendingBlockData[idx].bytes, nil
}
// Lookup the location of the block in the files from the block index.
blockRow, err := tx.fetchBlockRow(hash)
if err != nil {
return nil, err
}
location := deserializeBlockLoc(blockRow)
// Read the block from the appropriate location. The function also
// performs a checksum over the data to detect data corruption.
blockBytes, err := tx.db.store.readBlock(hash, location)
if err != nil {
return nil, err
}
return blockBytes, nil
}
When reading a block, pendingBlocks is consulted first; on a hit, the bytes are returned straight from pendingBlockData. Otherwise the block is read through the db's blockStore. We postpone blockStore until after the transaction's Commit. The key observation is that reading and writing metadata or blocks through a transaction always goes through pendingBlocks, or pendingKeys and pendingRemove, first; these act as the transaction's buffers and are synchronized to the files or to leveldb on Commit. Commit() ultimately calls writePendingAndCommit() to do the actual work:
//btcd/database/ffldb/db.go
// writePendingAndCommit writes pending block data to the flat block files,
// updates the metadata with their locations as well as the new current write
// location, and commits the metadata to the memory database cache. It also
// properly handles rollback in the case of failures.
//
// This function MUST only be called when there is pending data to be written.
func (tx *transaction) writePendingAndCommit() error {
......
// Loop through all of the pending blocks to store and write them.
for _, blockData := range tx.pendingBlockData {
log.Tracef("Storing block %s", blockData.hash)
location, err := tx.db.store.writeBlock(blockData.bytes)
if err != nil {
rollback()
return err
}
// Add a record in the block index for the block. The record
// includes the location information needed to locate the block
// on the filesystem as well as the block header since they are
// so commonly needed.
blockHdr := blockData.bytes[0:blockHdrSize]
blockRow := serializeBlockRow(location, blockHdr)
err = tx.blockIdxBucket.Put(blockData.hash[:], blockRow)
if err != nil {
rollback()
return err
}
}
// Update the metadata for the current write file and offset.
writeRow := serializeWriteRow(wc.curFileNum, wc.curOffset)
if err := tx.metaBucket.Put(writeLocKeyName, writeRow); err != nil {
rollback()
return convertErr("failed to store write cursor", err)
}
// Atomically update the database cache. The cache automatically
// handles flushing to the underlying persistent storage database.
return tx.db.cache.commitTx(tx)
}
writePendingAndCommit() mainly:
- writes the blocks in pendingBlockData to files through the blockStore, and records each block's hash together with its location in the files in blockIdxBucket for later lookup (a sketch of this index record follows the list);
- updates the K/V pair in metaBucket that records the current write file and offset;
- hands the pending K/V pairs to dbCache's commitTx(), which writes them into the treap cache and, when necessary, into leveldb.
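The index record written to blockIdxBucket is the serialized block location followed by the block header; a sketch of the layout, assuming "encoding/binary" is imported and the 12-byte little-endian location format used in blockio.go:

//sketch: the blockIdxBucket record layout (illustrative)
func serializeBlockRowSketch(fileNum, offset, blockLen uint32, blockHdr []byte) []byte {
    // <blockFileNum(4)><fileOffset(4)><blockLen(4)><blockHeader>
    row := make([]byte, 12+len(blockHdr))
    binary.LittleEndian.PutUint32(row[0:4], fileNum)   // which flat file
    binary.LittleEndian.PutUint32(row[4:8], offset)    // offset inside that file
    binary.LittleEndian.PutUint32(row[8:12], blockLen) // length of the framed record
    copy(row[12:], blockHdr)                           // 80-byte header, kept for fast access
    return row
}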
blockStore
When a transaction reads or writes metadata or blocks, the work is ultimately done either by the blockStore reading and writing files or by the dbCache reading and writing the treaps and leveldb. So next we analyze blockStore and dbCache, starting with blockStore's definition:
//btcd/database/ffldb/blockio.go
// blockStore houses information used to handle reading and writing blocks (and
// part of blocks) into flat files with support for multiple concurrent readers.
type blockStore struct {
// network is the specific network to use in the flat files for each
// block.
network wire.BitcoinNet
// basePath is the base path used for the flat block files and metadata.
basePath string
// maxBlockFileSize is the maximum size for each file used to store
// blocks. It is defined on the store so the whitebox tests can
// override the value.
maxBlockFileSize uint32
// The following fields are related to the flat files which hold the
// actual blocks. The number of open files is limited by maxOpenFiles.
//
// obfMutex protects concurrent access to the openBlockFiles map. It is
// a RWMutex so multiple readers can simultaneously access open files.
//
// openBlockFiles houses the open file handles for existing block files
// which have been opened read-only along with an individual RWMutex.
// This scheme allows multiple concurrent readers to the same file while
// preventing the file from being closed out from under them.
//
// lruMutex protects concurrent access to the least recently used list
// and lookup map.
//
// openBlocksLRU tracks how the open files are referenced by pushing the
// most recently used files to the front of the list thereby trickling
// the least recently used files to the end of the list. When a file
// needs to be closed due to exceeding the max number of allowed open
// files, the one at the end of the list is closed.
//
// fileNumToLRUElem is a mapping between a specific block file number
// and the associated list element on the least recently used list.
//
// Thus, with the combination of these fields, the database supports
// concurrent non-blocking reads across multiple and individual files
// along with intelligently limiting the number of open file handles by
// closing the least recently used files as needed.
//
// NOTE: The locking order used throughout is well-defined and MUST be
// followed. Failure to do so could lead to deadlocks. In particular,
// the locking order is as follows:
// 1) obfMutex
// 2) lruMutex
// 3) writeCursor mutex
// 4) specific file mutexes
//
// None of the mutexes are required to be locked at the same time, and
// often aren't. However, if they are to be locked simultaneously, they
// MUST be locked in the order previously specified.
//
// Due to the high performance and multi-read concurrency requirements,
// write locks should only be held for the minimum time necessary.
obfMutex sync.RWMutex
lruMutex sync.Mutex
openBlocksLRU *list.List // Contains uint32 block file numbers.
fileNumToLRUElem map[uint32]*list.Element
openBlockFiles map[uint32]*lockableFile
// writeCursor houses the state for the current file and location that
// new blocks are written to.
writeCursor *writeCursor
// These functions are set to openFile, openWriteFile, and deleteFile by
// default, but are exposed here to allow the whitebox tests to replace
// them when working with mock files.
openFileFunc func(fileNum uint32) (*lockableFile, error)
openWriteFileFunc func(fileNum uint32) (filer, error)
deleteFileFunc func(fileNum uint32) error
}
The fields are:
- network: the network the blocks belong to, such as MainNet, TestNet, or SimNet; it is recorded with every block written to a file;
- basePath: the on-disk path where the block files are stored;
- maxBlockFileSize: the maximum size of each block file;
- obfMutex: a read-write lock protecting openBlockFiles;
- lruMutex: a mutex protecting openBlocksLRU and fileNumToLRUElem;
- openBlocksLRU: an LRU list of the numbers of the open files; the default maximum number of open files is 25;
- fileNumToLRUElem: maps file numbers to their elements in openBlocksLRU;
- openBlockFiles: maps the numbers of all open read-only files to their file handles;
- writeCursor: points at the file currently being written, recording its file number and write offset;
- openFileFunc, openWriteFileFunc, and deleteFileFunc: hooks for openFile, openWriteFile, and deleteFile, used mainly by tests; by default they are blockStore's corresponding methods. A sketch of the helper types appears after this list.
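For reference, the two helper types referred to above look roughly as follows (abridged from blockio.go; treat the details as approximate):

//btcd/database/ffldb/blockio.go (abridged)
// lockableFile pairs a file handle with a RWMutex so many readers can share
// a file while a writer can safely close or replace it.
type lockableFile struct {
    sync.RWMutex
    file filer // filer is a small interface over *os.File, used for tests
}

// writeCursor tracks the flat file currently being appended to and the
// offset at which the next block will be written.
type writeCursor struct {
    sync.RWMutex
    curFile    *lockableFile
    curFileNum uint32
    curOffset  uint32
}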
Let's use blockStore's readBlock() and writeBlock() methods to understand how it works, starting with readBlock():
//btcd/database/ffldb/blockio.go
// readBlock reads the specified block record and returns the serialized block.
// It ensures the integrity of the block data by checking that the serialized
// network matches the current network associated with the block store and
// comparing the calculated checksum against the one stored in the flat file.
// This function also automatically handles all file management such as opening
// and closing files as necessary to stay within the maximum allowed open files
// limit.
//
// Returns ErrDriverSpecific if the data fails to read for any reason and
// ErrCorruption if the checksum of the read data doesn't match the checksum
// read from the file.
//
// Format: <network><block length><serialized block><checksum>
func (s *blockStore) readBlock(hash *chainhash.Hash, loc blockLocation) ([]byte, error) {
// Get the referenced block file handle opening the file as needed. The
// function also handles closing files as needed to avoid going over the
// max allowed open files.
blockFile, err := s.blockFile(loc.blockFileNum)
if err != nil {
return nil, err
}
serializedData := make([]byte, loc.blockLen)
n, err := blockFile.file.ReadAt(serializedData, int64(loc.fileOffset))
blockFile.RUnlock()
if err != nil {
str := fmt.Sprintf("failed to read block %s from file %d, "+
"offset %d: %v", hash, loc.blockFileNum, loc.fileOffset,
err)
return nil, makeDbErr(database.ErrDriverSpecific, str, err)
}
// Calculate the checksum of the read data and ensure it matches the
// serialized checksum. This will detect any data corruption in the
// flat file without having to do much more expensive merkle root
// calculations on the loaded block.
serializedChecksum := binary.BigEndian.Uint32(serializedData[n-4:])
calculatedChecksum := crc32.Checksum(serializedData[:n-4], castagnoli)
if serializedChecksum != calculatedChecksum {
str := fmt.Sprintf("block data for block %s checksum "+
"does not match - got %x, want %x", hash,
calculatedChecksum, serializedChecksum)
return nil, makeDbErr(database.ErrCorruption, str, nil)
}
// The network associated with the block must match the current active
// network, otherwise somebody probably put the block files for the
// wrong network in the directory.
serializedNet := byteOrder.Uint32(serializedData[:4])
if serializedNet != uint32(s.network) {
str := fmt.Sprintf("block data for block %s is for the "+
"wrong network - got %d, want %d", hash, serializedNet,
uint32(s.network))
return nil, makeDbErr(database.ErrDriverSpecific, str, nil)
}
// The raw block excludes the network, length of the block, and
// checksum.
return serializedData[8 : n-4], nil
}
Its main steps are:
- obtain an already open file, or newly open one, via blockFile();
- read the block record at offset loc.fileOffset via file.ReadAt(); its format is "<network><block length><serialized block><checksum>";
- extract the block's byte stream from the record.
The most interesting part is obtaining the file handle through blockFile(), whose implementation is:
//btcd/database/ffldb/blockio.go
// blockFile attempts to return an existing file handle for the passed flat file
// number if it is already open as well as marking it as most recently used. It
// will also open the file when it's not already open subject to the rules
// described in openFile.
//
// NOTE: The returned block file will already have the read lock acquired and
// the caller MUST call .RUnlock() to release it once it has finished all read
// operations. This is necessary because otherwise it would be possible for a
// separate goroutine to close the file after it is returned from here, but
// before the caller has acquired a read lock.
func (s *blockStore) blockFile(fileNum uint32) (*lockableFile, error) {
// When the requested block file is open for writes, return it.
wc := s.writeCursor
wc.RLock()
if fileNum == wc.curFileNum && wc.curFile.file != nil {
obf := wc.curFile
obf.RLock()
wc.RUnlock()
return obf, nil
}
wc.RUnlock()
// Try to return an open file under the overall files read lock.
s.obfMutex.RLock()
if obf, ok := s.openBlockFiles[fileNum]; ok {
s.lruMutex.Lock()
s.openBlocksLRU.MoveToFront(s.fileNumToLRUElem[fileNum])
s.lruMutex.Unlock()
obf.RLock()
s.obfMutex.RUnlock()
return obf, nil
}
s.obfMutex.RUnlock()
// Since the file isn't open already, need to check the open block files
// map again under write lock in case multiple readers got here and a
// separate one is already opening the file.
s.obfMutex.Lock() (1)
if obf, ok := s.openBlockFiles[fileNum]; ok {
obf.RLock()
s.obfMutex.Unlock()
return obf, nil
}
// The file isn't open, so open it while potentially closing the least
// recently used one as needed.
obf, err := s.openFileFunc(fileNum)
if err != nil {
s.obfMutex.Unlock()
return nil, err
}
obf.RLock()
s.obfMutex.Unlock()
return obf, nil
}
Its main steps are:
- Check whether the requested file is the one writeCursor points to, and if so return it directly. Note that writeCursor is accessed under its read lock, and the lockableFile returned by blockFile() has already had its read lock acquired; the caller is responsible for releasing it. If the writeCursor file is returned, blocks are currently being appended to it; it will be closed once full, and the read lock guarantees the close must wait for reads to finish.
- Otherwise, look the file up in openBlockFiles; on a hit, move it to the front of the LRU list, acquire its read lock, and return it.
- At (1), the s.obfMutex write lock is acquired and openBlockFiles is searched again. This guards against the target file having been opened and added to openBlockFiles by another goroutine right after the first lookup; without this re-check, missing in openBlockFiles and then opening the file could open the same file more than once. Why not hold the s.obfMutex write lock for the first lookup too? Again, for concurrency: openBlockFiles holds recently opened files, so the first lookup is quite likely to hit, and protecting it with only the read lock lets many readers search openBlockFiles simultaneously. A distilled sketch of this double-checked pattern follows the list.
- If the target file is not in openBlockFiles, open it anew via openFile(); note that the entire openFile() call runs under the s.obfMutex write lock.
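Stripped of the LRU bookkeeping, this is the classic double-checked locking pattern under a RWMutex; a generic sketch (assumes import "os" and "sync"):

//sketch: check-then-recheck under a RWMutex (illustrative, not btcd code)
type fileCache struct {
    mu    sync.RWMutex
    files map[uint32]*os.File
}

func (c *fileCache) get(num uint32, open func(uint32) (*os.File, error)) (*os.File, error) {
    // Fast path: most lookups hit under the shared read lock, so many
    // readers can search concurrently.
    c.mu.RLock()
    if f, ok := c.files[num]; ok {
        c.mu.RUnlock()
        return f, nil
    }
    c.mu.RUnlock()
    // Slow path: re-check under the exclusive lock, because another
    // goroutine may have opened the file between RUnlock and Lock.
    c.mu.Lock()
    defer c.mu.Unlock()
    if f, ok := c.files[num]; ok {
        return f, nil
    }
    f, err := open(num)
    if err != nil {
        return nil, err
    }
    c.files[num] = f
    return f, nil
}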
//btcd/database/ffldb/blockio.go
// openFile returns a read-only file handle for the passed flat file number.
// The function also keeps track of the open files, performs least recently
// used tracking, and limits the number of open files to maxOpenFiles by closing
// the least recently used file as needed.
//
// This function MUST be called with the overall files mutex (s.obfMutex) locked
// for WRITES.
func (s *blockStore) openFile(fileNum uint32) (*lockableFile, error) {
// Open the appropriate file as read-only.
filePath := blockFilePath(s.basePath, fileNum)
file, err := os.Open(filePath)
if err != nil {
return nil, makeDbErr(database.ErrDriverSpecific, err.Error(),
err)
}
blockFile := &lockableFile{file: file}
// Close the least recently used file if the file exceeds the max
// allowed open files. This is not done until after the file open in
// case the file fails to open, there is no need to close any files.
//
// A write lock is required on the LRU list here to protect against
// modifications happening as already open files are read from and
// shuffled to the front of the list.
//
// Also, add the file that was just opened to the front of the least
// recently used list to indicate it is the most recently used file and
// therefore should be closed last.
s.lruMutex.Lock()
lruList := s.openBlocksLRU
if lruList.Len() >= maxOpenFiles {
lruFileNum := lruList.Remove(lruList.Back()).(uint32)
oldBlockFile := s.openBlockFiles[lruFileNum]
// Close the old file under the write lock for the file in case
// any readers are currently reading from it so it's not closed
// out from under them.
oldBlockFile.Lock()
_ = oldBlockFile.file.Close()
oldBlockFile.Unlock()
delete(s.openBlockFiles, lruFileNum)
delete(s.fileNumToLRUElem, lruFileNum)
}
s.fileNumToLRUElem[fileNum] = lruList.PushFront(fileNum)
s.lruMutex.Unlock()
// Store a reference to it in the open block files map.
s.openBlockFiles[fileNum] = blockFile
return blockFile, nil
}
openFile() mainly:
- opens the target file read-only with a plain os.Open() call;
- checks whether openBlocksLRU is full; if so, it removes the element at the tail of the list, closes the corresponding file and removes it from openBlockFiles, then pushes the newly opened file to the front of the list; all access to openBlocksLRU and fileNumToLRUElem happens under s.lruMutex;
- records the newly opened file in openBlockFiles.
As openFile() shows, blockStore maintains, via openBlockFiles together with openBlocksLRU and fileNumToLRUElem, an LRU cache of open read-only files, which speeds up reading blocks from files. Next, writeBlock():
//btcd/database/ffldb/blockio.go
// writeBlock appends the specified raw block bytes to the store's write cursor
// location and increments it accordingly. When the block would exceed the max
// file size for the current flat file, this function will close the current
// file, create the next file, update the write cursor, and write the block to
// the new file.
//
// The write cursor will also be advanced the number of bytes actually written
// in the event of failure.
//
// Format: <network><block length><serialized block><checksum>
func (s *blockStore) writeBlock(rawBlock []byte) (blockLocation, error) {
// Compute how many bytes will be written.
// 4 bytes each for block network + 4 bytes for block length +
// length of raw block + 4 bytes for checksum.
blockLen := uint32(len(rawBlock))
fullLen := blockLen + 12
// Move to the next block file if adding the new block would exceed the
// max allowed size for the current block file. Also detect overflow
// to be paranoid, even though it isn't possible currently, numbers
// might change in the future to make it possible.
//
// NOTE: The writeCursor.offset field isn't protected by the mutex
// since it's only read/changed during this function which can only be
// called during a write transaction, of which there can be only one at
// a time.
wc := s.writeCursor
finalOffset := wc.curOffset + fullLen
if finalOffset < wc.curOffset || finalOffset > s.maxBlockFileSize {
// This is done under the write cursor lock since the curFileNum
// field is accessed elsewhere by readers.
//
// Close the current write file to force a read-only reopen
// with LRU tracking. The close is done under the write lock
// for the file to prevent it from being closed out from under
// any readers currently reading from it.
wc.Lock()
wc.curFile.Lock() (1)
if wc.curFile.file != nil {
_ = wc.curFile.file.Close()
wc.curFile.file = nil
}
wc.curFile.Unlock()
// Start writes into next file.
wc.curFileNum++ (2)
wc.curOffset = 0 (3)
wc.Unlock()
}
// All writes are done under the write lock for the file to ensure any
// readers are finished and blocked first.
wc.curFile.Lock()
defer wc.curFile.Unlock()
// Open the current file if needed. This will typically only be the
// case when moving to the next file to write to or on initial database
// load. However, it might also be the case if rollbacks happened after
// file writes started during a transaction commit.
if wc.curFile.file == nil {
file, err := s.openWriteFileFunc(wc.curFileNum) (4)
if err != nil {
return blockLocation{}, err
}
wc.curFile.file = file
}
// Bitcoin network.
origOffset := wc.curOffset (5)
hasher := crc32.New(castagnoli)
var scratch [4]byte
byteOrder.PutUint32(scratch[:], uint32(s.network))
if err := s.writeData(scratch[:], "network"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(scratch[:])
// Block length.
byteOrder.PutUint32(scratch[:], blockLen)
if err := s.writeData(scratch[:], "block length"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(scratch[:])
// Serialized block.
if err := s.writeData(rawBlock[:], "block"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(rawBlock)
// Castagnoli CRC-32 as a checksum of all the previous.
if err := s.writeData(hasher.Sum(nil), "checksum"); err != nil {
return blockLocation{}, err
}
loc := blockLocation{ (6)
blockFileNum: wc.curFileNum,
fileOffset: origOffset,
blockLen: fullLen,
}
return loc, nil
}
Its main steps are:
- Check whether writing the block would exceed the file size limit; if so, close the current file and start a new one; otherwise, write the block at offset wc.curOffset of the current file.
- At (1), the file writeCursor points to is closed; the lockableFile's write lock is taken before Close() in case other goroutines are still reading the file.
- (2) advances writeCursor to the next file and (3) resets the in-file offset.
- At (4), openWriteFile() opens or creates a file in read-write mode and writeCursor is pointed at it.
- At (5), the block's starting offset within the file is recorded, and the block data is then written out.
- The network magic, the block length, the block data, and a crc32 checksum over the preceding three are written in turn, so the on-disk record format is "<network><block length><serialized block><checksum>"; a small framing sketch follows the list.
- At (6), the blockLocation for the block is built from the number of the file holding it, the record's starting offset within that file, and the framed record's length, and is returned.
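The same framing can be reproduced standalone; the Castagnoli table matches the castagnoli variable in blockio.go, the little-endian/big-endian split matches what readBlock() expects, and the network value is illustrative:

//example: reproducing writeBlock()'s record framing (illustrative)
package main

import (
    "encoding/binary"
    "fmt"
    "hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// frameBlock builds <network(4)><block length(4)><serialized block><checksum(4)>.
func frameBlock(network uint32, rawBlock []byte) []byte {
    record := make([]byte, 0, len(rawBlock)+12)
    var scratch [4]byte
    binary.LittleEndian.PutUint32(scratch[:], network)
    record = append(record, scratch[:]...)
    binary.LittleEndian.PutUint32(scratch[:], uint32(len(rawBlock)))
    record = append(record, scratch[:]...)
    record = append(record, rawBlock...)
    // The checksum covers the three fields written so far; crc32's digest is
    // appended big-endian, which is how readBlock() parses it.
    var sum [4]byte
    binary.BigEndian.PutUint32(sum[:], crc32.Checksum(record, castagnoli))
    return append(record, sum[:]...)
}

func main() {
    rec := frameBlock(0xD9B4BEF9, []byte("raw block bytes")) // mainnet magic
    fmt.Printf("record length: %d (payload + 12 bytes of framing)\n", len(rec))
}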
dbCache
readBlock() and writeBlock() give a good picture of how blockStore works: it manages the already open read-only files with an LRU list, and tracks the file currently being written, and the offset within it, with writeCursor; when writing a block would exceed the configured maximum file size, it rolls over to a new file. With that understood, the rest of blockStore's code is straightforward. We now turn to dbCache, starting with its definition:
//btcd/database/ffldb/dbcache.go
// dbCache provides a database cache layer backed by an underlying database. It
// allows a maximum cache size and flush interval to be specified such that the
// cache is flushed to the database when the cache size exceeds the maximum
// configured value or it has been longer than the configured interval since the
// last flush. This effectively provides transaction batching so that callers
// can commit transactions at will without incurring large performance hits due
// to frequent disk syncs.
type dbCache struct {
// ldb is the underlying leveldb DB for metadata.
ldb *leveldb.DB
// store is used to sync blocks to flat files.
store *blockStore
// The following fields are related to flushing the cache to persistent
// storage. Note that all flushing is performed in an opportunistic
// fashion. This means that it is only flushed during a transaction or
// when the database cache is closed.
//
// maxSize is the maximum size threshold the cache can grow to before
// it is flushed.
//
// flushInterval is the threshold interval of time that is allowed to
// pass before the cache is flushed.
//
// lastFlush is the time the cache was last flushed. It is used in
// conjunction with the current time and the flush interval.
//
// NOTE: These flush related fields are protected by the database write
// lock.
maxSize uint64
flushInterval time.Duration
lastFlush time.Time
// The following fields hold the keys that need to be stored or deleted
// from the underlying database once the cache is full, enough time has
// passed, or when the database is shutting down. Note that these are
// stored using immutable treaps to support O(1) MVCC snapshots against
// the cached data. The cacheLock is used to protect concurrent access
// for cache updates and snapshots.
cacheLock sync.RWMutex
cachedKeys *treap.Immutable
cachedRemove *treap.Immutable
}
The fields are:
- ldb: the leveldb DB object, used to store and retrieve K/V pairs in leveldb;
- store: the blockStore of the current db, used to force cached blocks out to disk before metadata is written to leveldb;
- maxSize: put simply, the size limit on the cached metadata waiting to be added and removed; the default is 100MB;
- flushInterval: the interval between flushes to leveldb;
- lastFlush: the time of the last flush to leveldb;
- cacheLock: protects reads and writes of cachedKeys and cachedRemove, which are updated when the dbCache flushes to leveldb and read when the dbCache is snapshotted;
- cachedKeys: caches the keys waiting to be added; it points to a treap;
- cachedRemove: caches the keys waiting to be deleted; it also points to a treap. Note the difference from the transaction's pendingKeys and pendingRemove: those are mutable treaps (*treap.Mutable), while cachedKeys and cachedRemove are immutable treaps (*treap.Immutable). In the usual case (when needsFlush() is false), pendingKeys and pendingRemove are first merged into cachedKeys and cachedRemove and only later pushed to leveldb; we will see this clearly in dbCache's commitTx(). treap.Mutable and treap.Immutable are covered at the end of this article.
We saw in the transaction's writePendingAndCommit() that the final step of a Commit is to call dbCache's commitTx() to commit the metadata updates, so let's examine commitTx() first:
//btcd/database/ffldb/dbcache.go
// commitTx atomically adds all of the pending keys to add and remove into the
// database cache. When adding the pending keys would cause the size of the
// cache to exceed the max cache size, or the time since the last flush exceeds
// the configured flush interval, the cache will be flushed to the underlying
// persistent database.
//
// This is an atomic operation with respect to the cache in that either all of
// the pending keys to add and remove in the transaction will be applied or none
// of them will.
//
// The database cache itself might be flushed to the underlying persistent
// database even if the transaction fails to apply, but it will only be the
// state of the cache without the transaction applied.
//
// This function MUST be called during a database write transaction which in
// turn implies the database write lock will be held.
func (c *dbCache) commitTx(tx *transaction) error {
// Flush the cache and write the current transaction directly to the
// database if a flush is needed.
if c.needsFlush(tx) { (1)
if err := c.flush(); err != nil { (2)
return err
}
// Perform all leveldb updates using an atomic transaction.
err := c.commitTreaps(tx.pendingKeys, tx.pendingRemove) (3)
if err != nil {
return err
}
// Clear the transaction entries since they have been committed.
tx.pendingKeys = nil
tx.pendingRemove = nil
return nil
}
// At this point a database flush is not needed, so atomically commit
// the transaction to the cache.
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
newCachedKeys := c.cachedKeys
newCachedRemove := c.cachedRemove
c.cacheLock.RUnlock()
// Apply every key to add in the database transaction to the cache.
tx.pendingKeys.ForEach(func(k, v []byte) bool { (5)
newCachedRemove = newCachedRemove.Delete(k)
newCachedKeys = newCachedKeys.Put(k, v)
return true
})
tx.pendingKeys = nil
// Apply every key to remove in the database transaction to the cache.
tx.pendingRemove.ForEach(func(k, v []byte) bool { (6)
newCachedKeys = newCachedKeys.Delete(k)
newCachedRemove = newCachedRemove.Put(k, nil)
return true
})
tx.pendingRemove = nil
// Atomically replace the immutable treaps which hold the cached keys to
// add and delete.
c.cacheLock.Lock()
c.cachedKeys = newCachedKeys (7)
c.cachedRemove = newCachedRemove
c.cacheLock.Unlock()
return nil
}
The main steps are:
- If more than one flush interval has passed since the last flush, or the dbCache has outgrown its size limit, as tested at (1), flush() at (2) writes the cached treaps to leveldb; the transaction's keys to add and remove are then written directly to leveldb via commitTreaps() at (3), after which pendingKeys and pendingRemove are cleared. A sketch of this flush decision follows the list.
- If no flush is needed, (5) and (6) merge the transaction's pendingKeys into newCachedKeys and its pendingRemove into newCachedRemove; that is, the keys the tx adds and deletes are written into the dbCache. Two points deserve attention. 1) When adding a key from pendingKeys to newCachedKeys, the same key must first be removed from newCachedRemove, lest the key be deleted when the cache is written to leveldb; likewise, when adding a key to newCachedRemove, the same key must be removed from newCachedKeys, lest a key meant for deletion be written to leveldb. 2) cachedKeys and cachedRemove are treap.Immutable pointers, and so are newCachedKeys and newCachedRemove. An immutable treap uses a copy-on-write (COW)-like scheme to increase read/write concurrency: when nodes are updated through Put() or Delete(), the nodes that change are copied, and together with the unchanged old nodes they form a new treap, which is returned. So at (5) and (6), newCachedKeys and newCachedRemove are re-pointed at the values returned by Delete() or Put(), i.e. at new treaps, while c.cachedKeys and c.cachedRemove still point at the treaps as they were before the changes. A Snapshot() of the dbCache taken at this moment therefore contains neither the transaction's pendingKeys nor its pendingRemove. This can be seen as the dbCache's MVCC implementation.
- Finally, (7) updates cachedKeys and cachedRemove in the dbCache, under the c.cacheLock write lock. From then on, a snapshot obtained via Snapshot() does include the pendingKeys and pendingRemove the transaction committed.
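The flush decision itself is a simple time-or-size test; a sketch consistent with the comments above (the actual needsFlush() in btcd also adds headroom for flush overhead, so treat the factor as approximate):

//sketch: the dbCache flush decision (approximate)
func (c *dbCache) needsFlushSketch(tx *transaction) bool {
    // Flush when more than flushInterval has passed since the last flush.
    if time.Since(c.lastFlush) > c.flushInterval {
        return true
    }
    // Flush when the cached data, as seen through the tx snapshot, has
    // grown past the configured maximum (with headroom for the flush).
    snap := tx.snapshot
    size := snap.pendingKeys.Size() + snap.pendingRemove.Size()
    return uint64(float64(size)*1.5) > c.maxSize
}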
Next, the implementation of flush():
//btcd/database/ffldb/dbcache.go
// flush flushes the database cache to persistent storage. This involves syncing
// the block store and replaying all transactions that have been applied to the
// cache to the underlying database.
//
// This function MUST be called with the database write lock held.
func (c *dbCache) flush() error {
c.lastFlush = time.Now()
// Sync the current write file associated with the block store. This is
// necessary before writing the metadata to prevent the case where the
// metadata contains information about a block which actually hasn't
// been written yet in unexpected shutdown scenarios.
if err := c.store.syncBlocks(); err != nil { (1)
return err
}
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
cachedKeys := c.cachedKeys
cachedRemove := c.cachedRemove
c.cacheLock.RUnlock()
// Nothing to do if there is no data to flush.
if cachedKeys.Len() == 0 && cachedRemove.Len() == 0 {
return nil
}
// Perform all leveldb updates using an atomic transaction.
if err := c.commitTreaps(cachedKeys, cachedRemove); err != nil { (2)
return err
}
// Clear the cache since it has been flushed.
c.cacheLock.Lock()
c.cachedKeys = treap.NewImmutable() (3)
c.cachedRemove = treap.NewImmutable()
c.cacheLock.Unlock()
return nil
}
Its main steps are:
- call blockStore's syncBlocks() to force the file buffers out to the disk files, preventing the metadata from describing blocks that have not actually been written in the event of an unexpected shutdown;
- write the dbCache's cached keys to leveldb via commitTreaps();
- point cachedKeys and cachedRemove at fresh empty treaps, effectively emptying the dbCache.
dbCache's commitTreaps() is straightforward: it calls leveldb's Put and Delete to apply cachedKeys and cachedRemove in turn, so we will not analyze it here; readers can consult its source. Let's look at dbCache's Snapshot():
//btcd/database/ffldb/dbcache.go
// Snapshot returns a snapshot of the database cache and underlying database at
// a particular point in time.
//
// The snapshot must be released after use by calling Release.
func (c *dbCache) Snapshot() (*dbCacheSnapshot, error) {
dbSnapshot, err := c.ldb.GetSnapshot()
if err != nil {
str := "failed to open transaction"
return nil, convertErr(str, err)
}
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
cacheSnapshot := &dbCacheSnapshot{
dbSnapshot: dbSnapshot,
pendingKeys: c.cachedKeys,
pendingRemove: c.cachedRemove,
}
c.cacheLock.RUnlock()
return cacheSnapshot, nil
}
As you can see, it simply builds a dbCacheSnapshot from a leveldb Snapshot, c.cachedKeys, and c.cachedRemove. When a key is looked up in a dbCacheSnapshot, cachedKeys and cachedRemove are searched before leveldb's Snapshot. The transaction's snapshot field points to exactly such an object.
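A sketch of that lookup order, close to dbCacheSnapshot's Get() in dbcache.go with error handling simplified:

//sketch: dbCacheSnapshot lookup order (simplified)
func (snap *dbCacheSnapshot) getSketch(key []byte) []byte {
    // A key scheduled for removal in the cache no longer exists.
    if snap.pendingRemove.Has(key) {
        return nil
    }
    // A key cached for addition shadows the underlying database.
    if value := snap.pendingKeys.Get(key); value != nil {
        return value
    }
    // Fall back to the point-in-time leveldb snapshot.
    value, err := snap.dbSnapshot.Get(key, nil)
    if err != nil {
        return nil
    }
    return value
}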
treap
Having analyzed these methods, we now understand how the dbCache caches keys, flushes them, and serves reads. The data structure doing the actual caching is treap.Immutable, the core of dbCache. Btcd's treap package provides both an Immutable and a Mutable version, and we now turn to its implementation. Readers unfamiliar with treaps can consult BYVoid's article 《隨機平衡二叉查找樹Treap的分析與應用》 (an analysis and application of the randomized balanced binary search tree, the treap). Briefly, a treap combines a binary search tree with a heap: to stay balanced dynamically, each node of the binary search tree carries a random value by which the nodes are heap-ordered, so the tree simultaneously forms a max-heap or min-heap, which keeps it balanced. Treap lookups take O(log N). For reasons of space we will not walk through all of the treap code; we focus on the Put() methods of Mutable and Immutable to see how the treap is built, how it rotates after an insertion, and how Immutable implements copy-on-write.
Let's first look at the definitions of Immutable and Mutable:
//btcd/database/internal/treap/mutable.go
// Mutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance. Search, insert, and delete
// operations are all O(log n).
type Mutable struct {
root *treapNode
count int
// totalSize is the best estimate of the total size of all data in
// the treap including the keys, values, and node sizes.
totalSize uint64
}
//btcd/database/internal/treap/immutable.go
// Immutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance. Search, insert, and delete
// operations are all O(log n). In addition, it provides O(1) snapshots for
// multi-version concurrency control (MVCC).
//
// All operations which result in modifying the treap return a new version of
// the treap with only the modified nodes updated. All unmodified nodes are
// shared with the previous version. This is extremely useful in concurrent
// applications since the caller only has to atomically replace the treap
// pointer with the newly returned version after performing any mutations. All
// readers can simply use their existing pointer as a snapshot since the treap
// it points to is immutable. This effectively provides O(1) snapshot
// capability with efficient memory usage characteristics since the old nodes
// only remain allocated until there are no longer any references to them.
type Immutable struct {
root *treapNode
count int
// totalSize is the best estimate of the total size of all data in
// the treap including the keys, values, and node sizes.
totalSize uint64
}
Immutable and Mutable are defined identically; the difference is that Immutable provides copy-on-write, as we will see in their Put() methods. The root field points to the treap's root node, which is defined as:
//btcd/database/internal/treap/common.go
// treapNode represents a node in the treap.
type treapNode struct {
key []byte
value []byte
priority int
left *treapNode
right *treapNode
}
A treapNode's key and value carry the node's payload; priority is the random value used for heap ordering, also called the node's priority; left and right point to the roots of the left and right subtrees. Let's look at Mutable's Put() first to understand how the treap is built and rotated after an insertion:
//btcd/database/internal/treap/mutable.go
// Put inserts the passed key/value pair.
func (t *Mutable) Put(key, value []byte) {
// Use an empty byte slice for the value when none was provided. This
// ultimately allows key existence to be determined from the value since
// an empty byte slice is distinguishable from nil.
if value == nil {
value = emptySlice
}
// The node is the root of the tree if there isn't already one.
if t.root == nil { (1)
node := newTreapNode(key, value, rand.Int())
t.count = 1
t.totalSize = nodeSize(node)
t.root = node
return
}
// Find the binary tree insertion point and construct a list of parents
// while doing so. When the key matches an entry already in the treap,
// just update its value and return.
var parents parentStack
var compareResult int
for node := t.root; node != nil; {
parents.Push(node)
compareResult = bytes.Compare(key, node.key)
if compareResult < 0 {
node = node.left (2)
continue
}
if compareResult > 0 {
node = node.right (3)
continue
}
// The key already exists, so update its value.
t.totalSize -= uint64(len(node.value))
t.totalSize += uint64(len(value))
node.value = value (4)
return
}
// Link the new node into the binary tree in the correct position.
node := newTreapNode(key, value, rand.Int()) (5)
t.count++
t.totalSize += nodeSize(node)
parent := parents.At(0)
if compareResult < 0 {
parent.left = node (6)
} else {
parent.right = node (7)
}
// Perform any rotations needed to maintain the min-heap.
for parents.Len() > 0 {
// There is nothing left to do when the node's priority is
// greater than or equal to its parent's priority.
parent = parents.Pop()
if node.priority >= parent.priority { (8)
break
}
// Perform a right rotation if the node is on the left side or
// a left rotation if the node is on the right side.
if parent.left == node {
node.right, parent.left = parent, node.right (9)
} else {
node.left, parent.right = parent, node.left (10)
}
t.relinkGrandparent(node, parent, parents.At(0))
}
}
......
// relinkGrandparent relinks the node into the treap after it has been rotated
// by changing the passed grandparent's left or right pointer, depending on
// where the old parent was, to point at the passed node. Otherwise, when there
// is no grandparent, it means the node is now the root of the tree, so update
// it accordingly.
func (t *Mutable) relinkGrandparent(node, parent, grandparent *treapNode) {
// The node is now the root of the tree when there is no grandparent.
if grandparent == nil {
t.root = node (11)
return
}
// Relink the grandparent's left or right pointer based on which side
// the old parent was.
if grandparent.left == parent {
grandparent.left = node (12)
} else {
grandparent.right = node (13)
}
}
The main steps are:
- For an empty tree, the first node added becomes the root directly, as at (1); note that a node's priority is a random integer produced by rand.Int().
- For a non-empty tree, the insertion point is located by key, recording the search path in a parentStack. Starting from the root: if the key to insert is less than the current node's key, descend into the left subtree, as at (2); if greater, descend into the right subtree, as at (3); if it equals the current node's key, simply update that node's value, as at (4).
- If the key is not found, a new node must be inserted, and the last node in parents is its parent; note that parents.At(0) is the last node on the search path. If the new key is less than the parent's key, the new node becomes the parent's left child, as at (6); otherwise it becomes the right child, as at (7).
- Because the new node's priority is random, inserting it may break the min-heap property, so rotations follow, recursing upward until the whole tree is in min-heap order again. At (8), if the new node's priority is greater than or equal to its parent's, no rotation is needed and the tree already satisfies min-heap order; if it is less, a rotation must make the parent a child of the new node. If the new node is its parent's left child, a right rotation is performed, as at (9); if it is the right child, a left rotation, as at (10).
- After the rotation, the old parent has become the new node's child, but the grandparent (the old parent's parent) still points at the old parent; relinkGrandparent() finishes the job. If the grandparent is nil, the old parent was the root, so the new node simply becomes the root of the tree, as at (11); (12) and (13) put the new node in the old parent's place as the grandparent's left or right child.
- Once the new node, the old parent, and the grandparent are relinked, the new node has become the parent and the old parent a child, while the grandparent is unchanged; the new node's priority may still be less than the grandparent's, in which case the new parent, the grandparent, and the grandparent's parent must rotate again. This recurses up toward the root until every node on the search path satisfies min-heap order, completing the rotation and the insertion. A short usage sketch follows.
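A short usage sketch of the Mutable treap; since the package lives under btcd/database/internal/treap it can only be imported from within the database package, so treat this as illustrative:

//example: treap.Mutable in use (illustrative; treap is an internal package)
func mutableDemo() {
    t := treap.NewMutable()
    t.Put([]byte("b"), []byte("2"))
    t.Put([]byte("a"), []byte("1"))
    t.Put([]byte("a"), []byte("1-updated")) // existing key: value updated in place

    fmt.Println(t.Len())                   // 2
    fmt.Printf("%s\n", t.Get([]byte("a"))) // "1-updated"

    // ForEach visits pairs in ascending key order, a property of the
    // underlying binary search tree.
    t.ForEach(func(k, v []byte) bool {
        fmt.Printf("%s=%s\n", k, v)
        return true // keep iterating
    })
}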
Mutable's Put() gives us the full picture of how a treap is built and how insertion and the accompanying subtree rotations work. Immutable's Put() follows roughly the same steps, except that instead of modifying the original nodes or rotating the original tree, it copies every node on the search path; the copies, together with the untouched nodes of the original tree, form a new tree, on which the update or rotation is performed before the new tree is returned. Its implementation:
//btcd/database/internal/treap/immutable.go
// Put inserts the passed key/value pair.
func (t *Immutable) Put(key, value []byte) *Immutable {
// Use an empty byte slice for the value when none was provided. This
// ultimately allows key existence to be determined from the value since
// an empty byte slice is distinguishable from nil.
if value == nil {
value = emptySlice
}
// The node is the root of the tree if there isn't already one.
if t.root == nil {
root := newTreapNode(key, value, rand.Int())
return newImmutable(root, 1, nodeSize(root)) (1)
}
// Find the binary tree insertion point and construct a replaced list of
// parents while doing so. This is done because this is an immutable
// data structure so regardless of where in the treap the new key/value
// pair ends up, all ancestors up to and including the root need to be
// replaced.
//
// When the key matches an entry already in the treap, replace the node
// with a new one that has the new value set and return.
var parents parentStack
var compareResult int
for node := t.root; node != nil; {
// Clone the node and link its parent to it if needed.
nodeCopy := cloneTreapNode(node)
if oldParent := parents.At(0); oldParent != nil {
if oldParent.left == node {
oldParent.left = nodeCopy (2)
} else {
oldParent.right = nodeCopy (3)
}
}
parents.Push(nodeCopy) (4)
// Traverse left or right depending on the result of comparing
// the keys.
compareResult = bytes.Compare(key, node.key)
if compareResult < 0 {
node = node.left
continue
}
if compareResult > 0 {
node = node.right
continue
}
// The key already exists, so update its value.
nodeCopy.value = value (5)
// Return new immutable treap with the replaced node and
// ancestors up to and including the root of the tree.
newRoot := parents.At(parents.Len() - 1) (6)
newTotalSize := t.totalSize - uint64(len(node.value)) + (7)
uint64(len(value))
return newImmutable(newRoot, t.count, newTotalSize) (8)
}
// Link the new node into the binary tree in the correct position.
node := newTreapNode(key, value, rand.Int())
parent := parents.At(0)
if compareResult < 0 {
parent.left = node
} else {
parent.right = node
}
// Perform any rotations needed to maintain the min-heap and replace
// the ancestors up to and including the tree root.
newRoot := parents.At(parents.Len() - 1)
for parents.Len() > 0 {
// There is nothing left to do when the node's priority is
// greater than or equal to its parent's priority.
parent = parents.Pop()
if node.priority >= parent.priority {
break
}
// Perform a right rotation if the node is on the left side or
// a left rotation if the node is on the right side.
if parent.left == node {
node.right, parent.left = parent, node.right
} else {
node.left, parent.right = parent, node.left
}
// Either set the new root of the tree when there is no
// grandparent or relink the grandparent to the node based on
// which side the old parent the node is replacing was on.
grandparent := parents.At(0)
if grandparent == nil {
newRoot = node
} else if grandparent.left == parent {
grandparent.left = node
} else {
grandparent.right = node
}
}
return newImmutable(newRoot, t.count+1, t.totalSize+nodeSize(node)) (9)
}
The main differences from Mutable's Put() are:
- Inserting into an empty tree does not make the new node the root of the original tree; instead, a new treap is created with the new node as its root and returned, as at (1).
- While searching for the insertion point, every node on the search path is copied, as at (2), (3), and (4). If the key is found, the value is updated on the copied node rather than the original, as at (5); a new tree is then created from the copied root and returned, as at (6), (7), and (8).
- Otherwise the key is not in the tree and a new node is created; it is linked into the copied parent, the rotations are performed on the copied tree, and the new tree is returned, as at (9). Note that none of the original tree's nodes are modified: the original and new trees share every node off the search path, as the sketch below demonstrates.
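The effect is easy to demonstrate: after Put() returns a new version, the old Immutable pointer still sees the old contents, which is exactly the O(1) snapshot property the dbCache relies on (illustrative, same internal-package caveat as above):

//example: Immutable's copy-on-write behavior (illustrative)
func immutableDemo() {
    t1 := treap.NewImmutable()
    t1 = t1.Put([]byte("k"), []byte("old"))

    // Put returns a new treap sharing unmodified nodes; t1 itself never changes.
    t2 := t1.Put([]byte("k"), []byte("new"))

    fmt.Printf("%s\n", t1.Get([]byte("k"))) // "old": t1 acts as an O(1) snapshot
    fmt.Printf("%s\n", t2.Get([]byte("k"))) // "new"
}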
By copying the nodes on the search path and returning a new root, Immutable's Put() implements copy-on-write, which in turn underpins the dbCache's MVCC. This completes our tour of how ffldb works, including a detailed look at blockStore, dbCache, and treap, the data structure dbCache builds on; you should now have a complete and clear picture of how a Bitcoin node looks up blocks and stores them on disk. In the next article, we will look at the implementation of the network protocol in btcd and how blocks travel across the P2P network.