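These are my notes from reading the second part of Distributed systems for fun and profit.
Distributed programming is mostly about dealing with the consequences of distribution. Why? Ideally, programming a distributed system would feel just like programming a single machine, since that is the friendliest abstraction for programmers. In practice the ideal does not hold: we have to peel back the abstraction and deal with the problems of the multi-machine system hidden behind the single-machine facade before we can solve anything well. That is why we keep looking for a better abstraction, one that makes programming in a distributed environment as simple as possible.
So the question becomes: what makes one abstraction better than another?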
What do we mean when we say X is more abstract than Y? First, that X does not introduce anything new or fundamentally different from Y. In fact, X may remove some aspects of Y or present them in a way that makes them more manageable. Second, that X is in some sense easier to grasp than Y, assuming that the things that X removed from Y are not important to the matter at hand.
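In short: describe a thing clearly using as few assumptions as possible.
Abstraction matters a great deal because it helps us focus on the main problem. If we can grasp the essence of a problem, the solution we arrive at will also be the most general one.
That raises another question: how do we know which things are primary, or essential?
Every time we strip a concrete constraint from a real system (on the assumption that it is not the main problem), we may be planting a hidden risk. This is why a system often has to reintroduce constraints later: the constraints that were removed turn out to be the factors that actually determine how the system behaves.
With this in mind, we can ask what the smallest set of assumptions is under which the system still works. Such a set of assumptions is what we usually call a system model.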
A system model
The defining property of a distributed system is that it is distributed. More specifically, the programs in a distributed system:
- run concurrently on independent nodes … [concurrency]
- are connected by a network that may introduce nondeterminism and message loss … [network links]
- and have no shared memory or shared clock. [no shared memory or clock]
In more detail:
- each node executes a program concurrently [concurrent execution]
- knowledge is local: nodes have fast access only to their local state, and any information about global state is potentially out of date [local knowledge only]
- nodes can fail and recover from failure independently [independent failure and recovery]
- messages can be delayed or lost (independent of node failure; it is not easy to distinguish network failure and node failure) [unreliable communication]
- and clocks are not synchronized across nodes (local timestamps do not correspond to the global real time order, which cannot be easily observed) [unsynchronized clocks]
So what is a system model?
System model: a set of assumptions about the environment and facilities on which a distributed system is implemented
A system model defines assumptions about the environment and facilities, including:
- what capabilities the nodes have and how they may fail [node capabilities and failure modes]
- how communication links operate and how they may fail [link behavior and failure modes]
- properties of the overall system, such as assumptions about time and order [system-wide properties such as time and order]
What is a robust system? It is one whose system model makes the fewest assumptions. Because so little is assumed, algorithms designed for such a system tolerate faults very well; for the same reason, those algorithms are also hard to understand.
The sections below cover the properties of nodes, links, and time and order in turn.
Nodes in our system model
Nodes are the hosts in the system that provide compute and storage. They have:
- the ability to execute a program [compute]
- the ability to store data into volatile memory (which can be lost upon failure) and into stable state (which can be read after a failure) [storage]
- a clock (which may or may not be assumed to be accurate) [clock]
Here we assume that nodes follow the crash-recovery failure model, and we do not consider Byzantine fault tolerance, that is, arbitrary (Byzantine) faults.
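The split between volatile memory and stable state under the crash-recovery model can be illustrated with a small sketch. This is only a toy, not a real storage engine: the `Node` class and its `put`, `crash`, and `recover` methods are hypothetical names, and "stable state" is simulated with an in-memory dict that survives the simulated crash.

```python
class Node:
    """Toy node under the crash-recovery model (hypothetical, for illustration)."""

    def __init__(self):
        self.volatile = {}  # lost when the node crashes
        self.stable = {}    # stands in for disk; survives a crash
        self.alive = True

    def put(self, key, value, durable=False):
        if not self.alive:
            raise RuntimeError("node is down")
        self.volatile[key] = value
        if durable:
            self.stable[key] = value

    def crash(self):
        # A crash wipes volatile memory but leaves stable state untouched.
        self.alive = False
        self.volatile.clear()

    def recover(self):
        # After recovery, only what was written to stable state can be read back.
        self.volatile = dict(self.stable)
        self.alive = True


node = Node()
node.put("cache", "tmp")
node.put("log", "committed", durable=True)
node.crash()
node.recover()
print(node.volatile)  # {'log': 'committed'}
```

Only the durable write is visible after recovery, which is why protocols built on this model write anything they must remember to stable state before acknowledging it.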
Communication links in our system model
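The hardest assumptions to deal with in a distributed system are the ones about communication. It is difficult for one node to know the state of another, because every communication channel is unreliable: if messages may never arrive, a node has no sure way of learning what is happening elsewhere. As a result, the only information a node can rely on is its own local state.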
Timing / ordering assumptions
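In a distributed system we have to accept that every node sees a different world. The difference comes from one fact: transmitting information takes time. Each node observes the same event at a different moment, so each node's view of the world corresponds to a different point in time.
There are two main models of time: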
Synchronous system model
Processes execute in lock-step; there is a known upper bound on message transmission delay; each process has an accurate clock
Asynchronous system model
No timing assumptions - e.g. processes execute at independent rates; there is no bound on message transmission delay; useful clocks do not exist
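One practical consequence of the two models shows up in failure detection. The sketch below is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, peers are assumed to send heartbeats at a fixed interval, and in the synchronous case messages are assumed not to be lost. With a known bound on message delay, a long enough silence proves a crash; without a bound, the same silence proves nothing.

```python
from typing import Optional


def silence_proves_crash(silence: float, heartbeat_interval: float,
                         max_delay: Optional[float]) -> bool:
    """Return True only if the observed silence can be explained by a crash alone.

    max_delay is the known upper bound on message delay (synchronous model),
    or None when no such bound exists (asynchronous model).
    """
    if max_delay is None:
        # Asynchronous model: a slow message and a dead node look the same.
        return False
    # Synchronous model: a live peer's next heartbeat would have arrived by now.
    return silence > heartbeat_interval + max_delay


print(silence_proves_crash(5.0, 1.0, 2.0))   # True
print(silence_proves_crash(2.5, 1.0, 2.0))   # False
print(silence_proves_crash(5.0, 1.0, None))  # False
```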
The consensus problem
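The discussion below turns on two conditions: whether network partitions are part of the failure model, and whether the timing model is synchronous or asynchronous.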
- whether or not network partitions are included in the failure model, and
- synchronous vs. asynchronous timing assumptions
First, a definition of the consensus problem. A solution must provide:
- Agreement: Every correct process must agree on the same value.
- Integrity: Every correct process decides at most one value, and if it decides some value, then it must have been proposed by some process.
- Termination: All processes eventually reach a decision.
- Validity: If all correct processes propose the same value V, then all correct processes decide V.
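The four properties above can be checked mechanically against a recorded run. The sketch below is an illustration only: `check_consensus` and its argument layout are hypothetical, termination is approximated as "every correct process has decided by the end of the recorded run", and failed processes are simply excluded from the check.

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run against the four consensus properties.

    proposals: {process: proposed value}
    decisions: {process: list of values that process decided}
    correct:   set of processes that did not fail
    """
    decided = {p: decisions.get(p, []) for p in correct}
    all_values = [v for vals in decided.values() for v in vals]
    proposed_by_correct = {proposals[p] for p in correct}
    unanimous = len(proposed_by_correct) == 1
    return {
        # Every correct process agrees on the same value.
        "agreement": len(set(all_values)) <= 1,
        # At most one decision each, and only values that were proposed.
        "integrity": all(len(vals) <= 1 for vals in decided.values())
        and all(v in proposals.values() for v in all_values),
        # Every correct process reached a decision.
        "termination": all(len(vals) >= 1 for vals in decided.values()),
        # If all correct processes proposed V, they all decided V.
        "validity": (not unanimous)
        or all(bool(vals) and set(vals) == proposed_by_correct
               for vals in decided.values()),
    }


good = check_consensus({"a": 1, "b": 1, "c": 2},
                       {"a": [1], "b": [1], "c": [1]},
                       {"a", "b", "c"})
split = check_consensus({"a": 1, "b": 2},
                        {"a": [1], "b": [2]},
                        {"a", "b"})
print(good["agreement"], split["agreement"])  # True False
```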
Two impossibility results
What is an impossibility result?
A proof of impossibility, also known as negative proof, proof of an impossibility theorem, or negative result, is a proof demonstrating that a particular problem cannot be solved, or cannot be solved in general. Often proofs of impossibility have put to rest decades or centuries of work attempting to find a solution. To prove that something is impossible is usually much harder than the opposite task; it is necessary to develop a theory. Impossibility theorems are usually expressible as universal propositions in logic (see universal quantification).
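Put plainly, it proves that something cannot be done, so people can stop wasting effort on it.
Distributed systems have two important impossibility results, FLP and CAP. FLP matters mainly for academic research, so it is not covered in depth here. The rest of this section focuses on CAP.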
The CAP theorem
The letters in CAP stand for:
- Consistency: all nodes see the same data at the same time.
- Availability: node failures do not prevent survivors from continuing to operate.
- Partition tolerance: the system continues to operate despite message loss due to network and/or node failure
Only two of these three properties can hold at the same time, which gives three kinds of systems:
- CA (consistency + availability). Examples include full strict quorum protocols, such as two-phase commit.
- CP (consistency + partition tolerance). Examples include majority quorum protocols in which minority partitions are unavailable such as Paxos.
- AP (availability + partition tolerance). Examples include protocols using conflict resolution, such as Dynamo.
Both CA and CP systems provide the strong consistency model. The difference is that a CA system cannot tolerate network partitions, whereas a CP system with 2f+1 nodes can tolerate the failure of f of them. The reason is simple:
- A CA system does not distinguish between node failures and network failures, and hence must stop accepting writes everywhere to avoid introducing divergence (multiple copies). It cannot tell whether a remote node is down, or whether just the network connection is down: so the only safe thing is to stop accepting writes. [cannot tell partitions from node failures, so writes must stop]
- A CP system prevents divergence (e.g. maintains single-copy consistency) by forcing asymmetric behavior on the two sides of the partition. It only keeps the majority partition around, and requires the minority partition to become unavailable (e.g. stop accepting writes), which retains a degree of availability (the majority partition) and still ensures single-copy consistency. [even during a partition, the majority side keeps serving]
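The 2f+1 arithmetic and the majority rule described above can be written out directly. This is a minimal sketch; the function names are hypothetical and it models only the counting, not any particular protocol.

```python
def max_tolerated_failures(cluster_size: int) -> int:
    """Largest f such that cluster_size >= 2f + 1."""
    return (cluster_size - 1) // 2


def is_majority(partition_size: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority of the nodes."""
    return partition_size > cluster_size // 2


# A 5-node cluster split 3/2: only the 3-node side may keep accepting writes.
print(max_tolerated_failures(5))             # 2
print(is_majority(3, 5), is_majority(2, 5))  # True False

# A 4-node cluster split 2/2: neither side has a majority.
print(is_majority(2, 4))                     # False
```

Because at most one side of any partition can hold a strict majority, two sides can never both accept writes, which is what keeps a single copy of the data.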
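Because a CP system includes network partitions in its failure model, it can use protocols such as Paxos and Raft to distinguish between a majority partition and a minority partition.
A CA system does not account for network partitions, so it cannot tell whether a node is unresponsive because it cannot receive messages or because it has failed. The only way it can prevent data inconsistency is to stop serving. Since a CA system cannot guarantee that the network is reliable, it relies on the two-phase commit algorithm to keep data consistent.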
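The behavior of two-phase commit mentioned above can be sketched in a few lines. This is a simplified illustration, not a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical, and coordinator failure, write-ahead logging, timeouts, and retries are all left out. It shows the key point: any participant that votes no, or cannot be reached, forces the whole transaction to abort.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name, vote=Vote.YES, reachable=True):
        self.name = name
        self.vote = vote
        self.reachable = reachable
        self.state = "init"

    def prepare(self):
        if not self.reachable:
            raise ConnectionError(f"{self.name} did not respond")
        self.state = "prepared" if self.vote is Vote.YES else "aborted"
        return self.vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        if self.reachable:
            self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect votes; an unreachable participant counts as a "no".
    try:
        votes = [p.prepare() for p in participants]
    except ConnectionError:
        votes = [Vote.NO]
    # Phase 2: commit only if every participant voted yes.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


healthy = [Participant("a"), Participant("b")]
partitioned = [Participant("a"), Participant("b", reachable=False)]
print(two_phase_commit(healthy))      # committed
print(two_phase_commit(partitioned))  # aborted
```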
We can draw four conclusions from the CAP theorem:
- First, that many system designs used in early distributed relational database systems did not take into account partition tolerance (e.g. they were CA designs). Partition tolerance is an important property for modern systems, since network partitions become much more likely if the system is geographically distributed (as many large systems are). [Early systems mostly ignored P and were therefore CA systems; modern systems, especially geo-distributed multi-master deployments, have to account for partitions]
- Second, that there is a tension between strong consistency and high availability during network partitions. The CAP theorem is an illustration of the tradeoffs that occur between strong guarantees and distributed computation. [Since P cannot be avoided, the choice is between C and A; sometimes relaxing the consistency model and giving up strong consistency gets us something close to "CAP"]
- Third, that there is a tension between strong consistency and performance in normal operation. [When an operation involves fewer messages and fewer nodes, latency is naturally lower, but some nodes are then contacted less often and may hold stale data]
- Fourth - and somewhat indirectly - that if we do not want to give up availability during a network partition, then we need to explore whether consistency models other than strong consistency are workable for our purposes. [The "pick 2 of 3" framing can be misleading; if we do not restrict ourselves to strong consistency, we have more options]
Keep in mind:
ACID consistency != CAP consistency != Oatmeal consistency
A consistency model is defined as follows:
Consistency model
a contract between programmer and system, wherein the system guarantees that if the programmer follows some specific rules, the results of operations on the data store will be predictable
In other words, as long as the programmer follows the agreed rules, the outcome of operations on the data store is predictable.
Here are some consistency models:
Strong consistency vs. other consistency models
- Strong consistency models (capable of maintaining a single copy)
  - Linearizable consistency
  - Sequential consistency
- Weak consistency models (not strong)
  - Client-centric consistency models
  - Causal consistency: strongest model available
  - Eventual consistency models
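Consistency models fall into two broad categories: strong and weak.
Strong consistency models give the programmer the same model as a single-machine system. Weak consistency models make the programmer clearly aware of programming in a distributed environment rather than on a single machine.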
Strong consistency models
Strong consistency models can be further divided into two types:
- Linearizable consistency: Under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations. (Herlihy & Wing, 1991)
- Sequential consistency: Under sequential consistency, all operations appear to have executed atomically in some order that is consistent with the order seen at individual nodes and that is equal at all nodes. (Lamport, 1979)
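The key difference is that linearizable consistency requires the order in which operations take effect to match the real-time order in which they were actually issued, whereas sequential consistency allows the two orders to differ as long as every node observes the same order. The two can be told apart only by observing all the inputs and timings going into the system; from the point of view of a client talking to a single node, the difference can mostly be ignored.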
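The difference between the two definitions can be made concrete with a brute-force checker for a single register. This is a sketch for tiny histories only: the `Op` record and both checker functions are hypothetical, the register is assumed to start at 0, and the search tries every permutation, so it does not scale. Linearizability constrains the order by real time (an operation that finished before another started must come first); sequential consistency constrains it only by each process's own program order.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Op:
    proc: str     # process that issued the operation
    kind: str     # "w" for write, "r" for read
    value: int    # value written, or value the read returned
    start: float  # real time at which the operation was invoked
    end: float    # real time at which it returned


def legal_register(order, initial=0):
    """Every read must return the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "w":
            current = op.value
        elif op.value != current:
            return False
    return True


def exists_order(history, must_precede):
    """Search for a legal total order that respects the given constraints."""
    for order in permutations(history):
        pos = {op: i for i, op in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in must_precede) and legal_register(order):
            return True
    return False


def is_sequentially_consistent(history):
    program_order = [(a, b) for a in history for b in history
                     if a.proc == b.proc and a.start < b.start]
    return exists_order(history, program_order)


def is_linearizable(history):
    real_time_order = [(a, b) for a in history for b in history
                       if a.end < b.start]
    return exists_order(history, real_time_order)


# P1 finishes writing 1 before P2 starts a read that still returns 0.
history = [Op("P1", "w", 1, 0.0, 1.0), Op("P2", "r", 0, 2.0, 3.0)]
print(is_sequentially_consistent(history))  # True
print(is_linearizable(history))             # False
```

The stale read is acceptable under sequential consistency because the read can be ordered before the write without violating either process's program order, but real time forbids that reordering under linearizability.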
Client-centric consistency models
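This family of models addresses the following situation: a client performs an operation and sees the latest result, then loses its connection and reconnects to a server. After reconnecting, it must not see an older result than the one it has already seen.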
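One way to picture this guarantee is a client session that remembers the newest version it has seen and refuses anything older. The sketch below is an assumption-laden toy: `Replica`, `Session`, and `StaleReadError` are hypothetical names, and each replica keeps a simple integer version counter rather than a real replication log.

```python
class StaleReadError(Exception):
    """Raised when a replica is older than what the session has already seen."""


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value):
        self.version += 1
        self.value = value
        return self.version

    def read(self):
        return self.version, self.value


class Session:
    """Tracks the newest version this client has observed."""

    def __init__(self):
        self.min_version = 0

    def write(self, replica, value):
        self.min_version = max(self.min_version, replica.write(value))

    def read(self, replica):
        version, value = replica.read()
        if version < self.min_version:
            # The client already saw newer data; do not go back in time.
            raise StaleReadError(f"replica at v{version}, need v{self.min_version}")
        self.min_version = version
        return value


fresh, lagging = Replica(), Replica()
session = Session()
session.write(fresh, "v1")
print(session.read(fresh))  # v1

try:
    session.read(lagging)   # reconnecting to a replica that missed the write
    stale_rejected = False
except StaleReadError:
    stale_rejected = True
print(stale_rejected)       # True
```

A real client would typically retry against another replica or wait for the lagging one to catch up instead of surfacing the error.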
Eventual consistency
Two things need to be pinned down about eventual consistency:
- First, how long is "eventually"? It would be useful to have a strict lower bound, or at least some idea of how long it typically takes for the system to converge to the same value.
- Second, how do the replicas agree on a value? The "how" matters a great deal, because a poorly designed reconciliation method can lead to data loss.
So when discussing eventual consistency, keep in mind that it may mean: "eventually last-writer-wins, and read-the-latest-observed-value in the meantime"
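A last-writer-wins register makes both questions concrete: replicas converge by keeping the write with the highest timestamp, and the losing concurrent write is silently discarded. The sketch below is illustrative only; `LWWRegister` is a hypothetical name, timestamps are plain integers supplied by the caller, and ties are broken by node id as an arbitrary but deterministic choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: int
    node_id: str

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep the write with the larger (timestamp, node_id) pair.
        return max(self, other, key=lambda r: (r.timestamp, r.node_id))


# Two replicas accept different writes while they cannot talk to each other.
on_a = LWWRegister("cart=[book]", timestamp=10, node_id="a")
on_b = LWWRegister("cart=[pen]", timestamp=11, node_id="b")

# Merging in either direction gives the same result, so replicas converge...
merged = on_a.merge(on_b)
print(merged == on_b.merge(on_a))  # True

# ...but the earlier write is lost rather than combined with the later one.
print(merged.value)                # cart=[pen]
```

Until the merge happens, each replica keeps returning its own latest observed value, which matches the "read-the-latest-observed-value in the meantime" part of the quote.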