Synced blog: My Love
Original article: Demystifying Containers 101: A Deep Dive Into Container Technology for Beginners
Translator: JonyChiao
Introduction
Regardless of whether you are a student in school, a developer at some company, or a software enthusiast, chances are you have heard of containers. You may have also heard that containers are lightweight virtual machines, but what does that really mean, how exactly do containers work, and why are they so important?
This story is a look into containers, the key technical ideas behind them, and their applications. I won’t assume any prior knowledge in this field other than a basic understanding of computer science.
The Kernel and the OS
Your laptop, along with every other computer, is built on top of hardware such as the CPU, persistent storage (disk drive, SSD), memory, network card, etc.
To interact with this hardware, a piece of software in the operating system called the kernel serves as the bridge between the hardware and the rest of the system. The kernel is responsible for scheduling processes (programs) to run, managing devices (reading and writing addresses on disk and memory), and more.
The rest of the operating system serves to boot and manage the user space, where user processes run, and constantly interacts with the kernel.
The kernel is part of the operating system and interfaces with the hardware. The kernel runs in the “kernel space” while user programs run in the “user space”. The kernel space is responsible for managing the user space.
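To make the user space / kernel space boundary a little more concrete, here is a minimal Python sketch. It assumes nothing beyond the standard library; the function name roundtrip_through_kernel is made up for illustration. Each os call below is a thin wrapper around a system call, which is how a user process asks the kernel to do something on its behalf.

```python
import os


def roundtrip_through_kernel(message: bytes) -> bytes:
    """Send bytes through a kernel-managed pipe and read them back.

    The helper name is hypothetical. The point is that the buffer lives in
    the kernel, and user code can only reach it through system calls.
    """
    read_fd, write_fd = os.pipe()                # kernel allocates the pipe
    try:
        os.write(write_fd, message)              # user space -> kernel
        return os.read(read_fd, len(message))    # kernel -> user space
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    # The process ID is also handed out and tracked by the kernel.
    print("my PID according to the kernel:", os.getpid())
    print(roundtrip_through_kernel(b"hello kernel"))
```

The message is kept short on purpose so that the write fits in the pipe buffer and never blocks.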
The Virtual Machine
So you have a computer that runs macOS and an application that is built to run on Ubuntu. Hmmm… One common solution is to boot up a virtual machine on your macOS computer that runs Ubuntu and then run your program there.
A virtual machine consists of some level of hardware and kernel virtualization on which a guest operating system runs. A piece of software called a hypervisor creates the virtualized hardware, which may include a virtual disk, virtual network interface, virtual CPU, and more. Virtual machines also include a guest kernel that can talk to this virtual hardware.
The hypervisor can be hosted, which means it is software that runs on the host OS (macOS in the example). It can also be bare metal, running directly on the machine hardware (replacing your OS). Either way, the hypervisor approach is considered heavyweight, as it requires virtualizing multiple parts, if not all, of the hardware and kernel.
When multiple isolated groups need to share the same machine, running a VM for each of these groups is far too heavy and wasteful of resources to be a good approach.
VMs require hardware virtualization for machine-level isolation, whereas containers rely on isolation within the same operating system. The overhead difference becomes really apparent as the number of isolated spaces increases. A regular laptop can run tens of containers but can struggle to run even one VM well.
cgroups
In 2006, engineers at Google invented the Linux “control groups”, abbreviated as cgroups. This is a feature of the Linux kernel that isolates and controls the resource usage of user processes.
These processes can be put into groups, essentially collections of processes that share the same resource limits. A computer can have multiple groups, each with resource properties enforced by the kernel.
The resource allocation per group can be managed in order to limit how much of the overall CPU, RAM, etc. a set of processes can use. For example, a background log aggregation application will probably need its resources limited so that it does not accidentally overwhelm the server it is logging.
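As a rough sketch of what such a limit looks like in practice, the Python below builds the settings for the log aggregator example. It assumes the cgroup v2 interface, where a group is a directory under /sys/fs/cgroup and limits are plain text files such as memory.max and cpu.max. The helper names and the group name log-aggregator are made up for illustration, and the sketch writes into a temporary directory instead of the real hierarchy, which would require root.

```python
import os
import tempfile


def cgroup_v2_limits(memory_bytes: int, cpu_fraction: float,
                     period_us: int = 100_000) -> dict:
    """Return cgroup v2 file contents for a memory cap and a CPU cap.

    cpu.max holds "<quota> <period>" in microseconds, so half a CPU
    with the default period is "50000 100000".
    """
    quota_us = round(cpu_fraction * period_us)
    return {
        "memory.max": str(memory_bytes),
        "cpu.max": f"{quota_us} {period_us}",
    }


def write_limits(group_dir: str, limits: dict) -> None:
    """Write each setting into its own file inside the group directory."""
    for filename, value in limits.items():
        with open(os.path.join(group_dir, filename), "w") as f:
            f.write(value + "\n")


if __name__ == "__main__":
    # Stand-in for /sys/fs/cgroup/log-aggregator (hypothetical group name).
    with tempfile.TemporaryDirectory() as group_dir:
        limits = cgroup_v2_limits(256 * 1024 * 1024, cpu_fraction=0.5)
        write_limits(group_dir, limits)
        print(sorted(os.listdir(group_dir)))
```

On a real system, a process joins the group when its PID is written to the group's cgroup.procs file, after which the kernel enforces the limits.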
While not technically part of cgroups, a closely related Linux kernel feature is namespace isolation. The idea of isolation itself is not new, and Linux already had several kinds of it. One common example is process isolation, which gives each process its own address space and prevents things like unintentionally shared memory.
Combined with cgroups, namespace isolation is a higher level of isolation that makes sure processes within one namespace are independent of processes in other namespaces. A few important namespace isolation features are outlined below; they lay the foundation for the isolation we expect from containers.
PID (Process Identifier) Namespaces: ensure that processes within one namespace are not aware of processes in other namespaces.
Network Namespaces: isolate the network interface controller, iptables, routing tables, and other lower-level networking tools.
Mount Namespaces: filesystems are mounted so that the file system scope of a namespace is limited to only the directories mounted.
User Namespaces: limit users within a namespace to only that namespace and avoid user ID conflicts across namespaces.
To put it simply, each namespace appears to be its own machine to the processes within it.
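On a Linux machine you can see which namespaces a process belongs to by looking under /proc. The sketch below is a small helper (namespaces_of is a name I made up) that reads the symbolic links in /proc/<pid>/ns. It returns an empty dict on systems without that directory, such as macOS or Windows.

```python
import os


def namespaces_of(pid="self") -> dict:
    """Map namespace type (pid, net, mnt, user, ...) to its identifier.

    Each entry under /proc/<pid>/ns is a symlink whose target looks like
    "pid:[4026531836]". Two processes that show the same target for a
    type share that namespace.
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, or no such process
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable with our permissions
    return result


if __name__ == "__main__":
    for ns_type, ident in namespaces_of().items():
        print(f"{ns_type:20} {ident}")
```

Running this on the host and again inside a container would typically show different identifiers for pid, net, and mnt, which is the isolation described in the list above.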
Linux Containers
Linux cgroups paved the way for a technology called Linux containers (LXC). LXC was really the first major implementation of what we know today as a container, taking advantage of cgroups and namespace isolation to create a virtual environment with separate process and networking space.
In a sense, this allows for independent and isolated user spaces. The idea of containers follows directly from LXC. In fact, earlier versions of Docker were built directly on top of LXC.
Docker
Docker is the most widely used container technology and really what most people mean when they refer to containers. While there are other open source container technologies (like rkt by CoreOS) and large companies that build their own container engines (like lmctfy at Google), Docker has become the industry standard for containerization. It is still built on the cgroups and namespacing provided by the Linux kernel, and recently by Windows as well.
A Docker container is made up of layers of images: binaries packed together into a single package. The base image contains the operating system of the container, which can be different from the OS of the host.
The OS of the container is in the form of an image. This is not the full operating system as on the host: the image is just the file system and binaries for the OS, while the full OS includes the file system, binaries, and the kernel.
On top of the base image are multiple images that each build a portion of the container. For example, on top of the base image may be the image that contains the apt-get dependencies. On top of that may be the image that contains the application binary, and so on.
The cool part is that if there are two containers with the image layers a, b, c and a, b, d, then you only need to store one copy of each image layer a, b, c, d, both locally and in the repository. This is Docker’s union file system.
Each image, identified by a hash, is just one of many possible layers of images that make up a container. However, a container is identified only by its top-level image, which has references to its parent images. The two top-level images shown here (Image 1 and Image 2) share the first three layers. Image 2 has two additional configuration-related layers, but shares the same parent images as Image 1.
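The sharing of layers between images can be sketched with a small content-addressed store. This is a toy model, not Docker's actual storage code: the LayerStore class is hypothetical, and each layer is just a byte string standing in for a file system diff. Because a layer is keyed by the hash of its content, storing the same layer twice costs nothing extra.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: one copy per distinct layer."""

    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> layer content

    def add(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)  # no-op if already stored
        return digest

    def add_image(self, layers) -> list:
        """Store every layer of an image, returning the digests in order."""
        return [self.add(layer) for layer in layers]

    def __len__(self):
        return len(self._blobs)


if __name__ == "__main__":
    store = LayerStore()
    image_1 = store.add_image([b"a", b"b", b"c"])
    image_2 = store.add_image([b"a", b"b", b"d"])
    print("shared layers:", image_1[:2] == image_2[:2])
    print("layers stored:", len(store))
```

Two images with three layers each end up as four stored layers, matching the a, b, c, d example above.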
When a container is booted, the image and its parent images are downloaded from the repo, the cgroup and namespaces are created, and the image is used to create a virtual environment. From within the container, the files and binaries specified in the image appear to be the only files on the entire machine. Then the container’s main process is started and the container is considered alive.
Docker has some other really cool features, such as copy-on-write, volumes (shared file systems between containers), the Docker daemon (which manages containers on a machine), version-controlled repositories (like GitHub for containers), and more. To learn more about them and see some practical examples of how to use Docker, this Medium article is extremely useful.
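Copy-on-write is easier to picture with a toy model. The sketch below uses Python's collections.ChainMap to stand in for a layered file system: the read-only image sits at the bottom, and every container gets its own empty writable layer on top. This is only an analogy under my own assumptions (file paths as dict keys, file contents as strings), not how Docker's storage drivers are implemented.

```python
from collections import ChainMap


def start_container(image: dict) -> ChainMap:
    """Give a container a private writable layer over a shared image.

    Lookups fall through to the image; writes land only in the first
    (container-private) mapping, so the image itself is never modified.
    """
    return ChainMap({}, image)


if __name__ == "__main__":
    image = {"/etc/app.conf": "debug=false", "/bin/app": "<binary>"}

    c1 = start_container(image)
    c2 = start_container(image)

    c1["/etc/app.conf"] = "debug=true"   # the write stays in c1's layer

    print(c1["/etc/app.conf"])  # c1 sees its own copy
    print(c2["/etc/app.conf"])  # c2 still sees the image's version
    print(c1.maps[0])           # only the changed file is stored per container
```

Both containers share one copy of the image, and each pays only for the files it changes.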
A command line client (1) tells a process on the machine called the Docker daemon (2) what to do. The daemon pulls images from a registry/repository (3). These images are cached (4) on the local machine and can be booted up by the daemon to run containers (5). Image source: Docker
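The numbered flow in the figure can be sketched as a toy daemon with a local cache. Everything here is hypothetical (the ToyDaemon class, the dict standing in for a registry, the image name web:1.0); it only illustrates that an image is pulled once and then reused for every container started from it.

```python
class ToyDaemon:
    """Toy stand-in for the daemon (2) in the figure."""

    def __init__(self, registry: dict):
        self.registry = registry  # (3) remote registry, modeled as a dict
        self.cache = {}           # (4) images already pulled to this machine
        self.running = []         # (5) containers started so far
        self.pulls = 0            # how many times we went to the registry

    def run(self, image_name: str) -> int:
        """Start a container from an image, pulling it only if not cached."""
        if image_name not in self.cache:
            self.cache[image_name] = self.registry[image_name]
            self.pulls += 1
        self.running.append(image_name)
        return len(self.running) - 1  # toy container ID


if __name__ == "__main__":
    daemon = ToyDaemon(registry={"web:1.0": b"<image layers>"})
    # (1) the client would send these two requests to the daemon
    daemon.run("web:1.0")
    daemon.run("web:1.0")
    print("containers:", len(daemon.running), "pulls:", daemon.pulls)
```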
Why Containers
Aside from process isolation, containers have many other beneficial properties.
The container serves as a self-contained, isolated unit that can run anywhere that supports it. In each of these instances, the container itself will be exactly identical. It won’t matter if the host OS is CentOS, Ubuntu, macOS, or even something non-UNIX like Windows; from within the container, the OS will be whatever OS the container specified. Thus you can be sure the container you built on your laptop will also run on the company’s servers.
The container also acts as a standardized unit of work or compute. A common paradigm is for each container to run a single web server, a single shard of a database, a single Spark worker, etc. Then, to scale an application, you simply need to scale the number of containers.
In this paradigm, each container is given a fixed resource configuration (CPU, RAM, number of threads, etc.), and scaling the application requires scaling just the number of containers instead of the individual resource primitives. This provides a much easier abstraction for engineers when applications need to be scaled up or down.
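Here is a small sketch of that abstraction. The ContainerSpec fields and the per-container throughput number are assumptions I picked for illustration. The point is that the only knob you turn when load changes is the replica count, and total resources follow from it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSpec:
    """Fixed resource configuration for one container (illustrative fields)."""
    cpu_cores: float
    ram_mb: int
    requests_per_second: int  # assumed capacity of a single container


def replicas_needed(spec: ContainerSpec, target_rps: int) -> int:
    """Number of identical containers required to serve the target load."""
    return max(1, math.ceil(target_rps / spec.requests_per_second))


def total_resources(spec: ContainerSpec, replicas: int) -> dict:
    """Resources consumed by the whole set of replicas."""
    return {
        "cpu_cores": spec.cpu_cores * replicas,
        "ram_mb": spec.ram_mb * replicas,
    }


if __name__ == "__main__":
    web = ContainerSpec(cpu_cores=0.5, ram_mb=512, requests_per_second=200)
    n = replicas_needed(web, target_rps=1000)
    print(n, total_resources(web, n))
```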
Containers also serve as a great tool for implementing a microservice architecture, where each microservice is just a set of cooperating containers. For example, a Redis microservice can be implemented with a single master container and multiple slave containers.
This (micro)service-oriented architecture has some very important properties that make it easy for engineering teams to create and deploy applications (see my earlier article for more details).
Orchestration
Ever since the time of Linux containers, users have tried to deploy large-scale applications over many virtual machines where each process runs in its own container. Doing this requires being able to efficiently deploy tens to thousands of containers across potentially hundreds of virtual machines and manage their networking, file systems, resources, etc. Docker today makes this a little easier, as it exposes abstractions to define container networking, volumes for file systems, resource configurations, etc.
However, a tool is still needed to:
actually take a specification and assign containers to machines (scheduling)
actually boot the specified containers on the machines through Docker
deal with upgrades, rollbacks, and the constantly changing nature of the system
respond to failures like container crashes
and create cluster resources like service discovery, inter-VM networking, cluster ingress/egress, etc.
This set of problems relates to the orchestration of a distributed system built on top of a set of (possibly transient or constantly changing) containers, and people have built some really miraculous systems to solve this problem.
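To give a feel for the scheduling part, here is a minimal first-fit scheduler. It is a toy under my own assumptions: containers and machines are described only by whole CPU cores and RAM in MB, names like vm-1 and web-1 are made up, and real orchestrators use far richer placement policies.

```python
def schedule(containers: dict, machines: dict) -> dict:
    """Assign each container to the first machine with enough free capacity.

    containers: name -> (cpu_cores, ram_mb) requested
    machines:   name -> (cpu_cores, ram_mb) available
    Returns a mapping of container name -> machine name.
    """
    free = {name: list(capacity) for name, capacity in machines.items()}
    placement = {}
    for cname, (cpu, ram) in containers.items():
        for mname, capacity in free.items():
            if capacity[0] >= cpu and capacity[1] >= ram:
                capacity[0] -= cpu   # reserve the resources on that machine
                capacity[1] -= ram
                placement[cname] = mname
                break
        else:
            raise RuntimeError(f"no machine can fit container {cname!r}")
    return placement


if __name__ == "__main__":
    machines = {"vm-1": (2, 4096), "vm-2": (2, 4096)}
    containers = {"web-1": (1, 2048), "web-2": (1, 2048), "db-1": (2, 4096)}
    print(schedule(containers, machines))
```

First-fit is the simplest policy that respects resource limits. Handling crashes, upgrades, and networking is where the real systems earn their keep.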
In my next story I will talk in depth about the implementation of Kubernetes, the major open source orchestrator, along with two equally important but lesser-known ones, Mesos and Borg.
This story is part of a series. I am an undergrad at UC Berkeley. My research is in distributed systems and I am advised by Scott Shenker.