Google Cloud Data Engineer Exam - Bigtable Review Notes

Bigtable Summary

What is it?
-> relatively expensive, because you pay for the number of nodes you provision
-> with 10 nodes: about 100,000 queries per second at about 6 ms latency
-> low latency
-> high throughput -> fast
-> structured data
-> NOT transactional
-> NOT SQL
-> global availability
-> durable, replicated, and highly available

Serverless?
No

Benefits

  • Incredible scalability. Cloud Bigtable scales in direct proportion to the number of machines in your cluster. A self-managed HBase installation has a design bottleneck that limits the performance after a certain QPS is reached. Cloud Bigtable does not have this bottleneck, and so you can scale your cluster up to handle more queries.
  • Simple administration. Cloud Bigtable handles upgrades and restarts transparently, and it automatically maintains high data durability. To replicate your data, simply add a second cluster to your instance, and replication starts automatically. No more managing masters or regions; just design your table schemas, and Cloud Bigtable will handle the rest for you.
  • Cluster resizing without downtime. You can increase the size of a Cloud Bigtable cluster for a few hours to handle a large load, then reduce the cluster's size again—all without any downtime. After you change a cluster's size, it typically takes just a few minutes under load for Cloud Bigtable to balance performance across all of the nodes in your cluster.
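A rough sizing sketch for the "scales in direct proportion" point. It assumes the figure quoted earlier in these notes (10 nodes for about 100,000 QPS), which gives roughly 10,000 QPS per node. The function name `nodes_needed` is made up for illustration, and real throughput depends on the workload.

```python
import math

# Assumption: per-node throughput derived from the figure in these notes
# (10 nodes -> 100,000 QPS); it is not a measured benchmark.
QPS_PER_NODE = 100_000 / 10  # 10,000 queries per second per node


def nodes_needed(target_qps, qps_per_node=QPS_PER_NODE):
    """Estimate the node count, assuming throughput grows linearly with nodes."""
    if target_qps <= 0:
        return 1  # keep at least one node in the estimate
    return max(1, math.ceil(target_qps / qps_per_node))


print(nodes_needed(100_000))  # 10
print(nodes_needed(250_000))  # 25
print(nodes_needed(12_000))   # 2
```

Because resizing needs no downtime, an estimate like this could be recomputed before and after a temporary load spike.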

What is it good for?
Time-series data is a natural fit for Cloud Bigtable.

  • Time-series data, such as CPU and memory usage over time for multiple servers.
  • Marketing data, such as purchase histories and customer preferences.
  • Financial data, such as transaction histories, stock prices, and currency exchange rates.
  • Internet of Things data, such as usage reports from energy meters and home appliances.
  • Graph data, such as information about how users are connected to one another.

How to use it?

cbt

  • A command-line interface for performing several different operations on Cloud Bigtable.

HBase shell

  • Use the HBase shell to connect to a Cloud Bigtable instance, perform basic administrative tasks, and read and write data in a table.

Indexing
-> rows are indexed by row key only; no other column can be indexed
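A small pure-Python sketch of what "indexed by row key only" means in practice. It is a toy model, not the Bigtable client API: the rows, the key format `cust<id>#order<n>` and both function names are made up for illustration. Rows are kept sorted by key, so a key prefix can be located by binary search, while a lookup on any other column has to inspect every row.

```python
from bisect import bisect_left

# Hypothetical rows: row key -> column values, kept sorted by row key.
rows = {
    "cust07#order0001": {"item": "kettle"},
    "cust42#order0001": {"item": "lamp"},
    "cust42#order0002": {"item": "kettle"},
    "cust99#order0001": {"item": "desk"},
}
sorted_keys = sorted(rows)


def prefix_scan(prefix):
    """Binary-search to the first key >= prefix, then read consecutive keys."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def scan_by_column(column, wanted):
    """No index on columns: every row has to be inspected."""
    return [key for key in sorted_keys if rows[key].get(column) == wanted]


print(prefix_scan("cust42#"))            # ['cust42#order0001', 'cust42#order0002']
print(scan_by_column("item", "kettle"))  # ['cust07#order0001', 'cust42#order0002']
```

The prefix scan touches only the matching rows, which is why the most common queries should be answerable from the row key.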

Design
In summary:

Strike a balance between:
Distributing the read load across tablets (you don’t want all reads to hit a single tablet)
AND
Distributing the write load across tablets (you don’t want all writes to hit a single tablet)
AND
Designing the row key so that common queries return consecutive rows

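First, check whether the thing you want to query is in the row key.

Then check whether the key contains any of the following, to avoid hotspotting: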

Avoid using a row key that is a domain or starts with a domain (a domain can still be part of the key)

-> because certain domains are far more active than others

-> the tablets holding the rows for those domains will become hot spots

Avoid using the user ID as the row key if user IDs are assigned sequentially

-> it is OK if user IDs are assigned randomly, e.g. derived from a hash

-> reason: in many applications, newer users are more active than users who signed up 6-7 years ago

-> so if user IDs are assigned in sequential order, the tablets that hold the newest users tend to be the most active -> hotspotting
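A toy simulation of the sequential-versus-hashed point. All of it is an assumption made for illustration: 1,000 users, 4 "tablets" that each hold a contiguous quarter of the sorted keys, and traffic coming only from the newest 100 users. Real tablets are managed by Bigtable and are not modelled here, and the MD5 prefix is just one possible way to scatter keys.

```python
import hashlib
from collections import Counter

NUM_USERS = 1000            # hypothetical user base
NUM_TABLETS = 4             # toy model: contiguous quarters of the sorted keys
ACTIVE = range(900, 1000)   # assumption: only the newest 100 users are active


def sequential_key(user_id):
    """Zero-padded ID, so keys sort in sign-up order."""
    return f"user{user_id:06d}"


def hashed_key(user_id):
    """Hash prefix first, so neighbouring IDs sort far apart."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return f"{digest[:8]}#{user_id}"


def tablet_load(make_key):
    """Count how many active users land on each toy tablet."""
    keys = sorted(make_key(u) for u in range(NUM_USERS))
    per_tablet = len(keys) // NUM_TABLETS
    tablet_of = {k: min(i // per_tablet, NUM_TABLETS - 1) for i, k in enumerate(keys)}
    return Counter(tablet_of[make_key(u)] for u in ACTIVE)


print(tablet_load(sequential_key))  # Counter({3: 100}): one tablet takes everything
print(tablet_load(hashed_key))      # active users spread over several tablets
```

The trade-off is that hashed keys no longer sort in ID order, so scanning a range of user IDs stops being a scan over consecutive rows.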

Avoid using a static identifier as the key, especially one that keeps being written to

-> if the row key is just memory usage, CPU usage or disk usage and you keep updating those rows over and over, the nodes that serve this constantly updated data get overloaded

Avoid using dates as (or at the start of) the row key: most writes carry the latest date, so they land on the same tablet -> hotspotting
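A toy simulation of the date problem, under stated assumptions: 8 hypothetical energy meters, 100 timestamps already stored, and 4 "tablets" modelled as contiguous ranges of the stored keys. The key formats and function names are made up, and real tablet splitting is handled by Bigtable and is not modelled. It compares where one new round of writes lands when the timestamp leads the key and when the meter ID leads it.

```python
from bisect import bisect_right
from collections import Counter

METERS = [f"meter{m:02d}" for m in range(8)]  # hypothetical device IDs
HISTORY = range(100)                          # timestamps already stored
NUM_TABLETS = 4                               # toy model: contiguous key ranges


def time_first(meter, ts):
    return f"{ts:010d}#{meter}"


def meter_first(meter, ts):
    return f"{meter}#{ts:010d}"


def load_of_new_writes(make_key, new_ts=100):
    """Count which toy tablet receives each meter's newest write."""
    stored = sorted(make_key(m, t) for m in METERS for t in HISTORY)
    step = len(stored) // NUM_TABLETS
    # The first key of tablets 1..N-1 acts as a split point.
    splits = [stored[i * step] for i in range(1, NUM_TABLETS)]
    return Counter(bisect_right(splits, make_key(m, new_ts)) for m in METERS)


print(dict(load_of_new_writes(time_first)))   # {3: 8}: every write hits the last tablet
print(dict(load_of_new_writes(meter_first)))  # {0: 2, 1: 2, 2: 2, 3: 2}
```

With the meter ID first, each meter's readings also stay in consecutive rows, which matches the "common queries return consecutive rows" goal from the summary above.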
