(GeekBand) System Design and Practice: Case Studies

Cases

  • News Feeds
  • Stats Server
  • Web Crawler
  • Amazon Product Page

News Feed

Define feed

Organize
  • aggregate (group by category)
  • dedup (remove duplicates)
  • sort (order the results)
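
The three steps above can be sketched in a few lines of Python. This is only an illustration: the item fields `id`, `topic`, and `timestamp` are hypothetical and do not come from the notes.

```python
from collections import defaultdict

def organize(items):
    """Dedup, sort, and aggregate feed items.

    Each item is a dict with hypothetical keys: id, topic, timestamp.
    """
    # dedup: keep only the first occurrence of each id
    seen = set()
    unique = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)

    # sort: newest first
    unique.sort(key=lambda it: it["timestamp"], reverse=True)

    # aggregate: group by topic, keeping the sorted order inside each group
    groups = defaultdict(list)
    for item in unique:
        groups[item["topic"]].append(item)
    return dict(groups)

items = [
    {"id": 1, "topic": "a", "timestamp": 10},
    {"id": 2, "topic": "b", "timestamp": 30},
    {"id": 1, "topic": "a", "timestamp": 10},  # duplicate of id 1
    {"id": 3, "topic": "a", "timestamp": 20},
]
result = organize(items)
```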

Level 1.0

Database Schema:
  • User
  • Friendship
  • News
GetNewsfeed:
  • merge news
  • Newsfeed vs News

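Why is this bad?

A user has 100+ friends.

1 query --> get the friend list

1 query --> get the news:

SELECT news
WHERE timestamp > xxx
AND sourceid IN (friend list)
LIMIT 1000

IN is slow: with 100+ IDs in the list, the database either does a sequential scan or runs 100+ index queries.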
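
A minimal sketch of the Level 1.0 design, using Python's built-in `sqlite3` module. The table and column names (`users`, `friendship`, `news`, `source_id`, and so on) are assumptions made for illustration, and the `ORDER BY` clause is added here so that `LIMIT` returns the newest rows; the notes only give the pseudo-query.

```python
import sqlite3

def get_newsfeed(conn, user_id, since, limit=1000):
    # Query 1: fetch the friend list.
    friend_ids = [row[0] for row in conn.execute(
        "SELECT friend_id FROM friendship WHERE user_id = ?", (user_id,))]
    if not friend_ids:
        return []
    # Query 2: one IN (...) query over all friends. This is the slow part.
    placeholders = ",".join("?" * len(friend_ids))
    sql = (
        "SELECT id, source_id, content, timestamp FROM news "
        f"WHERE timestamp > ? AND source_id IN ({placeholders}) "
        "ORDER BY timestamp DESC LIMIT ?"
    )
    return conn.execute(sql, (since, *friend_ids, limit)).fetchall()

# Hypothetical schema for the three tables named in the notes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE friendship (user_id INTEGER, friend_id INTEGER);
CREATE TABLE news (id INTEGER PRIMARY KEY, source_id INTEGER,
                   content TEXT, timestamp INTEGER);
""")
conn.executemany("INSERT INTO friendship VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany(
    "INSERT INTO news (source_id, content, timestamp) VALUES (?, ?, ?)",
    [(2, "hello", 100), (3, "world", 200), (4, "not a friend", 300)])

feed = get_newsfeed(conn, user_id=1, since=50)
```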

Level 2.0

Pull vs Push

Pull: get news from each friend and merge it together. (The news feed is generated when the user requests it.)

Push: the news feed is generated when the news is generated. (A separate table stores the news feed, so the same news may be stored more than once.)

Push:
1 query to select the latest 1000 news feed entries.
100+ insert queries (async).

Disadvantage: news delay.
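
The push model can be sketched as follows. In-memory dictionaries stand in for the friendship and news feed tables, all names are hypothetical, and the fan-out loop runs synchronously here, whereas the notes describe it as asynchronous.

```python
from collections import defaultdict, deque

FEED_SIZE = 1000  # matches the "latest 1000" in the notes

# author_id -> ids of the users who receive that author's news
friends = defaultdict(set)
# user_id -> precomputed feed, newest first (stands in for the news feed table)
newsfeed = defaultdict(lambda: deque(maxlen=FEED_SIZE))

def post_news(author_id, news):
    # Write path: one insert per receiver (100+ inserts for 100+ friends).
    for receiver_id in friends[author_id]:
        newsfeed[receiver_id].appendleft(news)

def get_newsfeed(user_id):
    # Read path: a single lookup, no merging at request time.
    return list(newsfeed[user_id])

friends[1] = {2, 3}
post_news(1, "a")
post_news(1, "b")
```

Reads become cheap because the merge has already happened at write time; the cost moves to `post_news`, which is where the delay comes from.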

Level 3.0

Popular star (Justin Bieber)

Followers: 13M+

An async push may take over 30 minutes (13M+ insertions, so the delay is too long).

Push + Pull

For a popular star, don't push news to followers.

For every news feed request, merge the news feed from non-popular users (push) with the news from popular users (pull).
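
A rough sketch of the push + pull hybrid. The popularity threshold, the in-memory structures, and all function names are assumptions for illustration. It also assumes that posts arrive in timestamp order and that an author's popularity does not change between posts; a real system would have to handle both.

```python
import heapq
from collections import defaultdict

POPULAR_THRESHOLD = 3  # hypothetical cutoff; real systems would use far more

followers = defaultdict(set)     # author -> users following the author
following = defaultdict(set)     # user -> authors the user follows
own_news = defaultdict(list)     # author -> [(timestamp, text)], newest first
pushed_feed = defaultdict(list)  # user -> [(timestamp, text)], newest first

def follow(user, author):
    followers[author].add(user)
    following[user].add(author)

def is_popular(author):
    return len(followers[author]) >= POPULAR_THRESHOLD

def post_news(author, timestamp, text):
    own_news[author].insert(0, (timestamp, text))
    if is_popular(author):
        return  # popular star: skip the fan-out entirely
    for user in followers[author]:
        pushed_feed[user].insert(0, (timestamp, text))

def get_newsfeed(user, limit=1000):
    # Pull only from the popular authors, then merge with the pushed feed.
    pulled = [own_news[a] for a in following[user] if is_popular(a)]
    merged = heapq.merge(pushed_feed[user], *pulled,
                         key=lambda n: n[0], reverse=True)
    return list(merged)[:limit]

for u in ("u1", "u2", "u3"):
    follow(u, "star")
follow("u1", "bob")
post_news("bob", 1, "bob: hi")
post_news("star", 2, "star: new single")
post_news("bob", 3, "bob: bye")
feed = get_newsfeed("u1")
```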

Level 4.0

Push disadvantages
  • Realtime (pushed news arrives with a delay)
  • Storage (duplicates)
  • Edit (an edited post has to be updated in every copy)
Go back to PULL:
  • Cache each user's latest (14 days) news
  • Broadcast requests to multiple servers (sharded by userId)
  • Merge & sort the news feed
  • Cache the news feed for this user with a timestamp
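
The pull steps above could look roughly like this. The shard count, the modulo sharding rule, and the reading of "cache with a timestamp" (on the next request, pull only news newer than the cached timestamp) are assumptions, not details from the notes. The 14-day retention is not modeled, and the sketch assumes the friend list is stable and that news is posted with increasing timestamps.

```python
import heapq

NUM_SHARDS = 4  # hypothetical number of cache servers

# shards[i]: user_id -> that user's recent news, newest first, as (timestamp, text)
shards = [dict() for _ in range(NUM_SHARDS)]
feed_cache = {}  # user_id -> (cached_at, feed)

def shard_for(user_id):
    # Shard by user id.
    return shards[user_id % NUM_SHARDS]

def post_news(user_id, timestamp, text):
    shard_for(user_id).setdefault(user_id, []).insert(0, (timestamp, text))

def get_newsfeed(user_id, friend_ids, now, limit=1000):
    cached_at, cached_feed = feed_cache.get(user_id, (float("-inf"), []))
    # Broadcast: ask each friend's shard only for news newer than the cache.
    fresh_lists = [
        [n for n in shard_for(f).get(f, []) if n[0] > cached_at]
        for f in friend_ids
    ]
    # Merge & sort: k-way merge of lists that are already newest first.
    fresh = heapq.merge(*fresh_lists, key=lambda n: n[0], reverse=True)
    feed = (list(fresh) + cached_feed)[:limit]
    feed_cache[user_id] = (now, feed)  # cache the feed with its timestamp
    return feed

post_news(2, 10, "a")
post_news(3, 20, "b")
first = get_newsfeed(1, [2, 3], now=25)
post_news(2, 30, "c")
second = get_newsfeed(1, [2, 3], now=35)
```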

Click Stats Server

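How are click stats stored?

A poor candidate will suggest writing back to a data store on every click.

A good candidate will suggest some form of aggregation tier that accepts clickstream data, aggregates it, and writes it back to a persistent data store periodically.

A great candidate will suggest a low-latency messaging system to buffer the click data and transfer it to the aggregation tier.

If stats are needed daily, storing the data in HDFS and running map/reduce jobs to compute the stats is a reasonable approach.

If stats are needed in near real time, the aggregation logic should compute the stats.

PS: How do we count mouse clicks and the areas that were clicked? An average programmer stores the data (log) for every click directly in the database tier. A better programmer adds a middle tier between the front end and the database that aggregates the click stream and flushes it to the database at intervals (every 1 or 10 minutes), which greatly reduces the load on the back end. An excellent programmer combines both considerations: when the data volume is large and real-time results matter little, distributed batch processing can be used, with the aggregation tier flushed once a day; when freshness matters, the flush interval should be shortened accordingly.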
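
A minimal sketch of the aggregation tier described above. The class name, the dict used as the "persistent store", and the explicit `now` argument are all hypothetical; the messaging system in front of the tier is not modeled.

```python
from collections import Counter

class ClickAggregator:
    """Buffers click counts in memory and flushes them periodically."""

    def __init__(self, store, flush_interval=60.0):
        self.store = store                  # stand-in for the persistent data store
        self.flush_interval = flush_interval  # seconds between write-backs
        self.pending = Counter()            # clicks not yet written back
        self.last_flush = 0.0

    def record(self, key, now):
        # One in-memory increment per click instead of one database write.
        self.pending[key] += 1
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        # One write per distinct key, however many clicks it received.
        for key, count in self.pending.items():
            self.store[key] = self.store.get(key, 0) + count
        self.pending.clear()
        self.last_flush = now

store = {}
agg = ClickAggregator(store, flush_interval=60.0)
agg.record("buy-button", now=1)
agg.record("buy-button", now=2)
agg.record("banner", now=3)
before_flush = dict(store)        # nothing has been written back yet
agg.record("buy-button", now=61)  # interval elapsed, triggers a flush
```

Shortening `flush_interval` moves the design toward near real time; lengthening it toward a day approaches the batch case.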

Cache Requirement

  • When a request comes in, look it up in the cache; on a hit, return the response from the cache and do not pass the request to the system.
  • If the request is not found in the cache, pass it on to the system.
  • Since the cache can only store the last n requests, insert the (n+1)th request into the cache and delete one of the older requests.
  • Design the cache so that all operations (lookup, delete, and insert) run in O(1).
PS: How to design the cache (LRU-related):
  • Cache the responses for some requests in this tier; if an incoming request already has a cached response, there is no need to send it to the back-end system
  • If no cached response is found, send the request to the back end
  • Cache in a fixed-length queue that holds the most recent n requests: new entries enter at the head and old ones leave at the tail
  • Keep the lookup in this tier at O(1)
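
One common way to meet these requirements is an LRU cache built on `collections.OrderedDict`, which combines a hash map with a doubly linked list so that lookup, insert, and delete are all O(1). The sketch below is one possible design, not the one from the lecture; `handle` and `backend` are hypothetical names, and a cached response of `None` is treated as a miss for simplicity.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first, newest last

    def get(self, request):
        if request not in self.entries:
            return None
        self.entries.move_to_end(request)  # mark as most recently used
        return self.entries[request]

    def put(self, request, response):
        if request in self.entries:
            self.entries.move_to_end(request)
        self.entries[request] = response
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

def handle(cache, request, backend):
    response = cache.get(request)
    if response is None:             # miss: pass the request to the system
        response = backend(request)
        cache.put(request, response)
    return response

calls = []

def backend(request):
    calls.append(request)
    return request.upper()

cache = LRUCache(capacity=2)
for req in ["a", "b", "a", "c", "b"]:
    handle(cache, req, backend)
```

In this run the second "a" is served from the cache, "c" evicts "b", and the final "b" has to go to the back end again.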

Web Crawler

Crawler (spider)

Amazon Product Page

The product page includes information such as
  • product information
  • user information
  • recommended products (what other customers buy after viewing this item, recommendations for you similar to this product, etc.)