接上文:500 Lines or Less: A Web Crawler With asyncio Coroutines 異步網絡爬蟲(一)
Coordinating Coroutines
We began by describing how we want our crawler to work. Now it is time to implement it with asyncio coroutines.
我們一開始描述了希望爬蟲如何工作,現在是時候用asyncio協程來實現它了。
Our crawler will fetch the first page, parse its links, and add them to a queue. After this it fans out across the website, fetching pages concurrently. But to limit load on the client and server, we want some maximum number of workers to run, and no more. Whenever a worker finishes fetching a page, it should immediately pull the next link from the queue. We will pass through periods when there is not enough work to go around, so some workers must pause. But when a worker hits a page rich with new links, then the queue suddenly grows and any paused workers should wake and get cracking. Finally, our program must quit once its work is done.
我們的爬蟲會先抓取第一個頁面,解析其中的鏈接,并把它們加入隊列。在這之后,它會在整個網站上展開,并發地抓取頁面。但是為了限制客戶端和服務器上的負載,我們希望同時運行的worker數量不超過某個上限。每當一個worker抓取完一個頁面,它應該立即從隊列中取出下一個鏈接。有時工作量不夠分配,一些worker必須暫停。但是當某個worker遇到一個富含新鏈接的頁面時,隊列會突然增長,所有暫停的worker都應該被喚醒并開始干活。最后,我們的程序必須在工作完成后退出。
Imagine if the workers were threads. How would we express the crawler's algorithm? We could use a synchronized queue[1] from the Python standard library. Each time an item is put in the queue, the queue increments its count of "tasks". Worker threads call task_done after completing work on an item. The main thread blocks on Queue.join until each item put in the queue is matched by a task_done call, then it exits.
想象一下,如果worker是線程,我們將如何表達爬蟲的算法?我們可以使用Python標準庫中的同步隊列[1]。每次將一個項目放入隊列時,隊列都會增加其"任務"計數。工作線程在完成某個項目后調用task_done。主線程在Queue.join上阻塞,直到放入隊列的每個項目都有對應的task_done調用,然后才退出。
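For comparison, a minimal sketch of that threaded pattern might look like this (a hypothetical illustration, not code from the chapter; the crawl_one helper is a placeholder):
作為對比,這個線程版模式的極簡示意大致如下(假設性的示例,并非本章代碼;其中的crawl_one只是一個占位的輔助函數):
import threading
import queue

q = queue.Queue()

def crawl_one(url):
    # Placeholder for "fetch the page and enqueue any new links".
    pass

def worker():
    while True:
        url = q.get()        # Blocks until an item is available.
        crawl_one(url)
        q.task_done()        # Tell the queue this item is finished.

for _ in range(10):
    threading.Thread(target=worker, daemon=True).start()

q.put('http://xkcd.com')
q.join()                     # Returns once every put() is matched by a task_done().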
Coroutines use the exact same pattern with an asyncio queue! First we import it[2]:
協程對asyncio隊列使用完全相同的模式!首先我們導入它[2]:
try:
    from asyncio import JoinableQueue as Queue
except ImportError:
    # In Python 3.5, asyncio.JoinableQueue is
    # merged into Queue.
    from asyncio import Queue
We collect the workers' shared state in a crawler class, and write the main logic in its crawl method. We start crawl on a coroutine and run asyncio's event loop until crawl finishes:
我們把workers的共享狀態收集在一個crawler類中,并在它的crawl方法中編寫主邏輯。我們把crawl作為協程啟動,并運行asyncio的事件循環,直到crawl完成:
loop = asyncio.get_event_loop()
crawler = crawling.Crawler('http://xkcd.com',
                           max_redirect=10)
loop.run_until_complete(crawler.crawl())
The crawler begins with a root URL and max_redirect, the number of redirects it is willing to follow to fetch any one URL. It puts the pair (URL, max_redirect) in the queue. (For the reason why, stay tuned.)
爬蟲從一個根URL和max_redirect開始,后者是它為抓取任意一個URL所愿意跟隨的最大重定向次數。它把(URL, max_redirect)這個二元組放進隊列。(至于為什么,請繼續往下看。)
class Crawler:
    def __init__(self, root_url, max_redirect):
        self.max_tasks = 10
        self.max_redirect = max_redirect
        self.q = Queue()
        self.seen_urls = set()

        # aiohttp's ClientSession does connection pooling and
        # HTTP keep-alives for us.
        self.session = aiohttp.ClientSession(loop=loop)

        # Put (URL, max_redirect) in the queue.
        self.q.put((root_url, self.max_redirect))
The number of unfinished tasks in the queue is now one. Back in our main script, we launch the event loop and the crawl method:
現在隊列中未完成任務的數量是1。回到我們的主腳本,我們啟動事件循環和crawl方法:
loop.run_until_complete(crawler.crawl())
The crawl coroutine kicks off the workers. It is like a main thread: it blocks on join until all tasks are finished, while the workers run in the background.
crawl協程啟動這些workers。它像一個主線程:它阻塞在join上,直到所有任務完成,而workers在后臺運行。
    @asyncio.coroutine
    def crawl(self):
        """Run the crawler until all work is done."""
        workers = [asyncio.Task(self.work())
                   for _ in range(self.max_tasks)]

        # When all work is done, exit.
        yield from self.q.join()
        for w in workers:
            w.cancel()
If the workers were threads we might not wish to start them all at once. To avoid creating expensive threads until it is certain they are necessary, a thread pool typically grows on demand. But coroutines are cheap, so we simply start the maximum number allowed.
如果workers是線程,我們可能不希望一次性全部啟動它們。為了避免在確定確實需要之前就創建昂貴的線程,線程池通常按需增長。但協程很便宜,所以我們干脆直接啟動允許的最大數量。
It is interesting to note how we shut down the crawler. When the join future resolves, the worker tasks are alive but suspended: they wait for more URLs but none come. So, the main coroutine cancels them before exiting. Otherwise, as the Python interpreter shuts down and calls all objects' destructors, living tasks cry out:
值得注意的是我們如何關閉爬蟲。當join這個future完成時,worker任務還活著,只是被掛起:它們在等待更多的URL,但再也不會有了。因此,主協程在退出之前取消它們。否則,當Python解釋器關閉并調用所有對象的析構函數時,仍然存活的任務就會哀嚎:
ERROR:asyncio:Task was destroyed but it is pending!
And how does cancel work? Generators have a feature we have not yet shown you. You can throw an exception into a generator from outside:
那么cancel是如何工作的?生成器有一個我們還沒有展示過的特性:你可以從外部向生成器中拋入一個異常:
>>> gen = gen_fn()
>>> gen.send(None)  # Start the generator as usual.
1
>>> gen.throw(Exception('error'))
Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "<input>", line 2, in gen_fn
Exception: error
The generator is resumed by throw, but it is now raising an exception. If no code in the generator's call stack catches it, the exception bubbles back up to the top. So to cancel a task's coroutine:
throw使生成器恢復執行,但此時它拋出了一個異常。如果生成器的調用棧中沒有代碼捕獲它,異常就會一路冒泡回到頂層。所以,要取消一個任務的協程:
    # Method of Task class.
    def cancel(self):
        self.coro.throw(CancelledError)
Wherever the generator is paused, at some yield from statement, it resumes and throws an exception. We handle cancellation in the task's step method:
無論生成器暫停在哪個yield from語句上,它都會從那里恢復并拋出這個異常。我們在任務的step方法中處理取消:
    # Method of Task class.
    def step(self, future):
        try:
            next_future = self.coro.send(future.result)
        except CancelledError:
            self.cancelled = True
            return
        except StopIteration:
            return

        next_future.add_done_callback(self.step)
Now the task knows it is cancelled, so when it is destroyed it does not rage against the dying of the light.
現在任務知道自己被取消了,所以當它被銷毀時,就不會再"怒斥光明的消逝"了。
Once crawl has canceled the workers, it exits. The event loop sees that the coroutine is complete (we shall see how later), and it too exits:
一旦crawl取消了workers,它就退出了。事件循環看到這個協程已經完成(稍后我們會看到它是如何知道的),然后它也退出了:
loop.run_until_complete(crawler.crawl())
The crawl method comprises all that our main coroutine must do. It is the worker coroutines that get URLs from the queue, fetch them, and parse them for new links. Each worker runs the work coroutine independently:
crawl方法包含了我們的主協程必須做的所有事情。而從隊列獲取URL、抓取頁面并從中解析新鏈接的,則是worker協程。每個worker獨立地運行work協程:
    @asyncio.coroutine
    def work(self):
        while True:
            url, max_redirect = yield from self.q.get()

            # Download page and add new links to self.q.
            yield from self.fetch(url, max_redirect)
            self.q.task_done()
Python sees that this code contains yield from statements, and compiles it into a generator function. So in crawl, when the main coroutine calls self.work ten times, it does not actually execute this method: it only creates ten generator objects with references to this code. It wraps each in a Task. The Task receives each future the generator yields, and drives the generator by calling send with each future's result when the future resolves. Because the generators have their own stack frames, they run independently, with separate local variables and instruction pointers.
Python看到這段代碼包含yield from語句,于是把它編譯成一個生成器函數。所以在crawl中,當主協程調用self.work十次時,它并不會真正執行這個方法:它只是創建了十個引用這段代碼的生成器對象,并把每個生成器包裝進一個Task。Task接收生成器yield出的每一個future,并在future完成時用future的結果調用send來驅動生成器。因為每個生成器都有自己的棧幀,它們獨立運行,擁有各自的局部變量和指令指針。
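As a reminder, the Task class from the first part of this chapter wraps a coroutine and kicks it off with one initial step; it looked roughly like this (reproduced from memory, details may differ):
提醒一下,本章第一部分中的Task類包裝一個協程,并用一次初始的step把它啟動起來;大致如下(憑印象重現,細節可能略有出入):
class Task:
    def __init__(self, coro):
        self.coro = coro
        # Drive the generator to its first yield by sending in a
        # pre-resolved future; step is the method shown above.
        f = Future()
        f.set_result(None)
        self.step(f)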
The worker coordinates with its fellows via the queue. It waits for new URLs with:
worker通過隊列與同伴們協調。它用下面這行代碼等待新的URL:
url, max_redirect = yield from self.q.get()
The queue's get method is itself a coroutine: it pauses until someone puts an item in the queue, then resumes and returns the item.
隊列的get方法本身也是一個協程:它會暫停,直到有人往隊列里放入一個項目,然后恢復并返回該項目。
Incidentally, this is where the worker will be paused at the end of the crawl, when the main coroutine cancels it. From the coroutine's perspective, its last trip around the loop ends when yield from raises a CancelledError.
順便說一下,爬取結束、主協程取消worker時,worker正是暫停在這里。從協程自己的角度來看,當yield from拋出CancelledError時,它繞著循環跑的最后一圈就結束了。
When a worker fetches a page it parses the links and puts new ones in the queue, then calls task_done to decrement the counter. Eventually, a worker fetches a page whose URLs have all been fetched already, and there is also no work left in the queue. Thus this worker's call to task_done decrements the counter to zero. Then crawl, which is waiting for the queue's join method, is unpaused and finishes.
當worker抓取到一個頁面時,它會解析其中的鏈接,把新鏈接放入隊列,然后調用task_done來遞減計數器。最終,某個worker抓到的頁面上的URL全都已經被抓取過了,而且隊列中也沒有剩余的工作。于是這個worker對task_done的調用把計數器減到零。這時,正在等待隊列join方法的crawl便不再暫停,并最終完成。
We promised to explain why the items in the queue are pairs, like:
我們承諾過解釋為什么隊列中的項目是成對的,如:
# URL to fetch, and the number of redirects left.
('http://xkcd.com/353', 10)
New URLs have ten redirects remaining. Fetching this particular URL results in a redirect to a new location with a trailing slash. We decrement the number of redirects remaining, and put the next location in the queue:
新URL還剩十次重定向機會。抓取這個特定的URL會得到一個重定向,指向一個帶尾部斜杠的新地址。我們把剩余的重定向次數減一,并把下一個地址放入隊列:
# URL with a trailing slash. Nine redirects left.
('http://xkcd.com/353/', 9)
The aiohttp package we use would follow redirects by default and give us the final response. We tell it not to, however, and handle redirects in the crawler, so it can coalesce redirect paths that lead to the same destination: if we have already seen this URL, it is in self.seen_urls and we have already started on this path from a different entry point:
我們使用的aiohttp包默認會跟隨重定向并把最終的響應返回給我們。然而,我們告訴它不要這樣做,而是在爬蟲里自行處理重定向,這樣就可以合并通向同一目的地的重定向路徑:如果我們已經見過這個URL,它就在self.seen_urls中,說明我們已經從別的入口走上過這條路徑:
[Figure: Redirects (crawler-images/redirects.png)]
The crawler fetches "foo" and sees it redirects to "baz", so it adds "baz" to the queue and to seen_urls. If the next page it fetches is "bar", which also redirects to "baz", the fetcher does not enqueue "baz" again. If the response is a page, rather than a redirect, fetch parses it for links and puts new ones in the queue.
爬蟲抓取"foo",發現它重定向到"baz",于是把"baz"加入隊列和seen_urls。如果它接下來抓取的頁面是"bar",而"bar"也重定向到"baz",抓取器就不會再次把"baz"入隊。如果響應是一個普通頁面而不是重定向,fetch會解析其中的鏈接,并把新鏈接放入隊列。
    @asyncio.coroutine
    def fetch(self, url, max_redirect):
        # Handle redirects ourselves.
        response = yield from self.session.get(
            url, allow_redirects=False)

        try:
            if is_redirect(response):
                if max_redirect > 0:
                    next_url = response.headers['location']
                    if next_url in self.seen_urls:
                        # We have been down this path before.
                        return

                    # Remember we have seen this URL.
                    self.seen_urls.add(next_url)

                    # Follow the redirect. One less redirect remains.
                    self.q.put_nowait((next_url, max_redirect - 1))
            else:
                links = yield from self.parse_links(response)
                # Python set-logic:
                for link in links.difference(self.seen_urls):
                    self.q.put_nowait((link, self.max_redirect))
                self.seen_urls.update(links)
        finally:
            # Return connection to pool.
            yield from response.release()
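The is_redirect helper used above is not shown in this excerpt; a plausible definition (an assumption, based on the HTTP status codes that signal a redirect) would be:
上面用到的is_redirect輔助函數并未在本節給出;一個合理的定義大致如下(這是基于表示重定向的HTTP狀態碼所做的假設):
def is_redirect(response):
    # Treat the common HTTP redirect statuses as redirects.
    return response.status in (300, 301, 302, 303, 307)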
If this were multithreaded code, it would be lousy with race conditions. For example, the worker checks if a link is in seen_urls, and if not the worker puts it in the queue and adds it to seen_urls. If it were interrupted between the two operations, then another worker might parse the same link from a different page, also observe that it is not in seen_urls, and also add it to the queue. Now that same link is in the queue twice, leading (at best) to duplicated work and wrong statistics.
如果這是多線程代碼,它會滿是競態條件。例如,worker檢查某個鏈接是否在seen_urls中,如果不在,就把它放入隊列并加入seen_urls。如果它在這兩個操作之間被打斷,另一個worker可能從別的頁面解析到同一個鏈接,同樣發現它不在seen_urls中,也把它加入隊列。這樣同一個鏈接就在隊列里出現了兩次,(往好了說)會導致重復的工作和錯誤的統計數據。
However, a coroutine is only vulnerable to interruption at yield from statements. This is a key difference that makes coroutine code far less prone to races than multithreaded code: multithreaded code must enter a critical section explicitly, by grabbing a lock, otherwise it is interruptible. A Python coroutine is uninterruptible by default, and only cedes control when it explicitly yields.
然而,協程只會在yield from語句處被打斷。這是一個關鍵的區別,使得協程代碼遠比多線程代碼不容易出現競態:多線程代碼必須通過獲取鎖來顯式進入臨界區,否則隨時可能被打斷。Python協程默認是不可打斷的,只有在顯式yield時才交出控制權。
We no longer need a fetcher class like we had in the callback-based program. That class was a workaround for a deficiency of callbacks: they need some place to store state while waiting for I/O, since their local variables are not preserved across calls. But the fetch coroutine can store its state in local variables like a regular function does, so there is no more need for a class.
我們不再需要基于回調的程序里那樣的fetcher類了。那個類是為彌補回調的缺陷而存在的變通方案:回調在等待I/O時需要一個地方來存儲狀態,因為它們的局部變量不會在多次調用之間保留。但fetch協程可以像普通函數一樣把狀態存放在局部變量里,所以不再需要一個類。
When fetch finishes processing the server response it returns to the caller, work. The work method calls task_done on the queue and then gets the next URL from the queue to be fetched.
當fetch處理完服務器響應后,它返回到調用者work。work方法在隊列上調用task_done,然后從隊列中取出下一個要抓取的URL。
When fetch puts new links in the queue it increments the count of unfinished tasks and keeps the main coroutine, which is waiting for q.join, paused. If, however, there are no unseen links and this was the last URL in the queue, then when work calls task_done the count of unfinished tasks falls to zero. That event unpauses join and the main coroutine completes.
當fetch把新鏈接放入隊列時,它會增加未完成任務的計數,并讓正在等待q.join的主協程繼續保持暫停。然而,如果沒有未見過的鏈接,而且這是隊列中的最后一個URL,那么當work調用task_done時,未完成任務的計數就降為零。這一事件讓join不再暫停,主協程隨之完成。
The queue code that coordinates the workers and the main coroutine is like this[3]:
協調workers和主協程的隊列代碼大致是這樣的[3]:
class Queue:
    def __init__(self):
        self._join_future = Future()
        self._unfinished_tasks = 0
        # ... other initialization ...

    def put_nowait(self, item):
        self._unfinished_tasks += 1
        # ... store the item ...

    def task_done(self):
        self._unfinished_tasks -= 1
        if self._unfinished_tasks == 0:
            self._join_future.set_result(None)

    @asyncio.coroutine
    def join(self):
        if self._unfinished_tasks > 0:
            yield from self._join_future
The main coroutine, crawl, yields from join. So when the last worker decrements the count of unfinished tasks to zero, it signals crawl to resume, and finish.
主協程crawl對join使用yield from。因此,當最后一個worker把未完成任務的計數減到零時,它就通知crawl恢復并結束。
The ride is almost over. Our program began with the call to crawl:
旅程即將結束。我們的程序從調用crawl開始:
loop.run_until_complete(self.crawler.crawl())
How does the program end? Since crawl is a generator function, calling it returns a generator. To drive the generator, asyncio wraps it in a task:
程序是如何結束的?因為crawl是一個生成器函數,調用它會返回一個生成器。為了驅動這個生成器,asyncio把它包裝在一個任務中:
class EventLoop:
    def run_until_complete(self, coro):
        """Run until the coroutine is done."""
        task = Task(coro)
        task.add_done_callback(stop_callback)
        try:
            self.run_forever()
        except StopError:
            pass

class StopError(BaseException):
    """Raised to stop the event loop."""

def stop_callback(future):
    raise StopError
When the task completes, it raises StopError, which the loop uses as a signal that it has arrived at normal completion.
當任務完成時,它引發StopError,事件循環把這個異常當作已經正常結束的信號。
But what's this? The task has methods called add_done_callback and result? You might think that a task resembles a future. Your instinct is correct. We must admit a detail about the Task class we hid from you: a task is a future.
但這是怎么回事?任務居然有add_done_callback和result這樣的方法?你可能會覺得任務很像future。你的直覺是對的。我們必須承認一個之前對你隱瞞的關于Task類的細節:任務就是future。
class Task(Future):
    """A coroutine wrapped in a Future."""
Normally a future is resolved by someone else calling set_result on it. But a task resolves itself when its coroutine stops. Remember from our earlier exploration of Python generators that when a generator returns, it throws the special StopIteration exception:
通常,future是由別人在它上面調用set_result來解決的。但是任務會在它的協程停止時自行解決。回想我們之前對Python生成器的探索:當一個生成器返回時,它會拋出特殊的StopIteration異常:
    # Method of class Task.
    def step(self, future):
        try:
            next_future = self.coro.send(future.result)
        except CancelledError:
            self.cancelled = True
            return
        except StopIteration as exc:
            # Task resolves itself with coro's return
            # value.
            self.set_result(exc.value)
            return

        next_future.add_done_callback(self.step)
So when the event loop calls task.add_done_callback(stop_callback), it prepares to be stopped by the task. Here is run_until_complete again:
所以,當事件循環調用task.add_done_callback(stop_callback)時,它就為被這個任務停止做好了準備。這里再次給出run_until_complete:
    # Method of event loop.
    def run_until_complete(self, coro):
        task = Task(coro)
        task.add_done_callback(stop_callback)
        try:
            self.run_forever()
        except StopError:
            pass
When the task catches StopIteration and resolves itself, the callback raises StopError from within the loop. The loop stops and the call stack is unwound to run_until_complete. Our program is finished.
當任務捕獲StopIteration并把自己標記為完成時,這個回調就在事件循環內部引發StopError。循環停止,調用棧回退到run_until_complete。我們的程序就此結束。
Conclusion 結論
Increasingly often, modern programs are I/O-bound instead of CPU-bound. For such programs, Python threads are the worst of both worlds: the global interpreter lock prevents them from actually executing computations in parallel, and preemptive switching makes them prone to races. Async is often the right pattern. But as callback-based async code grows, it tends to become a dishevelled mess. Coroutines are a tidy alternative. They factor naturally into subroutines, with sane exception handling and stack traces.
現代程序越來越多地是I/O密集型而不是CPU密集型。對于這樣的程序,Python線程集兩者之短:全局解釋器鎖讓它們無法真正并行地執行計算,而搶占式切換又讓它們容易出現競態。異步往往才是正確的模式。但是隨著基于回調的異步代碼不斷增長,它往往會變得凌亂不堪。協程是一個整潔的替代方案。它們自然地組織成子例程,并擁有合理的異常處理和堆棧跟蹤。
If we squint so that the yield from statements blur, a coroutine looks like a thread doing traditional blocking I/O. We can even coordinate coroutines with classic patterns from multi-threaded programming. There is no need for reinvention. Thus, compared to callbacks, coroutines are an inviting idiom to the coder experienced with multithreading.
如果我們瞇起眼睛,讓yield from語句變得模糊,協程看起來就像是在做傳統阻塞I/O的線程。我們甚至可以用多線程編程中的經典模式來協調協程,無需重新發明。因此,與回調相比,對于有多線程經驗的程序員來說,協程是一種更有吸引力的習慣用法。
But when we open our eyes and focus on the yield from statements, we see they mark points when the coroutine cedes control and allows others to run. Unlike threads, coroutines display where our code can be interrupted and where it cannot. In his illuminating essay "Unyielding"[4], Glyph Lefkowitz writes, "Threads make local reasoning difficult, and local reasoning is perhaps the most important thing in software development." Explicitly yielding, however, makes it possible to "understand the behavior (and thereby, the correctness) of a routine by examining the routine itself rather than examining the entire system."
但是當我們睜大眼睛、聚焦在yield from語句上時,我們會看到它們標記出了協程交出控制權、允許其他協程運行的位置。與線程不同,協程清楚地展示出我們的代碼哪里可以被打斷、哪里不可以。Glyph Lefkowitz在他頗具啟發性的文章"Unyielding"[4]中寫道:"線程讓局部推理變得困難,而局部推理也許是軟件開發中最重要的東西。"而顯式的yield則讓我們能夠"通過檢查例程本身而不是檢查整個系統,來理解該例程的行為(以及由此而來的正確性)"。
This chapter was written during a renaissance in the history of Python and async. Generator-based coroutines, whose devising you have just learned, were released in the "asyncio" module with Python 3.4 in March 2014. In September 2015, Python 3.5 was released with coroutines built in to the language itself. These native coroutines are declared with the new syntax "async def", and instead of "yield from", they use the new "await" keyword to delegate to a coroutine or wait for a Future.
本章寫于Python與異步歷史上的一次復興時期。你剛剛了解其實現原理的基于生成器的協程,于2014年3月隨Python 3.4的"asyncio"模塊發布。2015年9月,Python 3.5發布,協程被內置到語言本身之中。這些原生協程用新語法"async def"聲明,并且不再使用"yield from",而是用新的"await"關鍵字來委托給一個協程或等待一個Future。
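As a rough illustration (not code from the original chapter), the work coroutine above could be written with the native syntax like this:
作為一個粗略的示意(并非原章節的代碼),上面的work協程用原生語法大致可以寫成這樣:
    # Sketch: the same worker loop using Python 3.5+ native coroutine syntax.
    async def work(self):
        while True:
            url, max_redirect = await self.q.get()

            # Download page and add new links to self.q.
            await self.fetch(url, max_redirect)
            self.q.task_done()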
Despite these advances, the core ideas remain. Python's new native coroutines will be syntactically distinct from generators but work very similarly; indeed, they will share an implementation within the Python interpreter. Task, Future, and the event loop will continue to play their roles in asyncio.
盡管有這些進展,核心思想依然不變。Python新的原生協程在語法上與生成器不同,但工作方式非常相似;實際上,它們在Python解釋器內部共享同一套實現。Task、Future和事件循環將繼續在asyncio中扮演它們的角色。
Now that you know how asyncio coroutines work, you can largely forget the details. The machinery is tucked behind a dapper interface. But your grasp of the fundamentals empowers you to code correctly and efficiently in modern async environments.
現在你知道了asyncio協程是如何工作的,你大可以忘掉這些細節。這套機制被收在一個整潔的接口后面。但是你對基礎原理的掌握,能讓你在現代異步環境中寫出正確而高效的代碼。
[3] The actual asyncio.Queue implementation uses an asyncio.Event in place of the Future shown here. The difference is an Event can be reset, whereas a Future cannot transition from resolved back to pending.