How To Get The Number Of Requests In Queue In Scrapy?
Solution 1:
This took me a while to figure out, but here's what I used:
self.crawler.engine.slot.scheduler
That is the scheduler instance. You can call len() on it (the scheduler implements __len__()), or if you just need a true/false answer for pending requests, call its has_pending_requests() method:
self.crawler.engine.slot.scheduler.has_pending_requests()
Beware that there could still be running requests even though the queue is empty. To check how many requests are currently running, use:
len(self.crawler.engine.slot.inprogress)
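Putting the two together, here is a minimal sketch of how you might check both counts from inside a spider. It relies on the internal (non-public) engine attributes shown above, which may change between Scrapy versions, and the method name log_queue_state is just an illustrative choice:
import scrapy


class QueueAwareSpider(scrapy.Spider):
    name = "queue_aware"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        self.log_queue_state()
        # ... normal parsing, yielding requests and items ...

    def log_queue_state(self):
        slot = self.crawler.engine.slot
        pending = len(slot.scheduler)        # requests still waiting in the scheduler
        in_progress = len(slot.inprogress)   # requests currently being downloaded/processed
        self.logger.info("pending=%d in_progress=%d", pending, in_progress)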
Solution 2:
One approach to answering your question:
From the documentation http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
So self.dqs and self.mqs are self-explanatory (disk queue scheduler and memory queue scheduler).
Another SO answer (Storing scrapy queue in a database) suggests accessing Scrapy's internal queue representation, queuelib:
https://github.com/scrapy/queuelib
Once you get hold of it, you just need to count the length of the queue.
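As a minimal sketch, you could count the pending requests from the scheduler's memory and disk queues like this. Note that mqs and dqs are internal scheduler attributes, so their names and presence may differ across Scrapy versions (dqs is only set when a job directory is configured):
def count_pending_requests(spider):
    scheduler = spider.crawler.engine.slot.scheduler
    pending = len(scheduler.mqs)          # in-memory queue
    if scheduler.dqs is not None:         # disk queue, only present if JOBDIR is set
        pending += len(scheduler.dqs)
    return pending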
Solution 3:
Found this question because I was trying to implement a progressbar for a scrapy spider and thought I would share what I figured out. For current versions of scrapy (I'm using 2.5), I recommend using signals with a custom extension (although that might depend on what you're trying to do with the total).
Basically, you want to bind to the request_scheduled signal and increment your total every time that signal is fired, and also bind to the request_dropped signal and decrement your total whenever that one is fired.
If you want to know how many have been scheduled but not yet processed, you could do the same thing but also bind to the item_scraped signal and decrease the total as scheduled requests are processed (possibly also item_dropped, depending on the spider).
Here's an example extension that tracks the total number of requests that have been queued per named spider:
from collections import defaultdict
from scrapy import signals
from scrapy.exceptions import NotConfigured
class QueueTotal:
    """Scrapy extension to track the number of requests that have been queued."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.items_scraped = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool("QUEUETOTAL_ENABLED"):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.request_dropped, signal=signals.request_dropped)

        # return the extension object
        return ext

    def request_scheduled(self, request, spider):
        # increase total when new requests are scheduled
        self.totals[spider.name] += 1

    def request_dropped(self, request, spider):
        # decrease total when requests are dropped
        self.totals[spider.name] -= 1
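To use the extension, it has to be enabled in the project settings. A minimal sketch, assuming the class above lives in a hypothetical module myproject.extensions:
# settings.py (module path is an assumption; adjust to where QueueTotal lives)
EXTENSIONS = {
    "myproject.extensions.QueueTotal": 500,
}
QUEUETOTAL_ENABLED = True
The per-spider counts are then available through the extension's totals dict. If you also want the "scheduled but not yet processed" count described above, connect item_scraped the same way in from_crawler and decrement the total in its handler.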