
How To Get The Number Of Requests In Queue In Scrapy?

I am using Scrapy to crawl some websites. How do I get the number of requests in the queue? I have looked at the Scrapy source code and found that scrapy.core.scheduler.Scheduler may lead…

Solution 1:

This took me a while to figure out, but here's what I used:

self.crawler.engine.slot.scheduler

That is the scheduler instance. You can then call its __len__() method, or, if you just need a true/false answer for pending requests, do something like this:

self.crawler.engine.scheduler_cls.has_pending_requests(self.crawler.engine.slot.scheduler)

Beware that there may still be requests in flight even though the queue is empty. To check how many requests are currently running, use:

len(self.crawler.engine.slot.inprogress)
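For context, here is a minimal sketch of how the pieces above fit together inside a spider callback. The spider name and URL are placeholders, and these engine attributes are undocumented internals, so names may differ between Scrapy versions:

import scrapy

class QueueStatusSpider(scrapy.Spider):
    name = "queue_status"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        scheduler = self.crawler.engine.slot.scheduler
        # len() calls the scheduler's __len__() and returns the number
        # of requests still waiting in the queue
        self.logger.info("queued: %d", len(scheduler))
        # simpler equivalent of the scheduler_cls call above
        self.logger.info("has pending: %s", scheduler.has_pending_requests())
        # requests that have left the queue but are still being downloaded
        self.logger.info("in progress: %d", len(self.crawler.engine.slot.inprogress))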

Solution 2:

An approach to answer your questions:

From the documentation (http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order):

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

So self.dqs and self.mqs are self-explanatory (disk queue scheduler and memory queue scheduler).

Another SO answer (Storing scrapy queue in a database) suggests accessing Scrapy's internal queue representation, queuelib: https://github.com/scrapy/queuelib

Once you have it, you just need to count the length of the queue.
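As a rough sketch, assuming the default scheduler and its internal mqs/dqs attributes (these are undocumented internals and may change between versions):

scheduler = crawler.engine.slot.scheduler
pending = len(scheduler.mqs)       # requests in the memory queue
if scheduler.dqs is not None:      # the disk queue only exists when JOBDIR is set
    pending += len(scheduler.dqs)
print(f"{pending} requests pending")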

Solution 3:

Found this question because I was trying to implement a progress bar for a Scrapy spider and thought I would share what I figured out. For current versions of Scrapy (I'm using 2.5), I recommend using signals with a custom extension (although that might depend on what you're trying to do with the total).

Basically, you want to bind to the request_scheduled signal and increment your total every time that signal is fired, and also bind to the request_dropped signal and decrement your total whenever that one is fired.

If you want to know how many have been scheduled but not yet processed, you could do the same thing but also bind to the item_scraped signal and decrease the total as scheduled requests are processed (possibly also item_dropped, depending on the spider).

Here's an example extension that tracks the total number of requests that have been queued per named spider:

from collections import defaultdict

from scrapy import signals
from scrapy.exceptions import NotConfigured


class QueueTotal:
    """Scrapy extension to track the number of requests that have been queued."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.items_scraped = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool("QUEUETOTAL_ENABLED"):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.request_dropped, signal=signals.request_dropped)

        # return the extension object
        return ext

    def request_scheduled(self, request, spider):
        # increase total when new requests are scheduled
        self.totals[spider.name] += 1

    def request_dropped(self, request, spider):
        # decrease total when requests are dropped
        self.totals[spider.name] -= 1
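To actually use it, the extension has to be enabled in your project settings. A minimal sketch, assuming the class lives in a module importable as myproject.extensions (the module path is a placeholder):

# settings.py
EXTENSIONS = {
    "myproject.extensions.QueueTotal": 500,
}
QUEUETOTAL_ENABLED = True

With that in place, self.totals[spider.name] holds the running count of queued requests for each spider, which you can read off to drive a progress bar.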
