Crawl Data From a Website Using Scrapy 1.5.0 - Python
I am trying to crawl data from a website with Scrapy (1.5.0) in Python. Project directory:

stack/
    scrapy.cfg
    stack/
        __init__.py
        items.py
Solution 1:
Set a user agent.
Go to your Scrapy project's settings.py and paste this in:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'
Solution 2:
If you just want to crawl the website and get the source code, this might help.
import urllib.request as req

def imLS():
    url = "https://batdongsan.com.vn/nha-dat-ban"
    data = req.Request(url)
    resp = req.urlopen(data)
    respData = resp.read()
    print(respData)

imLS()
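Note that urllib sends a default "Python-urllib" user agent, which sites that block the default agent will reject. A minimal sketch of the same request with a browser User-Agent attached (reusing the Chrome string from Solution 1):

```python
import urllib.request as req

# Assumption: the site blocks the default Python-urllib agent,
# so we reuse the Chrome user agent string from Solution 1.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')

def make_request(url):
    # Attach the User-Agent header before opening the URL.
    return req.Request(url, headers={'User-Agent': USER_AGENT})

request = make_request("https://batdongsan.com.vn/nha-dat-ban")
# urllib normalizes header names to capitalized form internally.
print(request.get_header('User-agent'))
```

Pass the resulting request to req.urlopen() as in the snippet above; only the extra header differs.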
Solution 3:
Found the answer here: http://edmundmartin.com/random-user-agent-requests-python/ — you need to set a User-Agent header to get past the site's crawl prevention.
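The linked article rotates through a pool of browser user agents. A minimal sketch of the idea (the pool below is illustrative, not taken from the article):

```python
import random

# Illustrative pool of desktop browser user agents (assumption: any
# realistic browser string gets past a default-agent block).
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 '
    '(KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
]

def random_headers():
    # Pick a fresh agent for each request to vary the fingerprint.
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```

Merge the returned dict into each request's headers so successive requests don't all present the same agent string.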
Solution 4:
To parse each page you need to add a little more code.
import re

from scrapy import Spider
from scrapy.selector import Selector

class StackSpider(Spider):
    name = "batdongsan"
    allowed_domains = ["<DOMAIN>"]
    start_urls = [
        "https://<DOMAIN>/nha-dat-ban",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="p-title"]/h3')
        # This part of the code collects only titles. Add more fields
        # to be collected if you need them.
        for question in questions:
            title = question.xpath('a/text()').extract_first().strip()
            yield {'title': title}

        if not re.search(r'\d+', response.url):
            # Now we have to go through the pagination links.
            url_prefix = response.css('div.background-pager-right-controls a::attr(href)').extract_first()
            url_last = response.css('div.background-pager-right-controls a::attr(href)').extract()[-1]
            max_page = re.findall(r'\d+', url_last)[0]
            for n in range(2, int(max_page) + 1):
                next_page = url_prefix + '/p' + str(n)
                yield response.follow(next_page, callback=self.parse)
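The pagination step above builds URLs like .../p2 through .../pN by pulling the highest page number out of the last pager link. That logic can be exercised on its own, without Scrapy (the example URLs here are illustrative):

```python
import re

def max_page_number(last_page_url):
    # Extract the first run of digits from a pager URL,
    # e.g. '/nha-dat-ban/p82' -> 82; default to 1 if none found.
    numbers = re.findall(r'\d+', last_page_url)
    return int(numbers[0]) if numbers else 1

def page_urls(url_prefix, last_page_url):
    # Build the follow-up page URLs the same way the spider does.
    return [url_prefix + '/p' + str(n)
            for n in range(2, max_page_number(last_page_url) + 1)]

print(max_page_number('/nha-dat-ban/p82'))           # → 82
print(page_urls('/nha-dat-ban', '/nha-dat-ban/p4'))  # → pages p2..p4
```

One caveat with this approach: re.findall returns every digit run in the URL, so if the path itself contains a number the first match may not be the page number.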
Replace <DOMAIN> with your domain. Also, I didn't use an Item class in my code.