If you want to scrape web pages using Python, Scrapy is a perfect choice for the purpose. With a few lines of code and a couple of commands you can build a quick and efficient spider.
In order to run Scrapy you need Python. If you don't have Python you can download and install it using this tutorial: Download and install python for Ubuntu/Windows
In this post:
- Install scrapy
- Create a basic spider
- Running the scrapy spider
- Scrape multiple pages
- References
Install scrapy
You can install Scrapy in several ways depending on your Python setup. The first way is by using pip:
pip install scrapy
To install Scrapy using conda:
conda install -c conda-forge scrapy
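Either way, you can verify the installation from the command line (assuming the scrapy executable is on your PATH):
scrapy version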
In case of problems with module compatibility during the installation, you can check this article:
Python dependency error or module with incompatible requirements
After a successful installation of Scrapy, the import:
import scrapy
will be recognized; otherwise you will get an error such as:
No module named scrapy
or
scrapy command not recognized
or
'scrapy' is not recognized as an internal or external command
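A quick way to confirm that the module is visible to your interpreter is to try the import from the command line:
python -c "import scrapy; print(scrapy.__version__)"
If this prints a version number, the import above will work from the same Python.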
Create a basic spider
Create a new Python file and put this code inside (in my case the file is /home/user/scrapy/testScrapy.py). You will need the file name in order to run the spider:
import scrapy

class WikiSpider(scrapy.Spider):
    # the name identifies the spider for the scrapy commands
    name = "wiki"
    # the URLs the spider starts crawling from
    start_urls = ["https://en.wikipedia.org/wiki/Scrapy"]

    # called for every downloaded response
    def parse(self, response):
        sel = scrapy.Selector(response)
        sites = sel.xpath('//body/div')
        for site in sites:
            # note: XPath expressions starting with // search the whole
            # document; use .// to search relative to the current node
            title = site.xpath('//h1/text()').extract()
            subtitle = site.xpath('//h2/span/text()').extract()
            bold = site.xpath('//p/b').extract()
            links = site.xpath('//a/@href').extract()
            # print(title, subtitle, bold, links)
            yield {
                'title': title,
                'subtitle': subtitle,
                'bold': bold,
                'links': links,
            }
result:
{'bold': ['<b>Scrapy</b>'], 'title': ['Scrapy'], 'subtitle': ['History', 'References', 'External links'], 'links': ['#mw-head', '#p-search', '/wiki/Scrapie', ...
This is an example of how information can be scraped from wikipedia.org.
A few words about the spider:
- this spider scrapes only one page: https://en.wikipedia.org/wiki/Scrapy
- sel.xpath('//body/div') - selects which part of the page is read and used for extraction. For example:
SELECTOR = '.references'
for test in response.css(SELECTOR):
will get information only from the references section of the Wikipedia page.
- site.xpath('//h1/text()').extract() - extracts the text of the h1 tag via XPath.
- site.css('h1 ::text').extract() - extracts the same text using a CSS selector (see the short example after this list).
- You can extract all matches with extract() or only the first one with extract_first().
- yield { 'title': title, ... } - produces the scrapy output. Here you define the information that you want in the output items.
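For example, here is a short sketch of the two selector styles side by side. You can try it interactively with scrapy shell https://en.wikipedia.org/wiki/Scrapy; the expected values assume the Wikipedia page above:

# XPath and CSS selecting the same element; extract_first() returns one string
response.xpath('//h1/text()').extract_first()   # 'Scrapy'
response.css('h1 ::text').extract_first()       # 'Scrapy'
# extract() returns a list of all matches
response.xpath('//h2/span/text()').extract()    # ['History', 'References', 'External links']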
Running the scrapy spider
To run a scrapy spider you have to use the dedicated scrapy command rather than running it like a normal Python program. This is how to run the spider above:
scrapy runspider ./testScrapy.py
where testScrapy.py is the name of the Python file created in the previous section. Make sure that you are running the version of Python which has scrapy installed (you can also tell the spider to disobey the robots.txt file):
/home/user/anaconda3/bin/scrapy runspider /home/user/scrapy/testScrapy.py --set=ROBOTSTXT_OBEY=False
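By default the scraped items only show up in the log. If you want to keep them, scrapy can also write the items to a file via the -o option (the format is inferred from the extension):
scrapy runspider ./testScrapy.py -o result.json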
Note: if you get the error
AttributeError: 'module' object has no attribute 'OP_SINGLE_ECDH_USE'
the problem is due to missing libraries: libssl-dev and pyopenssl. In order to solve it, install libssl-dev:
sudo apt-get install libssl-dev
and after that upgrade pyopenssl:
pip install pyopenssl --upgrade
Scrape multiple pages
If you want to scrape multiple pages with scrapy recursively, you can do it by adding this code at the end of your parse method:
next_page = response.xpath('//a/@href').extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )
This will take the first link found on the page and continue the scraping process from there.
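Keep in mind that the first @href on a Wikipedia page is often an in-page anchor like #mw-head (see the links list in the result above). A possible refinement, my own variation rather than part of the original spider, is to follow only links that point to other articles:

# follow only links into other wiki articles (hypothetical refinement)
next_page = response.xpath('//a[starts-with(@href, "/wiki/")]/@href').extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )

You can also cap the recursion with the built-in DEPTH_LIMIT setting, e.g. scrapy runspider ./testScrapy.py --set=DEPTH_LIMIT=2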
result:
2018-05-02 18:46:17 [scrapy.core.engine] INFO: Spider opened
2018-05-02 18:46:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-02 18:46:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-02 18:46:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Scrapy> (referer: None)
2018-05-02 18:46:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Scrapy>