Python scrapy tutorial for beginners

If you want to scrape web pages using Python, scrapy is a perfect choice for this purpose. With a few lines of code and a couple of commands you can build a quick and efficient spider.

In order to run scrapy you need Python. If you don't have Python you can download and install it using this tutorial: Download and install python for Ubuntu/Windows

In this post:

  • Install scrapy
  • Create basic spider
  • Running scrapy spider
  • Scrape multiple pages
  • References

Install scrapy

You can install scrapy in several ways depending on your Python setup. The first way is with pip:

pip install scrapy

The second way is with conda:

conda install -c conda-forge scrapy
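
Either way, once the installation finishes you can verify it from the command line (assuming the scrapy executable landed on your PATH):

scrapy version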

In case of module compatibility problems during installation, you can check this article:

Python dependency error or module with incompatible requirements

After a successful installation of scrapy the import:

import scrapy

will be recognized. Otherwise you will get the error:

No module named scrapy

If instead the scrapy command itself cannot be found, the shell reports:

scrapy: command not found

or, on Windows:

'scrapy' is not recognized as an internal or external command
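
If you hit one of these errors, a quick check of whether the interpreter you are actually running can import scrapy (just one way to do it) is:

python -c "import scrapy; print(scrapy.__version__)"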

Create basic spider

Create a new Python file and put this code inside (in my case the file is /home/user/scrapy/testScrapy.py). You will need the file name in order to run the spider:

import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki"
    start_urls = ["https://en.wikipedia.org/wiki/Scrapy"]

    def parse(self, response):
        # the XPath expressions below are absolute (they start with //),
        # so they select from the whole document rather than from one node
        title = response.xpath('//h1/text()').extract()
        subtitle = response.xpath('//h2/span/text()').extract()
        bold = response.xpath('//p/b').extract()
        links = response.xpath('//a/@href').extract()

        yield {
            'title': title,
            'subtitle': subtitle,
            'bold': bold,
            'links': links,
        }

result:

{'bold': ['<b>Scrapy</b>'], 'title': ['Scrapy'], 'subtitle': ['History', 'References', 'External links'], 'links': ['#mw-head', '#p-search', '/wiki/Scrapie', ...

This is an example of how information can be scraped from wikipedia.org.

A few words about the spider:

SELECTOR = '.references'
for test in response.css(SELECTOR):

will iterate only over the references section of the Wikipedia page, so everything extracted inside the loop comes from there.
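
For example, a minimal parse method built around that selector could look like this (the field name ref_links and the ::attr(href) extraction are just an illustration):

    def parse(self, response):
        SELECTOR = '.references'
        for ref in response.css(SELECTOR):
            # collect only the links inside the references section
            yield {'ref_links': ref.css('a::attr(href)').extract()}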

  • response.xpath('//h1/text()').extract() - this extracts the text of the H1 tag via XPath.
  • response.css('h1 ::text').extract() - this extracts the same text using a CSS selector.
  • You can extract all matches with extract() or only the first one with extract_first() (see the short example after this list).
  • yield { 'title': title,... - this yields the item that forms the scrapy output. You define which information you want in the output.
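
To make the difference concrete, here is what the two methods return for the page above (run inside parse, where response is the downloaded page):

response.xpath('//h1/text()').extract()        # list of all matches: ['Scrapy']
response.xpath('//h1/text()').extract_first()  # first match only: 'Scrapy'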

Running scrapy spider

If you want to run a scrapy spider you have to use the special scrapy command, not the usual way of running a Python program. For the spider above the execution looks like this:

scrapy runspider ./testScrapy.py

where testScrapy.py is the name of the Python file created in the previous section. You need to be sure that you are running the version of Python which has scrapy installed (the example below also disobeys the robots.txt file via ROBOTSTXT_OBEY=False):

/home/user/anaconda3/bin/scrapy runspider /home/user/scrapy/testScrapy.py  --set=ROBOTSTXT_OBEY=False
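
scrapy can also write the yielded items straight to a file with the -o option (the file name result.json is just an example):

scrapy runspider ./testScrapy.py -o result.json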

Note: if you get the error

AttributeError: 'module' object has no attribute 'OP_SINGLE_ECDH_USE'

The problem is due to missing or outdated libraries: libssl-dev and pyopenssl. In order to solve it you need to run:

sudo apt-get install libssl-dev

and after that upgrade pyopenssl with:

pip install pyopenssl --upgrade
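
You can confirm the upgrade took effect (pyopenssl is imported under the name OpenSSL):

python -c "import OpenSSL; print(OpenSSL.__version__)"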

Scrape multiple pages

If you want to scrape multiple pages recursively you can do it by adding this code at the end of your parse method:

        # follow the first link found on the page with the same callback
        next_page = response.xpath('//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

this will take the first link found on the page and continue the scraping process from there (see the note below the log output for a more selective approach).

result:

2018-05-02 18:46:17 [scrapy.core.engine] INFO: Spider opened
2018-05-02 18:46:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-02 18:46:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-02 18:46:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Scrapy> (referer: None)
2018-05-02 18:46:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Scrapy>
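
Note that grabbing the first //a/@href will wander off to whatever link happens to come first on each page. In a real crawl you would usually target an explicit pagination link instead; here is a minimal sketch, assuming the site marks its next-page link with rel="next" (many sites do, Wikipedia articles do not, so this is illustrative only):

        next_page = response.css('a[rel="next"]::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

You can also cap a recursive crawl from the command line, for example with --set=CLOSESPIDER_PAGECOUNT=10.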

References