Web Crawling with Python

Ayorinde Olanipekun, PhD
9 min read · Apr 11, 2023


Image by vedanti on Pexels

Web Crawling Definition

Web Crawling is also known as indexing. It is the automated gathering of information by scripts that move across the World Wide Web in a methodical, predetermined way. Software tools known as web crawlers, spiders, or bots are usually used to index content from all over the Internet.

In this tutorial, you will learn how to make use of a Python framework called “Scrapy” to handle large amounts of data. You will use Scrapy by building a web spider for https://toscrape.com, a sandbox site that hosts a fictional bookstore.

Use cases of Web Crawling

1. Search Engine application: Web crawling is used to index pages for search engines, which helps them present appropriate results for queries. It also affects a website’s search engine optimization (SEO) by giving search engines like Google information on whether your material is original or a direct copy of other online material.

2. Application in E-commerce

Real-time monitoring of competitors’ prices: a highly important part of e-commerce is consistently offering products at low prices. Some websites actively crawl specific items from their rivals’ websites to see whether they can match or beat the prices offered there.

3. Product Performance Intelligence

Nowadays, e-commerce sites actively watch their competitors to stay one step ahead. Suppose, for instance, that Alibaba would like to know how well its items are selling in comparison to Amazon. To detect the gaps in its catalogue, it would crawl the online product catalogues of both websites.

Introduction to Scrapy

Scrapy is an open-source Python web crawling and web scraping framework used to extract data from websites. It provides a set of tools and libraries for developers to build scalable and reliable web crawlers. Scrapy is designed to be highly modular and extensible, allowing developers to customize its functionality to meet their specific needs. It provides a wide range of features, including:

  1. Built-in support for common protocols and formats: Scrapy provides built-in support for HTTP, HTTPS, FTP, and S3, as well as common data formats like JSON, XML, and CSV.
  2. Asynchronous requests: Scrapy can make multiple requests in parallel, which can greatly improve the speed of the crawling process.

3. Item pipelines: Scrapy allows you as a developer to define pipelines that can process the data extracted by the spider, including cleaning, validation and storage.

4. Extensibility: Scrapy is highly extensible, allowing developers to add their own functionality and customizations as needed.

Scrapy is widely used in a variety of applications, including data mining, information processing, and monitoring. It is also used in academics for research purposes, and in the industry for tasks like price monitoring, content aggregation, and search engine indexing.

Overall, Scrapy is a powerful and flexible web scraping and crawling framework that can be used to extract data from websites quickly and efficiently.

Overview of How the Scrapy Library Works

One of the major libraries to be installed is Scrapy. Scrapy is a powerful Python package for web crawling and web scraping. It serves as a set of tools for developers to build web crawlers that can extract data from websites in a structured way. Scrapy is based on the idea of spiders, which are programs that can crawl websites and extract data from them. The spider starts by visiting a given URL and then follows links to other pages to continue crawling the website. As the spider crawls the website, it can extract data using a variety of techniques, including XPath selectors, regular expressions, and CSS selectors.
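For a quick sense of what these selectors look like, here is a hypothetical minimal spider (the spider name, URL and selectors below are illustrative only; the tutorial’s real spider is built step by step later):

```python
import scrapy


class SelectorExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate CSS and XPath selectors.
    name = "selector_example"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # The same data can be selected with CSS ...
        titles_css = response.css("h3 a::attr(title)").getall()
        # ... or with XPath.
        titles_xpath = response.xpath("//h3/a/@title").getall()
        yield {"css_titles": titles_css, "xpath_titles": titles_xpath}
```

A standalone file like this can be run with scrapy runspider, without creating a full project.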

Writing the Crawler

To build a Python web crawler, you will need to install Python.

Installing Python

Up-to-date Python downloads can be found at python.org/downloads, along with documentation for all supported platforms. The latest release at the time of writing this article is Python 3.11.2. On the website, select the Download Python button for your operating system; in our case, the operating system is Windows. Download the Windows installer and check “Add Python to PATH” when installing, so that Python and pip commands entered at the terminal are recognized by Windows automatically. Pip is the package manager for Python packages and modules. Note that Python 3.4 and later versions come with pip by default, so it is usually not necessary to install pip manually. With Python installed, you are ready to set the ball rolling on building your first Python web crawler.

Image from Python

To check which version of Python is installed, execute the following command:
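On Windows, macOS or Linux the check looks like this (you may need to use python3 instead of python, depending on how it was installed):

```
python --version
```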

If Python is already installed, the command above will show you the installed version.

All the code is written in the Visual Studio Code (VS Code) Integrated Development Environment (IDE). VS Code is simple to download, completely free, and easy to customize with many useful extensions.

Creating a Basic Scraper

If you already have Python installed on your machine, you can install Scrapy with the following command executed in the VS Code terminal.
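Assuming pip is available on your PATH, the installation is:

```
pip install scrapy
```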

Creating a new Python project
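In the VS Code terminal, run Scrapy’s startproject command (the project name used throughout this tutorial is myproject):

```
scrapy startproject myproject
```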

This creates a new directory called “myproject” containing the basic structure of a Scrapy project; your code and results will be stored there. The directory holds several Python scripts and configuration files. The output is shown in the image below:

Image by the Author

The contents of the “myproject” folder can be viewed directly in the Explorer panel on the left side of VS Code:

Image by the Author

Crawling the Web Page

The next step will be to navigate to the project directory and create a new Spider for the website to be crawled.
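Assuming the standard Scrapy workflow, the commands are as follows (the -t crawl option generates a CrawlSpider template, which is the spider type used below):

```
cd myproject
scrapy genspider -t crawl myspider toscrape.com
```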

This will create a new spider file called “myspider.py” that will crawl the “toscrape.com” domain. toscrape.com is the fictional website used in this tutorial; it provides a sandbox for web scraping and helps developers validate their scraping techniques.

Image by the Author

Open “myspider.py” in your VS Code editor and modify it as follows to crawl the catalogue and category data for the books:

Image from toscrape.com
Image by the Author
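As a rough sketch, a CrawlSpider along the lines described below might look like this (the start URL, link-extraction patterns and placeholder callback are assumptions based on the books.toscrape.com catalogue; the exact code may differ):

```python
# myspider.py - sketch of the CrawlSpider described in this section.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "myspider"                    # unique name used to run the spider
    allowed_domains = ["toscrape.com"]   # only this domain will be crawled
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # Follow the book category links and keep crawling from them.
        Rule(LinkExtractor(allow=r"catalogue/category/"), follow=True),
        # Catalogue pages outside a category are handed to parse_item.
        Rule(
            LinkExtractor(allow=r"catalogue/", deny=r"catalogue/category/"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # For now, just record the crawled URL; field extraction is added later.
        yield {"url": response.url}
```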
  1. CrawlSpider: the most commonly used spider class for crawling regular websites. It defines which follow-up links are crawled and is generally adaptable to different websites.
  2. link_extractor: a Link Extractor object that defines how links are extracted from each crawled page. A Request object is generated for every extracted link and carries the link’s text.
  3. name: the name of the spider. A unique name should be chosen; in our case, we chose “myspider”. This name is used to run the spider.
  4. allowed_domains: the domains that may be crawled. Requests for URLs outside this list will not be followed. You should list only the domain of the website, for example toscrape.com, and not the whole URL, to avoid errors and warnings.
  5. start_urls: the URLs from which the spider starts crawling.
  6. rules: each Rule specifies a particular way to crawl the website. If multiple rules match the same link, the first matching rule is applied.

This spider will extract the book catalogue of different categories from each page it crawls. To run the spider, navigate to the terminal window in VS Code and run the following command:
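From inside the project directory:

```
scrapy crawl myspider
```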

This will start the spider and begin crawling the https://toscrape.com domain. The command runs the spider crawler we just wrote: it makes requests, fetches the HTML from the start_urls, and collects the catalogue and category links.

You can modify this code to extract different types of information from the pages. The long output appears in the terminal window.

Image by the Author
Image by the Author

The scraper initializes and loads additional extensions and components to handle reading data from URLs.

Extracting Multiple Data Items from a Page

parse(self, response): the parse method is responsible for processing the response and returning scraped data. It is also known as the callback function: parse(response) is used to process downloaded responses and acts as the default callback.

A second rule is defined to crawl catalogue pages that do not contain a category. The callback is set so that, when matching URLs are found, each response is passed to a function called parse_item, which handles the extraction for that Rule.

The crawler follows every link that matches the rule and passes the response to parse_item, which extracts the output.

You can then right-click and inspect the page in Google Chrome to find the elements and classes that hold the Title, Price and Availability. While inspecting the elements on the web page, you will see that all the information needed is wrapped in an article tag. Therefore, we have to loop through each article tag and extract the required information.

In this example, we intend to scrape all the information about Title, Price and Availability:

Image from toscrape.com
Image from toscrape.com
Image from toscrape.com
Image by the Author
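A sketch of the parse_item callback inside the spider class, looping over each article tag, might look like this (the CSS classes such as product_pod and price_color reflect the books.toscrape.com markup and are assumptions; the exact code may differ):

```python
    def parse_item(self, response):
        # Each book listing is wrapped in an <article class="product_pod"> tag.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "availability": " ".join(
                    book.css("p.availability::text").getall()
                ).strip(),
            }
```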

Executing the spider code again:

Image from the Author

After executing the above command, you will see output in the terminal that looks something like this:

Image from the Author

All the code for the crawler is written below:
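A consolidated sketch of the crawler built in this tutorial is given below (the exact version in the repository linked at the end of this article may differ slightly):

```python
# myspider.py - consolidated sketch of the crawler built in this tutorial.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # Rule 1: follow the category links.
        Rule(LinkExtractor(allow=r"catalogue/category/"), follow=True),
        # Rule 2: catalogue pages without a category go to parse_item.
        Rule(
            LinkExtractor(allow=r"catalogue/", deny=r"catalogue/category/"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Each book listing is wrapped in an <article class="product_pod"> tag.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "availability": " ".join(
                    book.css("p.availability::text").getall()
                ).strip(),
            }
```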

Exporting the data to a JSON file

To store the crawled data, you may consider using JSON. The JSON format enables organized, recursive storage, and Python’s json module provides all the parsing tools needed to get this data into your application.

Run the following command in the terminal:
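Scrapy’s built-in feed exports can write the scraped items straight to JSON; the filename books.json here is only an example:

```
scrapy crawl myspider -o books.json
```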

Image from the Author

Now sit back and watch your JSON file fill with data! This generates a .json file containing all the crawled data.

Image from the Author

Conclusion

In this tutorial, you have learnt about web crawling with Python using the Scrapy library. We went through writing spiders in Scrapy. From generating project files and folders to resolving duplicate URLs, Scrapy takes care of the heavy lifting for you. It enables you to start doing powerful web scraping in minutes and offers support for all widely used data formats, so the output can be used in other applications. For more information on Scrapy, check the official documentation (https://docs.scrapy.org/en/latest/).

You can check the link for the code used: https://github.com/AyorindeTayo/Web-Crawling-with-Python

Lastly, a problem that can be encountered when carrying out web crawling, especially at large scale, is getting blocked from certain pages because your IP address has been flagged for sending too many requests. To prevent this, you can use a third-party proxy service. Reach out to us for services relating to web crawling and web scraping at https://tubular-alpaca-280cfd.netlify.app/.
