Introduction

Python is a language with libraries that make web scraping easy. In this post, I will walk you through the basic libraries you might use if you are learning how to scrape the web with Python, and I will share some tips I have learnt from my own experience that I believe can be helpful for beginners.

However, first of all I would like to share when I think writing a Python script to scrape is worth it, as it is not always the best option. It might be the best option if you:

  • Plan to scrape data on a regular basis and can schedule or reuse the same script.
  • Need to manipulate the data you are scraping, for example to store it in a database.
  • Need an ad hoc script because no existing solution on the market meets your needs.

If you do not need to extract data on a regular basis or apply very sophisticated data manipulation, there are tools on the market that may already meet your needs without you having to code your own scripts. Some of these tools are:

  • Screaming Frog: probably one of the most famous SEO tools. It already scrapes a long list of default tags, and you can add tailored extractions with the Custom Extraction option and XPath.
  • Scrape Box: a very powerful tool which can be used for custom scraping and other activities such as scraping SERPs, extracting email addresses, checking the quality of websites…
  • URL Profiler: similar to Scrape Box but not as complete. It is a very good option for scraping email addresses and contact pages and for checking the quality of a batch of websites.
  • Google Chrome Extensions: there are many Google Chrome extensions which are very useful and can make our day easier. For scraping, the best one is “Web Scraper”, as it is very easy to use: you only need to select with your mouse the elements you would like to scrape.

Having said this, if you still consider that the best option for you is writing an ad hoc script with Python, it is time for some tips!

1.- Beautiful Soup and Requests

The Requests library lets you fetch the pages you are interested in scraping, and Beautiful Soup helps you parse the HTML and turn it into a more navigable format.

Installing these libraries is very easy; you only need to run these commands:

pip install beautifulsoup4
pip install requests

Once these libraries are installed, a basic example of scraping a page would be:

import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.example.com')
soup = BeautifulSoup(page.text, 'html.parser') #You can change your parser
soup.find('h1').get_text() #Returns the text inside the first H1 tag.
soup.find_all('p') #find_all returns a list with every matching tag, in this case all the paragraphs.
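
For instance, building on the same soup object, you can loop over the list that find_all returns and extract text or attributes; the selector below is just an illustration:

for link in soup.find_all('a'): #Loop over every link found in the page.
    print(link.get_text(), link.get('href')) #Print the anchor text and the href attribute.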

2.- Sleep and Random Randint

The sleep function helps you avoid overloading the server with too many requests in a short period of time. Basically, with sleep you can pause the script for a certain period of time, so if you are making requests iteratively you do not overwhelm the server.

In addition, the random.randint method can be used together with the sleep function: random.randint generates a random number within a given range, so the script pauses for a different period of time before each request. In this way, the server will not see a clear, regular pattern in the requests and it will be harder to get banned.

In the code below you can find an example of how sleep and random.randint work together:

import random
import time
time.sleep(random.randint(5,16)) #Script will be stopped for a period of time ranging from 5 to 16 seconds.
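
As a minimal sketch of how this fits into a real scraper, you could pause for a random interval between requests while iterating over a list of URLs; the URLs below are just placeholders:

import random
import time
import requests

urls = ['https://www.example.com/page-1', 'https://www.example.com/page-2'] #Placeholder list of pages to scrape.
for url in urls:
    response = requests.get(url)
    #Parse response.text here with Beautiful Soup.
    time.sleep(random.randint(5, 16)) #Random pause so the requests do not follow a regular pattern.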

3.- Change User Agent 

Some websites have more sophisticated technical set-ups, such as a prerender version or dynamic rendering, which can make the content served on a page vary. For this reason, it can be a good idea to change the default User Agent used by the Requests library to a different one, such as Googlebot or an Android browser, to check which content is served to Googlebot or to a mobile device.

Below you can find an example of how a new User Agent can be set with Requests:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}
response = requests.get('https://www.example.com', headers=headers)
#This code would request the page pretending to be Googlebot.
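
The same approach works if you want to check the content served to a mobile device; the string below is just an illustrative Android Chrome User Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36'
}
response = requests.get('https://www.example.com', headers=headers)
#This code would request the page simulating an Android mobile browser.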

4.- Use of proxies

In some cases, not even the sleep function, random.randint or a different User Agent will be enough to stop the server from banning your IP. In that case, you might need to use proxies so that the requests appear to come from a different IP.

Fortunately, the Requests library also accepts a proxies parameter when the request is made, although some proxy suppliers have their own APIs for making the requests:

import requests
proxies = {
  'http': 'http://10.10.1.10:3128'
}
response = requests.get('https://www.example.com', proxies=proxies)
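
If a single proxy is not enough, a simple option is to rotate through a small pool of them and pick one at random for each request; the IPs below are just placeholders for whatever your proxy supplier gives you:

import random
import requests

proxy_pool = [
    {'http': 'http://10.10.1.10:3128'}, #Placeholder proxies; replace them with your supplier's.
    {'http': 'http://10.10.1.11:3128'},
]
response = requests.get('https://www.example.com', proxies=random.choice(proxy_pool))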

5.- Inspect Elements

It can happen that you are not able to access some elements in the code because they are loaded with an onclick event via AJAX or inside a frame. In that case, before panicking and working on a more sophisticated scraper, first check the problematic elements with “Inspect” and see how they are built.

One example of AJAX elements is this page: https://www.inforesidencias.com/centros/residencia/4660/residencial-castellon-servicios-integrales-3a-edad. As you can see, if you want to see the website or the phone number, you need to click on them. However, we can scrape the phone number and the website directly by accessing the resources they are loaded from:

https://inforesidencias.com/centros/datos-ajax/4660/telefono and https://inforesidencias.com/centros/datos-ajax/4660/web.

A good example of scraping frames can be found at https://www.mga.org.mt/licences/. The frame is hosted at the URL https://mgalicenseeregister.mga.org.mt/index1.aspx, and if we want to scrape the data from the licenses, we can do it easily by using the parameters of the original frame, for example: https://mgalicenseeregister.mga.org.mt/Results1.aspx?Licencee=1×2+Network+Malta+Limited+&Class=&Status=&URL=.
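
As a sketch, assuming those AJAX endpoints return a small HTML fragment, the phone number from the first example could be requested directly with Requests and Beautiful Soup:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://inforesidencias.com/centros/datos-ajax/4660/telefono') #The resource the phone number is loaded from.
phone = BeautifulSoup(page.text, 'html.parser').get_text().strip()
print(phone)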

6.- CloudFlare

Last but not least, you could come across a website which is protected with CloudFlare technology. In that case, the Python Requests library will not work and will always return an error.

In those cases where you need to scrape a site behind CloudFlare, you can use the cloudscraper library. It works in a very similar way to the Requests module, as it accepts most of the parameters used in Requests, and it is able to access websites protected with CloudFlare.

Installation is very simple, and it depends on the Requests module:

pip install cloudscraper
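
A minimal sketch of how it is used: create_scraper returns an object that behaves like a Requests session, so you can call get on it as usual.

import cloudscraper

scraper = cloudscraper.create_scraper() #Behaves like a Requests session but handles the CloudFlare challenge.
response = scraper.get('https://www.example.com')
print(response.text)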

In addition, if you are interested in scraping email addresses from a website protected with CloudFlare without having to render the page with Selenium, you can use the following function, which decodes the mechanism that CloudFlare uses to obfuscate email addresses.

#Encoded slug looks like: "/cdn-cgi/l/email-protection#0362676e6a6d436262606c6d7077717660776a6c6d746c7168702d606c6e"
def decodeEmail(e):
    de = ""
    k = int(e[:2], 16) #The first two hex characters are the XOR key.

    for i in range(2, len(e), 2):
        de += chr(int(e[i:i+2], 16) ^ k) #XOR each following byte with the key to recover one character.

    return de

print(decodeEmail("0362676e6a6d436262606c6d7077717660776a6c6d746c7168702d606c6e"))