In today’s post I am going to show you how to scrape the SERPs with Python and Oxylabs, whose Real Time Crawler API relies on a global proxy pool that prevents Google from banning your IP. I will also walk through several cases where scraping the SERPs can be useful, such as indexation analyses, getting the number of indexed results, finding partners and/or sales opportunities, and more.

Does it sound interesting? Let’s get started then!

1.- How does Oxylabs work?

The Real Time Crawler service works in a very simple way: you only need to make a POST HTTP request with the Requests library to their endpoint and you will get the SERPs data back in the response.

For example:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

print(response.json())

With this request we would scrape the SERPs for the query “adidas” on google.com.

It is worth mentioning that the requests support several parameters which help customize the SERPs scraping (see the example payload after this list). The parameters that I usually work with are:

  • domain: you can introduce any specific Google ccTLD to extract the results from different country versions.
  • query: this enables us to introduce our query. It also accepts Google’s search operators such as “site:”, “intitle:” and so on, as we will see in the practical examples.
  • start_page: the page from which we would like to start the SERPs scraping.
  • pages: number of pages to be scraped.
  • limit: how many results will be displayed on each page.
  • geo_location: the geographical location for which the results will be adapted. You can find and download all the geolocations as a CSV file from the Google AdWords geo-targets documentation.
  • user_agent_type: the type of device. The values that I usually use for this parameter are either “desktop” or “mobile”, although you can also specify a browser as shown on this page, which compiles all the supported user agents.
  • render: you can render the SERPs, which is especially useful to get some of the features that are only accessible once the SERPs are rendered, such as the Google News carousel and other rich snippets. The accepted values for this parameter are “html” or “png” (in case you would like to get a base64-encoded screenshot of how the rendered SERPs look).
  • parse: if this value is “true” the response will be structured in JSON format. If not, it will be returned as HTML.
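
For instance, a payload combining several of these parameters could look like the sketch below. The specific values (the geo_location string, the mobile user agent, the number of pages and the limit) are only illustrative assumptions, so replace them with whatever fits your use case:

import requests

# Illustrative payload combining several of the parameters described above.
payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'start_page': 1,       # start scraping from the first results page
    'pages': 2,            # scrape two pages of results
    'limit': 20,           # 20 results per page
    'geo_location': 'New York,New York,United States',  # assumed value taken from the geo-targets CSV
    'user_agent_type': 'mobile',  # emulate a mobile device
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

print(response.json())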

It is important to clarify that if we need to render the results, we will need to use a different endpoint (https://data.oxylabs.io/v1/queries) and we will receive a URL in the response, from which we can access the results by making a GET request to the URL obtained from the key [“_links”][1][“href”].

Below is an example of how our code would look in order to retrieve the results from a rendered SERPs page:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'render': 'html',
    'parse': 'true'
}

# Submit the scraping job.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

# Retrieve the results from the URL returned under ["_links"][1]["href"].
response2 = requests.request(
    'GET',
    response.json()["_links"][1]["href"],
    auth=('<your_username>', '<your_password>'),
)

response_rendering = response2.json()["results"]

From my point of view, rendering the SERPs is the best way to get the most out of this tool, although if rendering is not necessary we can save some time by only parsing the raw HTML code, since rendering takes a while. In fact, to avoid the code breaking because the page has not been rendered yet when the GET request is made, it is advisable to use a while loop together with the time module and keep retrying until the response is eventually ready:

import requests
import time

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'render': 'html',
    'parse': 'true'
}

# Submit the scraping job.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

# Poll the results URL until the rendered SERP is ready.
response_rendering = None
while response_rendering is None:
    try:
        response2 = requests.request(
            'GET',
            response.json()["_links"][1]["href"],
            auth=('<your_username>', '<your_password>'),
        )

        response_rendering = response2.json()["results"]
    except Exception:
        print("Results not ready yet, trying again")
        time.sleep(10)

2.- What can you get by default?

If we render the SERPs and ask for the data to be delivered in the structured format, we get the following result types by default (a quick way to check which of these blocks a given SERP actually returned is shown right after the list):

  • Paid results: with their positions, URLs, descriptions and titles. Key: response_rendering[0][“content”][“results”][“paid”].
  • Organic results: with their positions, URLs, descriptions and titles. Key: response_rendering[0][“content”][“results”][“organic”].
  • Video results: with their positions, URLs, titles and authors. Key: response_rendering[0][“content”][“results”][“videos”].
  • Top stories: with their positions, sources, URLs and headlines. Key: response_rendering[0][“content”][“results”][“top_stories”].
  • Related searches. Key: response_rendering[0][“content”][“results”][“related_searches”].
  • Related questions: with their answers, source URLs and source titles. Key: response_rendering[0][“content”][“results”][“related_questions”].
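
Not every SERP contains all of these blocks, so before extracting a specific one it can be handy to check which keys are actually present. A minimal sketch, assuming the response_rendering object obtained in the previous section and the key path shown above:

# List the result blocks (paid, organic, videos, top_stories, etc.) returned for this SERP.
available_blocks = list(response_rendering[0]["content"]["results"].keys())
print(available_blocks)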

2.1.- Creating a list with the paid results: URLs, titles and descriptions

We will use a for loop to iterate over the JSON file and transform the results into a list:

paid = []
for x in response_rendering[0]["content"]["results"]["paid"]:
    paid.append([x["pos"],x["url"],x["title"], x["desc"]])

2.2.- Creating a list with the organic results: URLs, titles and descriptions

The same logic is used to get the organic results:

organic_results = []
for x in response_rendering[0]["content"]["results"]["organic"]:
    organic_results.append([x["pos"], x["url"], x["title"], x["desc"]])

2.3.- Creating a list with the video results: URLs, titles and authors

In this case, some of the keys are slightly different as we will extract the authors and the overall positions.

video_results = []
for x in response_rendering[0]["content"]["results"]["videos"]:
    video_results.append([x["pos_overall"], x["url"], x["title"], x["author"]])

2.4.- Creating a list with the top stories results: URLs, headlines and sources

Some of the keys are also different as we extract the headlines and the overall positions.

top_stories = []
for x in response_rendering[0]["content"]["results"]["top_stories"]:
    top_stories.append([x["pos_overall"], x["url"], x["headline"], x["source"]])

2.5.- Creating a list with the related searches

We can also extract the related searches and create a list with them:

# The key already contains a list of related searches, so we can take it directly.
related_searches = response_rendering[0]["content"]["results"]["related_searches"]["related_searches"]

2.6.- Creating a list with the related questions: questions, answers and source URLs

Finally, we can get the related questions, their answers and their source URLs.

related_questions = []
for x in response_rendering[0]["content"]["results"]["related_questions"]:
    related_questions.append([x["pos"],x["question"],x["answer"], x["source"]["url"]])

3.- Some practical cases

SERPs scraping can be used for an endless number of tasks and activities. Some of the tasks that I perform frequently by scraping the SERPs are checking the number of indexed results, indexation analyses, finding new partners, influencers and/or sales opportunities and finding guest-posting opportunities.

3.1.- Number of indexed results

As JC Chouinard explained in this article, it might be interesting to scrape the SERPs to extract the number of pages that are indexed for a query. If we combine this method with Google operators like “intitle:” or “inurl:”, we can get an idea not only of the number of indexed pages for a query, but also of how many pages contain the keyword that we would like to target in their meta titles.

Theoretically, if we include an exact match of the keyword in our meta title, we might have a much better chance of ranking for that specific keyword than pages where the keyword is not mentioned in the title at all. Therefore, in order to evaluate how competitive a keyword is, it might make sense to use the “intitle:” operator and extract the number of indexed results.

This is something we can do with the Real Time Crawler without having to render the SERPs, using BeautifulSoup to parse the returned HTML as shown below:

import requests
from bs4 import BeautifulSoup

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'intitle:"buy white shoes"',
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
div_results = soup.find("div", {"id": "result-stats"})
indexed_results = int(div_results.text.split("About")[1].split("results")[0].replace(" ","").replace(",",""))

With this technique, we can analyze the competitiveness of lots of keywords in bulk and allocate our efforts to the less competitive keywords whose ROI might be higher.
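
As a rough sketch of that bulk approach, we could simply loop the previous logic over a list of keywords and store the number of indexed results for each one (the keywords list below is only an illustrative assumption):

import requests
from bs4 import BeautifulSoup

# Hypothetical list of keywords to evaluate.
keywords = ["buy white shoes", "buy black shoes", "buy running shoes"]

keyword_competition = []
for keyword in keywords:
    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'intitle:"' + keyword + '"',
    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('<your_username>', '<your_password>'),
        json=payload,
    )

    # Extract the "About X results" figure, as in the previous example.
    soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
    div_results = soup.find("div", {"id": "result-stats"})
    indexed_results = int(div_results.text.split("About")[1].split("results")[0].replace(" ", "").replace(",", ""))

    keyword_competition.append([keyword, indexed_results])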

3.2.- Indexation analyses

Another task that can be done with proxies and the “site:” operator is indexation analysis (unfortunately, Google Search Console only provides a limited sample of the indexed URLs from a site). For these analyses we will not need to render the SERPs; we will basically use the “parse” parameter to get the data as a JSON file. If there are no indexed URLs, the organic results will be empty.

For instance, with this piece of code I was able to retrieve all the results that are indexed from my site:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:danielherediamejias.com',
    'parse':'true',
    'start_page':1,
    'limit':100,
    'pages':3
    
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

number_indexed_results = len(response.json()["results"][0]["content"]["results"]["organic"])

indexed_urls = []
for x in response.json()["results"][0]["content"]["results"]["organic"]:
    indexed_urls.append([x["url"]])

We can also introduce a bunch of URLs from a sitemap, for example, and check whether they are indexed or not in order to understand the indexation coverage of that sitemap (see the sketch after the next code block for looping over several URLs). For this, we will need to retrieve the results from the SERPs and check whether they exactly match the original URL, since there can be cases where the root URL is not indexed but some of its subdirectories are.

import requests

url = "<your_url>" 

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:' + url,
    'parse':'true',
    'start_page':1,
    'limit':100,
    'pages':3
    
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)


indexation = False
for x in response.json()["results"][0]["content"]["results"]["organic"]:
    try:
        if x["url"].endswith(url):
            indexation = True
            break
    except:
        pass
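
Building on the previous snippet, checking the coverage of a whole sitemap is just a matter of repeating that check for every URL on a list. A minimal sketch, assuming sitemap_urls is a hypothetical list of URLs you have already extracted from the sitemap:

import requests

# Hypothetical list of URLs extracted from a sitemap.
sitemap_urls = ["<your_url_1>", "<your_url_2>", "<your_url_3>"]

indexation_coverage = []
for url in sitemap_urls:

    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + url,
        'parse': 'true',
        'start_page': 1,
        'limit': 100,
        'pages': 3
    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('<your_username>', '<your_password>'),
        json=payload,
    )

    # The URL is considered indexed only if it shows up as an exact match.
    indexation = False
    for x in response.json()["results"][0]["content"]["results"]["organic"]:
        try:
            if x["url"].endswith(url):
                indexation = True
                break
        except:
            pass

    indexation_coverage.append([url, indexation])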

Regarding these indexation analyses, we can also introduce an initial domain, use the “site:” operator, extract all the URLs provided by Google (around 300), store them and iterate again over all the URLs that have been extracted. This is especially useful for very large websites: it will give you an idea of all the URLs that are indexed from a site, and you might discover pages and sections that were not supposed to be indexed.

Basically, what we will do is create a list with all the URLs and append new URLs to it as we iterate over the list and get more results from Google’s index:

import requests

list_url = ["<your_initial_url>"]

for iteration in list_url:
    
    print(iteration)
    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + iteration,
        'parse':'true',
        'start_page':1,
        'limit':100,
        'pages':3

    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('<your_username>', '<your_password>'),
        json=payload,
    )


    # Add any URL that is not in the list yet so it gets crawled too.
    for x in response.json()["results"][0]["content"]["results"]["organic"]:
        try:
            if x["url"] not in list_url:
                list_url.append(x["url"])
        except:
            pass

This process is quite proxy-consuming because we will essentially run the “site:” operator for every page that is found, and in some cases it can be inaccurate if the number of indexed pages from a directory exceeds 300 results (which is usually the maximum number of results that Google will return for a query).

3.3.- Finding new partners, influencers and/or sales opportunities

The proxies can be used not only for SEO purposes, but also for finding email addresses to collaborate with new partners and/or influencers or to find new sales opportunities.

How can we do this? We can use operators like “site:” and “intext:” to search for indexed results from a website that contain email addresses. For instance, if we would like to find YouTube channels with email addresses that might be open to collaborations, we could use the search pattern: “site:youtube.com inurl:/about/ intext:@gmail.com”.

First, we use the proxies to extract the URLs:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:youtube.com inurl:/about/ intext:@gmail.com',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

list_url = []
for x in response.json()["results"][0]["content"]["results"]["organic"]:
    try:
        list_url.append([x["url"]])
    except:
        pass

Then, once we have extracted the indexed URLs, we can load them with Selenium and pull out the email addresses with a regular expression, building our own database so that we can contact them:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
import re

driver = webdriver.Chrome(ChromeDriverManager().install())

for iteration in range(len(list_url)):

    driver.get(list_url[iteration][0])

    # On the first page load, click the consent/accept button (adjust the XPath if the dialog changes).
    if iteration == 0:
        consent_button = driver.find_element_by_xpath('/html/body/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span')
        consent_button.click()

    # Extract every email-looking string from the rendered page and deduplicate it.
    html = driver.page_source
    email_addresses = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html)
    email_addresses = list(dict.fromkeys(email_addresses))
    list_url[iteration].append(email_addresses)

    time.sleep(2)

driver.close()

3.4.- Finding Guest-posting opportunities

Another activity that can be automated with proxies is the discovery of new guest-posting opportunities. This task is pretty straightforward: you only need to extract the URLs returned for queries such as “write with us”, “write for us”, “guest posting policy”, “guest posting rules”, “guest posting guidelines”, “we accept guest post”, “submit a guest post”…

If you would like to narrow down the results, you can also add a representative term about the topic of the websites that you are looking for:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': '"write with us" digital marketing',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

list_url = []
for x in response.json()["results"][0]["content"]["results"]["organic"]:
    try:
        list_url.append([x["url"]])
    except:
        pass

As an extra step, we can check whether those websites serve DoubleClick ads. If they do, it is quite possible that they are monetizing their website and might be open to guest posting on their sites.

You can check for Google DoubleClick ads with:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

driver = webdriver.Chrome(ChromeDriverManager().install())

for iteration in range(len(list_url)):
       
    driver.get(list_url[iteration][0])
    html = driver.page_source
    
    guest_posting = False
    if "doubleclick" in html:
        guest_posting = True
    
    list_url[iteration].append(guest_posting)
    
    time.sleep(2)
    
driver.close()

In addition, if you would like to categorize the extracted websites by topic in bulk, you can have a read of this post where I explain how to use Python and the Google NLP API for website categorization.

3.5.- Recruitment

Last but not least, proxies can also be used for recruitment and for finding suitable candidates for open positions. If, for instance, we were searching for a candidate to fill an SEO position in Barcelona who needs to be able to code with Python, we could use the query: site:es.linkedin.com/in/ intitle:”seo” intext:barcelona intext:python.

Therefore, getting the profiles from Google’s index would be quite easy with:

import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:es.linkedin.com/in/ intitle:"seo" intext:barcelona intext:python',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)

list_url = []
for x in response.json()["results"][0]["content"]["results"]["organic"]:
    try:
        list_url.append([x["url"]])
    except:
        pass

4.- Alternatives to scrape the SERPs for free

There are some alternatives to scrape the SERPs for free, although unfortunately they do not offer as many features as Oxylabs does.

4.1.- Googlesearch Python library

Mario Vilas created a library that enables you to scrape the SERPs without using proxies. However, after around 20 requests it is likely that Google will ban your IP and you will not be able to keep scraping.

This piece of code will return the URLs showing up for the query “adidas” on the Spanish version of Google:

from googlesearch import search
for url in search('adidas', tld='es', lang='es', stop=20):
    print(url)

4.2.- Google Custom Search API

Koray Tuğberk explained in this article how to use the Google Custom Search API via advertools to retrieve the results turning up for a specific query. You will only need to create a project in the Google Developer Console, get your credentials and enable the Google Custom Search API.

Once you have obtained your credentials, you can make use of this API with this piece of code:

import advertools as adv
api_key, cse_id = "YOUR API KEY", "YOUR CSE ID"
adv.serp_goog(key=api_key, cx=cse_id, q="Example Query", gl=["example country code"])

Unfortunately, this API does not support most of the SERP rich snippets or search operators, so if you would like to run a more extensive analysis, you might need to use a premium tool like the Real Time Crawler from Oxylabs.

4.3.- Chrome extensions

If you need to scrape the SERPs but only for a handful of queries, you can also make use of Chrome extensions to extract the indexed results. For example, you can use Web Scraper, as I explained in this article, to scrape the SERPs with a Google Chrome extension.

5.- FAQ section

Can I use the Real Time Crawler from Oxylabs with Python?

Yes, you can use the Real Time Crawler from Python: you only need to make HTTP requests to its endpoints with the Requests library, as explained in this blog post.

What can I do with proxies for SEO?

Rankings monitoring for organic, paid, top stories and video results, indexation analyses, getting the number of indexed pages for a query, finding guest-posting opportunities…

How long will it take me to use Oxylabs with Python?

Not long, as you can reuse most of my code samples, although you might need to make some small changes.