In today’s post I am going to show you how you can use Oxylabs’ Real Time Crawler API and Python to scrape the SERPs, extract the metatitle that shows up on the SERPs for a page, and compare it with your on-page metatitle and H1 to analyze whether Google is rewriting your metatitles. The final output of this script will be an Excel file like the one in the screenshot below (without the conditional formatting):

If you are not familiar with Oxylabs’ API, you can read this article, where I explain how to use it together with Python to scrape the SERPs. You will learn how the API works, what type of data you can obtain from it, and how to get the most out of it for SEO with some practical cases.

Having said this, let’s get started with the metatitles checker!

1.- How does the script work?

Essentially, we will use Oxylabs’ API to scrape the SERPs with the “site:” operator for a list of given URLs, and then we will scrape the URLs themselves to extract the on-page metatitles and H1s (as Google is likely to use the H1 when it does not use the actual metatitle).

It might be interesting to do this exercise with the URLs from a sitemap to find out the following (a short sketch after this list shows how to filter the final output for each case):

  • What URLs are not indexed at all, so that you can take action on them: in the short term, manual indexation can be requested through Google Search Console. In the long term, other actions might be required, ranging from sanity checks to make sure that the pages are readable by Googlebot, to on-page optimizations, to the creation of internal and/or external links to increase their PageRank, etcetera.
  • What URLs show a different metatitle on the SERPs from the on-page title: we can analyze why Google might be changing the metatitle and what alternative it considers better for the user. On my site I noticed that Google was in many cases removing the trailing site name “- Daniel Heredia”, with the intention of shortening metatitles that were already quite long.
  • What URLs are showing the H1 as the metatitle on the SERPs: in those cases I would recommend going over the H1s and optimizing them if there is any way to make them more appealing to users. This can be an enriching process for sites that optimized their H1s mainly for search engines, including unnatural exact matches based on an H1 pattern, while disregarding them from a user experience perspective.
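
To make these three checks concrete, here is a minimal sketch of how the final Excel file (generated at the end of this post) could be filtered into the three groups; the column names match the export step below and the filename is a placeholder:

import pandas as pd

# A minimal sketch, assuming the Excel export described at the end of the post
df = pd.read_excel("<filename>.xlsx")

not_indexed = df[df["Indexation"] == False]                 # no "site:" result at all
rewritten = df[(df["Indexation"] == True) &
               (df["Metatitle Coincidence"] == False)]      # SERP title differs from the on-page one
h1_as_title = df[df["H1 - metatitle Coincidence"] == True]  # Google picked the H1 instead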

For the script we will use the following libraries:

  • Requests: to make the request to Oxylabs’ endpoint and scrape the SERPs.
  • Cloudscraper: to scrape the URLs. This could also be done with Requests, but Cloudscraper is more reliable with sites that use Cloudflare. In the guide to SEO on-page scraping with Python I introduced Cloudscraper and explained how to scrape metatitles and H1s alongside the rest of the SEO elements that can be valuable for SEO.
  • BeautifulSoup: to parse the responses that we receive from our requests with Cloudscraper.
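
If any of them are missing from your environment, they can all be installed with pip (the package names are requests, cloudscraper, beautifulsoup4 and pandas; the last one is used at the end to export the results).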

2.- Using the script

First, we need to import the list of URLs that we are going to check. We can use Requests and BeautifulSoup to extract them easily from a sitemap, although they could be imported from any other data source.

from bs4 import BeautifulSoup
import requests

# Download the sitemap and parse out every <loc> entry
r = requests.get("https://www.yoursite.com/sitemap.xml")
xml = r.text

soup = BeautifulSoup(xml, "html.parser")
urls_list = [x.text for x in soup.find_all("loc")]
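
One caveat worth flagging: if the file is a sitemap index (a sitemap whose <loc> tags point to other sitemaps rather than to pages), the snippet above will collect sitemap URLs instead of page URLs. Here is a minimal sketch to handle that case, reusing the soup object from above:

# A minimal sketch, assuming the soup object parsed above.
# If the file is a sitemap index, fetch each child sitemap and
# collect the page URLs from those instead.
if soup.find("sitemapindex"):
    urls_list = []
    for sitemap_url in [x.text for x in soup.find_all("loc")]:
        child_soup = BeautifulSoup(requests.get(sitemap_url).text, "html.parser")
        urls_list += [x.text for x in child_soup.find_all("loc")]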

After importing the URLs, we only need to run the script, which scrapes the SERPs with Oxylabs and then the URLs themselves:

import cloudscraper

scraper = cloudscraper.create_scraper()
list_comparison = []

for url in urls_list:
    indexation = False
    metatitle_coincidence = False
    metatitle_coincidence_h1 = False

    # Ask Oxylabs for the parsed SERP of a "site:" search for this URL
    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + url,
        'parse': True,
    }

    response = requests.post(
        'https://realtime.oxylabs.io/v1/queries',
        auth=('<your_username>', '<your_password>'),
        json=payload,
    )

    for x in response.json()["results"][0]["content"]["results"]["organic"]:
        try:
            # Only compare the organic result that matches the URL we searched for
            if x["url"].endswith(url):
                indexation = True

                # Scrape the page itself to get the on-page metatitle and H1
                html = scraper.get(url)
                soup = BeautifulSoup(html.text, "html.parser")
                metatitle = soup.find('title').get_text()
                h1 = soup.find('h1').get_text()

                if x["title"] == metatitle:
                    metatitle_coincidence = True

                if x["title"] == h1:
                    metatitle_coincidence_h1 = True

                list_comparison.append([url, indexation, x["title"], metatitle, h1, metatitle_coincidence, metatitle_coincidence_h1])
                break
        except Exception:
            # Skip results without the expected fields or pages missing a title/H1
            pass

    if not indexation:
        list_comparison.append([url, indexation, "", "", "", metatitle_coincidence, metatitle_coincidence_h1])

This will generate a list that contains the URLs, their indexation statuses, the SERP metatitles, the on-page metatitles, the on-page H1s and two boolean variables that indicate whether the metatitle from the SERPs is equal to the on-page metatitle and/or the on-page H1.
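
For example, an indexed URL whose SERP metatitle matches the on-page metatitle but not the H1 would produce a row like [url, True, "SERP title", "SERP title", "Some H1", True, False].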

We can now export this list with Pandas as an Excel file:

import pandas as pd

df = pd.DataFrame(list_comparison, columns=["URL", "Indexation", "Metatitle SERPs", "Metatitle", "H1", "Metatitle Coincidence", "H1 - metatitle Coincidence"])
df.to_excel('<filename>.xlsx', header=True, index=False)

This will return an Excel file that will look like the one shown in the screenshot at the beginning of the post.
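
The conditional formatting itself is not generated by the script, but if you would like something similar straight from Python, here is a minimal sketch using Pandas’ Styler (it assumes openpyxl and Jinja2 are installed, reuses the df from the previous step, and the colors are just an example):

def highlight_boolean(value):
    # Green fill for True, red fill for False (example colors)
    return "background-color: #c6efce" if value else "background-color: #ffc7ce"

# Apply the styling only to the three boolean columns and export to Excel
styled = df.style.applymap(highlight_boolean, subset=["Indexation", "Metatitle Coincidence", "H1 - metatitle Coincidence"])
styled.to_excel('<filename>.xlsx', index=False)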

That is all folks, I hope that you found this post interesting!