Last week, Google Search Console released a very interesting new feature that lets us check the crawl statistics for a GSC property, and the first thing I thought was: this report is pure gold, so there must be a way to automate the data extraction in order to visualize these statistics with some graphs and create alerts to flag crawl errors such as pages returning a 404 response code or sudden drops in crawl requests!

If you have not had the chance to look at this report yet, I strongly recommend you do so and play around with the feature, as you will be able to come up with nice insights about how your website is crawled: the most crawled pages, the user agents used to crawl it, the resources which are not returning a 200 response code, and so on. On the GSC sidebar, go to the Settings section and click on the “Open the report” button on the Crawl Statistics tab. You should then see something like this:

However, as far as I can tell, there is no API endpoint yet to download this report, or at least I could not find one after checking the Google Search Console API documentation. I expect that, if this report proves successful, an endpoint to download its data will be added in the future.

Fortunately, for the time being, I have found a workaround which works quite well using Selenium. Basically, we will simulate a Google Chrome browser session which signs into GSC, accesses the URL where this report can be exported as an Excel file, and downloads it. Once we have exported this Excel file, we can use its data to create nice plots with Matplotlib or alerts which are sent by email or any other messaging tool.

After this short explanation of what we are going to do in this post, let’s get the fun started!

1.- Automating the Data Extraction

To start with, we will automate the data extraction. As mentioned previously, we are going to use Selenium to navigate to the page where this report can be exported and export it. However, first we will need to log in through the OAuth Playground to access GSC, because if you try to log into GSC directly with Selenium, you will come across a message which will not let you access it for security reasons. If you are not familiar with Selenium, I recommend reading this article written by JC Chouinard to learn about it.

We will also use the webdriver_manager module to install the correct driver and the Keys class to press Enter after introducing our email address and password to log in. To select the different on-page elements, we will use their XPaths.

#We import these modules
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

#We install our Selenium Driver
driver = webdriver.Chrome(ChromeDriverManager().install())

#We access the OAuth Playground to log into Google services
driver.get('https://accounts.google.com/o/oauth2/v2/auth/oauthchooseaccount?redirect_uri=https%3A%2F%2Fdevelopers.google.com%2Foauthplayground&prompt=consent&response_type=code&client_id=407408718192.apps.googleusercontent.com&scope=email&access_type=offline&flowName=GeneralOAuthFlow')

#We wait a bit to make sure rendering is complete before selecting elements by XPath and introducing our email address.
time.sleep(10)
form1=driver.find_element_by_xpath('//*[@id="identifierId"]')
form1.send_keys("<your email address>")
form1.send_keys(Keys.ENTER)

#Same here, we wait a bit and then we introduce our password.
time.sleep(10)
form2=driver.find_element_by_xpath('//*[@id="password"]/div[1]/div/div[1]/input')
form2.send_keys("<your password>")
form2.send_keys(Keys.ENTER)

Now that we have logged into Google services, we can access the GSC URL from which we can export the data. We will access the HTML crawl statistics page and download the Excel file with the data about which HTML resources have been crawled in the last 90 days.

time.sleep(10)
property = "<your property>"
driver.get('https://search.google.com/search-console/settings/crawl-stats/drilldown?resource_id=' + property + '&file_type=1')

#We select the export button and click on it
button1=driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/span/c-wiz/div/div[1]/div[1]/div[2]')
button1.click()

#We select the download as an Excel file option and we start the downloading. Then, we wait a bit until the download is completed and we quit.
button2=driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div[2]/div/div/span[2]/div[3]/div')
button2.click()
time.sleep(20)
driver.quit()
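
As a side note, the fixed time.sleep() calls above are a blunt instrument: if the page renders slowly the script may fail, and if it renders quickly we waste time. A minimal sketch of an alternative, using Selenium's explicit waits (the XPath shown is the email field from the login step above), could look like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#We wait up to 20 seconds for the email field to be present instead of sleeping a fixed amount of time
email_field = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="identifierId"]'))
)
email_field.send_keys("<your email address>")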

After this, the download is complete! You should now have an Excel file whose name follows this pattern: “your property name”_-Crawl-stats-By-file-type-“extraction date”.xlsx. As you can see, if you run this piece of code every day, the only difference between the exports will be the extraction date, which uses the format “%Y-%m-%d” (at least for me in the Spanish version of GSC).

So, if we would like to get some daily data from this file, we can automate the data retrieval by creating a script that uses the present date as a variable to open the file:

#For example: we build today's file name and read it with Pandas
from datetime import date
import pandas as pd

today = date.today()
d1 = today.strftime("%Y-%m-%d")

df = pd.read_excel('yourfilename' + d1 + '.xlsx', sheet_name='secondtab')
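
If you schedule this, it also helps to know exactly where Chrome saves the export. A minimal sketch, assuming you are happy to set the download folder yourself (the path below is just a placeholder), would be to pass Chrome options when creating the driver at the beginning of the script:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

#We point Chrome to a fixed download folder so the scheduled script always knows where the export lands
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/path/to/your/downloads",  #placeholder path, replace with your own
    "download.prompt_for_download": False
})
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)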

2.- Creating some easy but insightful plots

After downloading the data, we can use it to create plots that visualize some interesting aspects of our crawl statistics. In this section, I’ll show you how to open and read the Excel file which contains the data and then how to draw two easy but insightful plots to analyze the most crawled pages and the crawl requests over time.

2.1.- Most crawled pages

To create this plot, we will first read the data from the Excel file with Pandas, convert it into a list and use a dictionary to count the occurrences of each URL. At the end of the process, we will create two lists to draw the plot with Matplotlib, removing the “danielherediamejias.com” host name from the URLs so that the plot labels look much better.

import pandas as pd

#We import, read the file and convert it into a Dataframe by using pandas.
df = pd.read_excel('yourfilename.xlsx', sheet_name='secondtab')
#In Spanish the second tab is called "Tabla"

#Now we convert the dataframe into a list
listdataframe = df.values.tolist()

#We use a dictionary to count the number of occurrences
d = {}

for url in listdataframe:
    d.setdefault(url[1],0)
    d[url[1]] += 1

#We convert the dictionary into a list of tuples sorted by the number of occurrences
sorted_d = sorted(d.items(), key=lambda x: x[1])

#We create two lists to draw the plot: one for the URLs and the other for the number of occurrences. We exclude URLs which contain a utm parameter and strip the host name.
urls = []
counturls = []
for x in sorted_d:
    if "utm" not in str(x[0]):
        urls.append(x[0].replace("https://www.danielherediamejias.com",""))
        counturls.append(x[1])

#We finally draw the plot after importing numpy and matplotlib

import numpy as np
import matplotlib.pyplot as plt
 
y_pos = np.arange(len(urls))
plt.barh(y_pos, counturls)
plt.yticks(y_pos, urls)
plt.show()

My first plot looks like this. As you can see, the URLs which are crawled the most are the homepage, the post about scraping on Instagram, the post about the Google Sheet to use the Google Knowledge API and the one about requesting your sitemaps to be crawled with Python.

2.2.- Crawl statistics over time

In this second plot, we are going to draw the HTML requests over time. First we will need to read the data from the first tab of the Excel file (called “Gráfico” in the Spanish version), then we will create two lists, one for the dates and the other one for the number of requests, and finally we will draw the plot.

#We import, read the file and convert it into a Dataframe by using pandas. This time we input the first tab as an argument
df = pd.read_excel('yourfilename.xlsx', sheet_name='firsttab')

#We create two lists: one for the dates and the other one for the number of requests
dates = []
crawls = []

for x in df.values.tolist():
    dates.append(x[0])
    crawls.append(x[1])

#We finally draw a very simple plot
plt.plot(dates, crawls)
plt.xticks(rotation=90)

plt.show()

My plot looks like this:

My website seems to be crawled quite often, although there are some days, like the 6th of November, when Googlebot did not crawl it at all; the maximum number of requests was 14, on the 11th of November.

3.- Setting up alerts

Imagine that you would like to get notified, on a daily basis, about each page that Googlebot crawls which returns a 404 response code, by running a scheduled script. For that, we will first need to download the 404 response code report from GSC, process that data and, finally, send an email (for instance) with those pages.

For this, we first need to replicate the previous process with Selenium, but in this case we request the URL where the 404 response code pages are listed and export that file. The report URL might look like: https://search.google.com/search-console/settings/crawl-stats/drilldown?resource_id=yourproperty%2F&response=2. Once the file is downloaded, it will have a slightly different name from the report about crawled HTML pages, following the pattern: <property name>-Crawl-stats-By-response-<extraction_date>.xlsx.
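
As a rough sketch of that step, we can reuse the same logged-in Selenium session and export buttons as before; I am assuming here that the XPaths of the export buttons are identical on this report, so double-check them on your side:

time.sleep(10)
property = "<your property>"
driver.get('https://search.google.com/search-console/settings/crawl-stats/drilldown?resource_id=' + property + '&response=2')

#We click on the export button and then on the Excel option, exactly as we did for the HTML report
button1 = driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/span/c-wiz/div/div[1]/div[1]/div[2]')
button1.click()
button2 = driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div[2]/div/div/span[2]/div[3]/div')
button2.click()
time.sleep(20)
driver.quit()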

After this, we need to open and read the Excel file. For that, we can use Pandas again and create a list with the 404 response code pages. Once the list is built, we will create an HTML message, intended to be sent by email, which will notify us of the URLs which are returning 404 response codes.

df = pd.read_excel('<404 file name>.xlsx', sheet_name='secondtab')
#Second tab in Spanish is called Tabla

#We append into the list those 404 response code URLs
urls_404 = []
for x in df.values.tolist():
    urls_404.append(x[1])

#We write a message with the 404 response code pages
message = "<h1>These URLs return 404 response codes:</h1><p>"
for x in urls_404:
    message = message + x + "<br>"
    
message = message + "</p>"

Once the message is created and formatted, we can use this piece of code, which I got from a previous post, to send an email using a Gmail account:

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

#We set the password, the sender, the receiver and the subject of the email
msg = MIMEMultipart()
password = '<your email address password>'
msg['From'] = "<Sender email address>"
msg['To'] = "<Receiver email address>"
msg['Subject'] = "404 response code GSC"

#We attach the message and specify its format, in this case HTML
msg.attach(MIMEText(message, 'html'))

#We create the server instance from which the email is sent
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()

#Login credentials for sending the mail
server.login('<your email address>', password)

#We send the message via the server and close the connection
server.sendmail(msg['From'], msg['To'], msg.as_string())
server.quit()

The final message that I will receive in my email inbox will be something like this:

Apart from creating these alerts, I think it might be a good idea to store this data in a SQL table so that you can keep granular records of your crawl statistics beyond the 90-day range provided by Google Search Console. In the post that I wrote about what to do with your Python outputs, I explain how you can store data in a SQL table.
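
As a minimal sketch of that idea, assuming a local SQLite database (the file and table names below are just placeholders), you could append each day's export to a table with Pandas:

import sqlite3
import pandas as pd

#We read the daily export (reusing the d1 date string from earlier) and append its rows to a SQLite table
df = pd.read_excel('yourfilename' + d1 + '.xlsx', sheet_name='secondtab')
conn = sqlite3.connect('crawl_stats.db')  #placeholder database file
df.to_sql('crawl_stats', conn, if_exists='append', index=False)
conn.close()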

That’s all, folks. I hope you find this post interesting!