In today’s post we are going to explain how to create an alerting system with the Google Analytics API and Python to detect underperforming SEO pages, based on the traffic mean and standard deviation for each day of the week. Basically, this alerting system will use the following logic:

  • We get from the Google Analytics API the number of organic sessions broken down by landing page for yesterday and for every previous occurrence of yesterday’s day of the week within a specific time range. For example, if yesterday was Monday, we will extract the traffic for every Monday over that period. One of the good things about taking this approach is that we avoid any bias related to the day of the week.
  • We calculate the mean and the standard deviation of the number of sessions for each landing page and we build a confidence interval (see the short sketch after this list).
  • We compare the lower bound of that interval with yesterday’s performance for each landing page. If yesterday’s performance is lower than expected for a page, that page will be saved and further checks will be run.
  • We can automate some on-page SEO checks to find out whether the underperformance is due to an on-page issue such as a noindex tag.
  • Finally, an email can be sent out alerting about the underperforming pages and the findings from the automated SEO checks.
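To make this logic more tangible before diving in, here is a minimal, self-contained sketch of the lower-bound check for a single page, using made-up session counts (the variable names here are purely illustrative and are not the ones used in the rest of the post):

import statistics
import math

# Made-up organic sessions for one page on the previous Mondays
previous_sessions = [120, 135, 110, 128, 140, 118]
sessions_yesterday = 70

mean = statistics.mean(previous_sessions)
dev = statistics.stdev(previous_sessions)

# Lower bound of a 95% confidence interval for the mean number of sessions
lower_bound = mean - 1.960 * (dev / math.sqrt(len(previous_sessions)))

if sessions_yesterday < lower_bound:
    print("This page underperformed yesterday and should be checked.")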

Does this sound like an interesting way to spot SEO issues and minimize the losses coming from these errors? If you think so, then this article will be of interest to you!

1.- Setting up Google Analytics API

First of all, in order to make use of the Google Analytics API, we need to create a project on Google’s developer console, enable the Google Analytics Reporting service and get the credentials. This article written by Jean-Christophe Chouinard explains very well how to move forward with this (I actually used it myself to sort out everything related to the credentials before I started playing with the API).

Regarding the requests, we are going to use the same format that Jean-Christophe used in his post to query the data from Google Analytics.

2.- Getting the desired report from GA API

After setting up the Google Analytics API, we can start playing around and making our first requests. To start with, we will need to import some Python libraries and authenticate (the apiclient and oauth2client modules below come from the google-api-python-client and oauth2client packages, which can be installed with pip):

from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'path to your credentials'
VIEW_ID = 'XXXXXXXXX'

credentials = ServiceAccountCredentials.from_json_keyfile_name(KEY_FILE_LOCATION, SCOPES)
analytics = build('analyticsreporting', 'v4', credentials=credentials)

Now, we can make our first request. As explained previously, what we want to get is a report with the organic sessions broken down by page for yesterday’s day of the week over the chosen time range. We can get such a report with the following request:

from datetime import date, timedelta
import datetime

day_week = datetime.datetime.today().weekday()
time_range = 60

response = analytics.reports().batchGet(body={
    'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': str(time_range) + 'daysAgo', 'endDate': 'yesterday'}],
        'metrics': [
            {"expression": "ga:sessions"}
        ], "dimensions": [
            {"name": "ga:landingPagePath"},
            {"name": "ga:date"}
        ],
        'orderBys': [{"fieldName": "ga:sessions", "sortOrder": "DESCENDING"}],
        "filtersExpression":"ga:channelGrouping=~Organic;ga:dayOfWeek==" + str(day_week), 
    }]}).execute()

The day_week variable will hold yesterday’s day of the week, and the time_range variable can be changed to whatever period you want to analyze. In this case, I used a time range of 60 days.

Note: for the day_week variable we use today’s date because the datetime library starts its day-of-the-week count on Monday, while the Google Analytics API starts its count on Sunday. Thanks to this offset, today’s weekday() value in Python corresponds to yesterday’s day of the week in Google Analytics.
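If you want to verify this offset yourself, the quick sketch below (just an illustration, not part of the final script) shows that today’s weekday() value in Python always matches yesterday’s ga:dayOfWeek value:

import datetime

# Python's weekday(): Monday = 0 ... Sunday = 6
# Google Analytics ga:dayOfWeek: Sunday = 0 ... Saturday = 6
today_date = datetime.datetime.today()
yesterday_date = today_date - datetime.timedelta(days=1)

# Shift Python's weekday so that Sunday = 0, as Google Analytics expects
yesterday_ga_day = (yesterday_date.weekday() + 1) % 7

print(today_date.weekday() == yesterday_ga_day)  # Always prints True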

In addition to the report with the organic sessions for each landing page, we will need another report that returns how many occurrences of that day of the week fall within the time range, because we need the total number of observations to calculate the mean, the standard deviation and the confidence interval.

response2 = analytics.reports().batchGet(body={
    'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': str(time_range) + 'daysAgo', 'endDate': 'yesterday'}],
        'metrics': [
            {"expression": "ga:sessions"}
        ], "dimensions": [
            {"name": "ga:date"}
        ],
        'orderBys': [{"fieldName": "ga:sessions", "sortOrder": "DESCENDING"}],
        "filtersExpression":"ga:dayOfWeek==" + str(day_week), 
    }]}).execute()


number_of_days = len(response2["reports"][0]["data"]["rows"])

We save the number of observations in a variable called number_of_days.

Note: more information about the different dimensions and metrics that can be used for your reports can be found on this page.

3.- Parsing the data

Now we need to parse the data that we got from the previous request. First of all, we will iterate over the JSON response and split the values into two groups: yesterday’s sessions on one side and the sessions from the rest of the days on the other.

yesterday = date.today() - timedelta(days=1)
yesterday = yesterday.strftime('%Y%m%d')

dict_values = {}
yesterday_sessions = []

domain = "https://www.danielherediamejias.com"

for x in response["reports"][0]["data"]["rows"]:

    # Store the sessions of every previous day in a dictionary keyed by URL
    if domain + x["dimensions"][0] in dict_values:
        dict_values[domain + x["dimensions"][0]].append(int(x["metrics"][0]["values"][0]))
    else:
        dict_values[domain + x["dimensions"][0]] = [int(x["metrics"][0]["values"][0])]

    # Keep yesterday's sessions in a separate list
    if str(yesterday) == str(x["dimensions"][1]):
        yesterday_sessions.append([domain + x["dimensions"][0], int(x["metrics"][0]["values"][0])])

After separating the values into these two groups, we will have to tweak the lists with the previous values by adding some zeros. The reason is that if a page has no organic sessions on a given day, the API does not return a row for it, which could cause miscalculations: the number of observations would be different for each landing page and the days with no sessions would not be taken into account.

So we iterate over the dictionary values and append zeros until the length of each list is equal to the number_of_days variable.

for x in dict_values.values():
    while len(x) != number_of_days:
        x.append(0)

4.- Calculating the confidence interval and finding the underperforming pages

After parsing the data and putting together the number of sessions from the previous days, it is time to calculate the mean, the standard deviation and the confidence interval.

import statistics
import math

list_stats = []
for k, v in dict_values.items():
    mean = statistics.mean(v)
    dev = statistics.stdev(v)
    # Lower bound of a 95% confidence interval for the mean number of sessions
    confidence_interval = round(mean - 1.960 * (dev / math.sqrt(number_of_days)))
    list_stats.append([k, mean, dev, confidence_interval])

Note: we only calculate the lower bound of the interval. However, it could also be a good idea to calculate the upper bound to spot overperforming pages, in order to optimize them even more if possible and enhance their performance even further.
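If you wanted to add the upper bound as well, a minimal sketch reusing the same variables as above could look like this (overperforming pages would then be those whose sessions yesterday exceed the upper bound):

list_stats_upper = []
for k, v in dict_values.items():
    mean = statistics.mean(v)
    dev = statistics.stdev(v)
    # Upper bound of the 95% confidence interval for the mean number of sessions
    upper_interval = round(mean + 1.960 * (dev / math.sqrt(number_of_days)))
    list_stats_upper.append([k, mean, dev, upper_interval])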

Finally, we will compare the lower bound of the sessions interval for each landing page with yesterday’s results and append to a list those URLs which are not performing well.

list_underperforming = []
for x in yesterday_sessions:
    for z in list_stats:
        if x[0] == z[0] and int(x[1]) < int(z[3]):
            list_underperforming.append([x[0],int(x[1]) - int(z[3])])

Note: you could add a conditional statement so that the list of underperforming pages only stores those URLs which have a minimum number of sessions that you consider relevant (see the sketch below).
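As a sketch of that conditional, this alternative version of the loop above only keeps pages whose average traffic is above an arbitrary minimum_sessions threshold (a value you would adapt to your own site):

minimum_sessions = 10  # Arbitrary threshold: ignore pages with very little expected traffic

list_underperforming = []
for x in yesterday_sessions:
    for z in list_stats:
        # z[1] is the mean number of sessions calculated earlier
        if x[0] == z[0] and int(x[1]) < int(z[3]) and z[1] >= minimum_sessions:
            list_underperforming.append([x[0], int(x[1]) - int(z[3])])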

5.- Automated on-page SEO checks

With this list of underperforming pages we can run automated on-page SEO checks. With the piece of code below we will check whether the URL has a noindex tag, whether it is blocked by robots.txt and whether its canonical does not match the URL. If any of these problems are detected, they will be saved in a list and highlighted in the email alert that is going to be sent.

import requests
from bs4 import BeautifulSoup
import urllib.robotparser

# The robots.txt file only needs to be downloaded and parsed once
rp = urllib.robotparser.RobotFileParser()
rp.set_url(domain + "/robots.txt")
rp.read()

for x in list_underperforming:
    problems = ""
    headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
    r = requests.get(x[0], headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    # The canonical tag might be missing, so we guard against that case
    canonical_tag = soup.find('link', {'rel': 'canonical'})
    canonical = canonical_tag["href"] if canonical_tag is not None else ""

    robotstxt = rp.can_fetch("*", x[0])

    if "noindex" in str(soup.find('head')):
        problems = problems + "Noindex tag has been detected. "

    if robotstxt == False:
        problems = problems + "URL blocked by robots.txt. "

    if canonical != x[0]:
        problems = problems + "Canonical does not match URL."

    if problems == "":
        problems = "No problems have been found."
    x.append(problems)

Note: more automated checks could be incorporated, although I thought it was better to keep it simple for this demo.
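For instance, one extra check that could be plugged into the same loop is making sure the page still returns a 200 status code. The helper below is only a suggestion (check_status_code is a hypothetical name, not something used elsewhere in this post):

import requests

def check_status_code(url, headers=None):
    # Returns a problem message if the URL does not respond with a 200 status code
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return "The URL returns a " + str(response.status_code) + " status code. "
    return ""

# Inside the loop above you could then add:
# problems = problems + check_status_code(x[0], headers)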

6.- Sending an alert by email

Finally, we can use the smtplib and email libraries to send out an email that notifies us about the underperforming pages and their indexability problems, if any are found. Some information about how to configure your settings can be found in this article: What to do with your outputs when running Python scripts?

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

# Build the HTML body of the email with one paragraph per underperforming URL
message = ""
for x in list_underperforming:
    message = message + "<p>The URL: " + x[0] + " is underperforming " + str(x[1]) + " organic sessions. " + x[2] + "</p>"


#We enter the password, the email addresses and the subject for the email
msg = MIMEMultipart()
password = 'yourpassword'
msg['From'] = "Desired email sender"
msg['To'] = "email receptor"
msg['Subject'] = "Underperforming URLs GA"

#It attaches the message and its format, in this case HTML
msg.attach(MIMEText(message, 'html'))

#It creates the server instance from which the email is sent
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()

#Login credentials for sending the mail
server.login('email address sender', password)

#Send the message via the server
server.sendmail(msg['From'], msg['To'], msg.as_string())
server.quit()

You should receive an email that looks something like this:

That is all, folks! I hope you find this post useful, and if you have any questions or would like to share your feedback with me, do not hesitate to reach out!

FAQs section

Which libraries do you need?

You will only need google-api-python-client, oauth2client, statistics, math, requests, Beautiful Soup, urllib and the built-in email and smtplib modules.

What will you learn in this post?

You will learn how to use Google Analytics and Python to create alerts for underperforming URLs and run automated checks on those URLs. Finally, an email will be sent to notify you about them.

How long will it take?

The code is already available so it should be quite fast.