In today's post I am going to show you how you can use Python to find term occurrences and improve your on-page optimization. Basically, what we are going to do is:

  1. Using a keyword export from Semrush, we will extract keywords and the URLs ranking for those keywords.
  2. We will iterate over the list of URLs, scrape their content and search for the keyword occurrences in metatitles, metadescriptions, H1s, H2s and paragraphs.
  3. Finally, we will download this data as an Excel file with conditional formatting that will help us spot possible optimizations, working similarly to a heatmap.
  4. From my point of view, the final cherry-picking of optimizations needs to be done manually, and some of the possible optimizations can be disregarded in order not to over-optimize the page and to keep it natural.

The final Excel file will look as follows:

Does it sound interesting? Let’s get started then!

1.- Importing the data from Semrush

Initially, we will need to download a keyword-level report from Semrush and import it into our notebook. For that, we will use pandas.

import pandas as pd

keywords = pd.read_excel('<import-file-name>.xlsx')

Secondly, with the purpose of finding the low-hanging fruit and maximizing the return on this exercise, we will leave out those keywords which are ranking outside of the top 15, although depending on the number of keywords and the current rankings, you can set your threshold higher or lower.

# Keep only the keywords ranking within the top 15 positions
low_hanging = keywords[keywords['Position'] <= 15]
low_hanging_list = low_hanging.values.tolist()

To avoid having to crawl a URL several times when we iterate over the list, we will adjust the format of our input, transforming the list into a dictionary where the URL will be the key and the keyword, the current ranking and the number of monthly searches will be saved as its values.

# Group the keywords by URL so that each page only needs to be crawled once.
# In the Semrush export: urls[0] = keyword, urls[1] = position,
# urls[3] = monthly searches, urls[6] = URL.
dict_urls = {}
for urls in low_hanging_list:
    if urls[6] in dict_urls:
        dict_urls[urls[6]] += [[urls[0], urls[1], urls[3]]]
    else:
        dict_urls[urls[6]] = [[urls[0], urls[1], urls[3]]]

The format is now ready, so we can proceed with the web scraping and find the term occurrences!

2.- Scraping the URLs and finding the occurrences

To scrape the URLs we will use the Python library called cloudscraper. As I have already mentioned in other posts, I really like this library, which depends on Requests and requests_toolbelt, as it enables you to scrape sites that use Cloudflare without getting blocked.

In addition, to be able to parse the HTML response we will use BeautifulSoup, which will make the response parsable. If you are interested in web scraping for SEO or other purposes, you can read my previous article, where I explained how you can extract all the content from a page based on its selectors.

When we search for the term occurrences, we will aim for a broad match, meaning that we will check whether the terms are present in the content separately instead of checking whether the exact term combination is present. In my opinion, this is a better approach, as it copes with cases where the order of the keyword terms would not alter their meaning, or where articles and prepositions are used in natural language but neglected in search queries.

However, in case you would like to search for exact matches, you could also use that approach by tweaking this piece of code a bit.
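For instance, a minimal sketch of such an exact-match check could look like the one below; the contains_exact helper is hypothetical and not part of the code that follows:

def contains_exact(keyword, text):
    # Hypothetical helper: "True" only if the whole phrase appears verbatim
    return "True" if keyword.lower() in str(text).lower() else "False"

You would then call it once per tag, for example contains_exact(y[0], metatitle), instead of looping over the individual terms. With that said, here is the broad-match version: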

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()

for key, values in dict_urls.items():

    print(key)

    html = scraper.get(key, headers={"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(html.text, 'html.parser')

    # Extract the metatitle, metadescription, headings and paragraphs,
    # falling back to an empty string if a tag is missing
    title_tag = soup.find('title')
    metatitle = title_tag.get_text() if title_tag else ""
    description_tag = soup.find('meta', attrs={'name': 'description'})
    metadescription = description_tag["content"] if description_tag else ""
    h1 = [a.get_text() for a in soup.find_all('h1')]
    h2 = [a.get_text() for a in soup.find_all('h2')]
    paragraph = [a.get_text() for a in soup.find_all('p')]

    for y in values:

        # Assume every element contains the keyword until a term is not found
        metatitle_occurrence = "True"
        metadescription_occurrence = "True"
        h1_occurrence = "True"
        h2_occurrence = "True"
        paragraph_occurrence = "True"

        # Broad match: check each term of the keyword separately
        for z in y[0].lower().split(" "):

            if z not in metatitle.lower():
                metatitle_occurrence = "False"

            if z not in metadescription.lower():
                metadescription_occurrence = "False"

            if z not in str(h1).lower():
                h1_occurrence = "False"

            if z not in str(h2).lower():
                h2_occurrence = "False"

            if z not in str(paragraph).lower():
                paragraph_occurrence = "False"

        y.extend([metatitle_occurrence, metadescription_occurrence, h1_occurrence, h2_occurrence, paragraph_occurrence])

This piece of code will append a value to the dictionary for each keyword and tag, storing "False" if the keyword is not found and "True" if it is present. The values are stored as strings so that the conditional formatting in the next step can match them.
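For illustration, with a hypothetical URL, keyword, ranking and search volume, an entry of dict_urls would roughly look like this after the loop:

# Illustrative shape (all values are hypothetical):
# dict_urls = {
#     "https://www.example.com/page": [
#         ["example keyword", 8, 1200, "True", "False", "True", "False", "True"]
#     ]
# }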

3.- Downloading as an Excel file

Finally, we will download this dictionary as an Excel file using the openpyxl library, which will enable us to add the conditional formatting shown in the initial screenshot.

from openpyxl import Workbook
from openpyxl.formatting import Rule
from openpyxl.styles import Font, PatternFill
from openpyxl.styles.differential import DifferentialStyle

wb = Workbook()
dest_filename = 'new_document.xlsx'
ws1 = wb.active

# Write the header row once
headers = ["URL", "Keyword", "Ranking", "Searches", "Metatitle Occurrence",
           "Metadescription Occurrence", "H1 Occurrence", "H2 Occurrence",
           "Paragraph Occurrence"]
for column, header in enumerate(headers, start=1):
    ws1.cell(row=1, column=column).value = header

# Write one row per keyword, repeating the URL in the first column
number = 2

for key, values in dict_urls.items():
    for list_values in values:
        ws1.cell(row=number, column=1).value = key
        column = 2
        for iteration in list_values:
            ws1.cell(row=number, column=column).value = iteration
            column += 1
        number += 1
    

# Red style for missing terms, green style for terms that are present
red_text = Font(color="9C0006")
red_fill = PatternFill(bgColor="FFC7CE")
green_text = Font(color="FFFFFF")
green_fill = PatternFill(bgColor="009c48")

dxf = DifferentialStyle(font=red_text, fill=red_fill)
dxf2 = DifferentialStyle(font=green_text, fill=green_fill)

rule = Rule(type="containsText", operator="containsText", text="False",
            formula=['NOT(ISERROR(SEARCH("False",A1)))'], dxf=dxf)
rule2 = Rule(type="containsText", operator="containsText", text="True",
             formula=['NOT(ISERROR(SEARCH("True",A1)))'], dxf=dxf2)

ws1.conditional_formatting.add('A1:I' + str(number), rule)
ws1.conditional_formatting.add('A1:I' + str(number), rule2)

wb.save(filename=dest_filename)

That is it! This will make the magic happen and export the data as an Excel file with the conditional formatting.

4.- Getting the data from Google Search Console

Alternatively, you can also use data from Google Search Console to run this piece of code. You would only need to extract the data by using the GSC API, or you can simply download it as an Excel file and import it into your notebook with pandas, as done before with the Semrush export.

If you would like to give the GSC API a try and you are not familiar with it, I highly recommend having a look at the amazing guide that JC Chouinard created to walk you through almost every single step of the set-up process.
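Assuming your credentials are already set up (the guide above covers that part), a minimal query sketch might look like the one below; the property URL, the date range and the credentials object are placeholders you would need to replace:

import pandas as pd
from googleapiclient.discovery import build

# Assumes `credentials` is an authorized OAuth object created by following the guide
service = build('searchconsole', 'v1', credentials=credentials)

request = {
    'startDate': '2022-01-01',   # placeholder date range
    'endDate': '2022-03-31',
    'dimensions': ['query', 'page', 'date'],
    'rowLimit': 25000,
}
response = service.searchanalytics().query(siteUrl='https://www.example.com/', body=request).execute()

# Flatten the daily rows into a dataframe similar to the Semrush export
df = pd.DataFrame(
    [[row['keys'][0], row['keys'][1], row['impressions'], row['position']]
     for row in response.get('rows', [])],
    columns=['query', 'page', 'impressions', 'avg_position'])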

When retrieving the data from the Google Search Console API, it is possible that you will need to make some tweaks, as the data is fetched day by day. In my case, what I did was group the number of impressions and the average position by keyword and URL, simulating the export from Semrush.

# Aggregate impressions and average position per keyword and URL pair
sum_df = df.groupby(['query', 'page']).agg({'impressions': 'sum', 'avg_position': 'mean'})
sum_df.avg_position = sum_df.avg_position.round(0)
sum_df = sum_df.sort_values(by=['impressions'], ascending=False)

After grouping the impressions and the average position, you can reuse the rest of the Semrush piece of code with the data from Google Search Console, as sketched below.
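A minimal sketch of that adaptation could look like this, rebuilding the dict_urls dictionary from the grouped dataframe so that the scraping loop above can be reused (here impressions stand in for the search volume column):

# Rebuild dict_urls from the grouped GSC data: keyword, position, searches
dict_urls = {}
for (query, page), row in sum_df.iterrows():
    dict_urls.setdefault(page, []).append([query, row['avg_position'], row['impressions']])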

That is all, folks. I hope that you found this blog post interesting!