Some months ago I published an article on my blog about how you can analyze the internal linking of your website with Python and NetworkX. In that article I mainly used graphs to analyze the internal linking profile in general and the relationship between specific sections and contextual internal linking, and I also introduced some external elements such as keywords, backlinks or page authority by using colors and node sizes to try to paint the full picture.

In today’s article I am going to take a different approach, one that is not as visual but that I think can be very useful and even more actionable for finding internal linking opportunities. Basically, what I am going to do is:

  1. Extract the keywords of my website from Semrush.
  2. Get the URLs which are ranking for these keywords and eliminate the duplicates.
  3. Crawl these URLs with Requests and BeautifulSoup and store their texts.
  4. Iterate over all the keywords to find exact matches within the texts obtained from the crawl.

I actually find this a very interesting approach to finding internal linking opportunities because, on the one hand, the URLs that we are checking are already in Google’s index and showing up for some keywords, so they are likely to carry some value, and on the other hand, we can find powerful contextual exact-match anchor texts.

Does this sound interesting? Then let’s get started!

1.- Importing the Excel File

First of all, we will need to export the keywords from Semrush as an Excel or CSV file and import them into our notebook. We can make use of Pandas to import the file, and after that we will transform the DataFrame into a list.

import pandas as pd

# Read the Semrush export and turn the DataFrame into a list of rows
list_keywords = pd.read_excel('<your Excel file>.xlsx')
list_keywords = list_keywords.values.tolist()
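
If you exported a CSV file instead, a minimal variant (assuming the same column layout as the Excel export) would be:

import pandas as pd

# Same idea for a CSV export; the column layout is assumed to be identical
list_keywords = pd.read_csv('<your CSV file>.csv')
list_keywords = list_keywords.values.tolist()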

2.- Getting the URL list

Now we will create our URL list and use dict.fromkeys to eliminate the duplicated URLs while preserving their order. We will use this list to crawl the URLs and extract the texts.

# Column 6 of my Semrush export holds the ranking URL
list_urls = []
for x in list_keywords:
    list_urls.append(x[6])

# dict.fromkeys keeps only the first occurrence of each URL
list_urls = list(dict.fromkeys(list_urls))

Apart from the list with the URLs, we will need to create a list which contains the keywords together with the URLs. We will use this list to iterate over the keywords and check if any of them appears in any of the extracted texts. We also add the URLs to this list so that we can ignore those exact matches that occur on the very page that is already ranking for that keyword.

# Each entry: [ranking URL, keyword, position] (columns 6, 0 and 1 of my export)
list_keyword_url = []
for x in list_keywords:
    list_keyword_url.append([x[6], x[0], x[1]])
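
Bear in mind that the column indexes used above (6 for the URL, 0 for the keyword and 1 for the position) match my particular Semrush export; yours might have a different layout. A quick sanity check, printing each column name with its positional index, could be:

import pandas as pd

# Print each column header together with its positional index
df = pd.read_excel('<your Excel file>.xlsx')
for index, column in enumerate(df.columns):
    print(index, column)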

3.- Crawling the pages and finding the matches

Finally, we will iterate over the list of URLs, extract the content within the “p” tags and check if any of the keywords can be found in those texts. At the end of the for loops, we store the keyword, the paragraph and the URL in a list that we will later export as an Excel file with the internal linking opportunities.

import requests
from bs4 import BeautifulSoup

internal_linking_opportunities = []
for iteration in list_urls:

    # Download the page and extract the text of every paragraph
    page = requests.get(iteration)
    print(iteration)
    soup = BeautifulSoup(page.text, 'html.parser')
    paragraphs = soup.find_all('p')
    paragraphs = [x.text for x in paragraphs]

    # Flag a keyword found in a paragraph of a page other than the one ranking for it
    for x in list_keyword_url:
        for y in paragraphs:
            if x[1].lower() in y.lower() and iteration != x[0]:
                internal_linking_opportunities.append([x[1], y, iteration])

Basically, we use the conditional statement to detect whether any of the keywords appears inside the text, while discarding those cases where the keyword belongs to the page that is already ranking for it. We use the lower method to make the match case-insensitive.
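
One caveat of this plain substring check is that it also matches keywords hidden inside longer words; the fine-tuned script below works around it with spaces and punctuation stripping, but as an alternative sketch, a regular expression with word boundaries would do the same job (is_whole_word_match is just a hypothetical helper name):

import re

def is_whole_word_match(keyword, text):
    # \b matches word boundaries, so "logs" will not match inside "blogs"
    return re.search(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE) is not None

print(is_whole_word_match("logs", "I have several blogs"))       # False
print(is_whole_word_match("logs", "check the server logs now"))  # True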

4.- Exporting as an Excel file

Finally, we only need to use Pandas to export it as an Excel file:

pd.DataFrame(internal_linking_opportunities, columns = ["Keyword", "Text", "URL"]).to_excel('<your filename>.xlsx', header=True, index=False)

My final output looks like this:

Maybe not an impressive output, but bear in mind that my website is not very big and does not rank for that many keywords. It can be much more impactful on bigger websites with many URLs, although in those cases the scraping process could take much longer. A good approach on big websites might be to take only the keywords and URLs that are ranking within the first 10 or 20 positions.
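
For instance, a minimal sketch of that filter, assuming (as in my export) that the position sits in column 1 of each row:

# Keep only the keywords ranking within the first 20 positions
list_keywords = [x for x in list_keywords if x[1] <= 20]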

5.- Fine-tuned script

After using my script, Michael Pinatton came up with some suggestions to improve it:

  • Add the destination URL we got from Semrush: we have the keyword, but which page do we link back to? It could be useful for bigger websites.
  • Check if there is already a link pointing to the destination URL.
  • Sometimes the keyword is part of a longer word and would get flagged anyway. For instance: “logs” would be flagged inside the word “blogs”.
  • Add the position from Semrush to make the decision of adding a link easier (suggested by Brahim Hassine).

In the polished version, which can be found below, I have added these functionalities, which I think make the script much more useful. Once the data is saved in the Excel file, there will be extra columns with the destination URL, the keyword position and whether there is already a link from the source URL to the destination URL.

In this new version you will only need to add your HTTP protocol with the domain (for instance, https://www.danielherediamejias.com) through an input, so that links which are added with a relative path can also be checked. Thank you so much for your feedback, Michael!
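
To illustrate why stripping the absolute route matters: once the domain is removed, an absolute href and its relative counterpart compare equal (the /python-seo/ path below is just a made-up example):

# Both forms of the same link become identical once the absolute route is stripped
absolute_route = "https://www.danielherediamejias.com"
print("https://www.danielherediamejias.com/python-seo/".replace(absolute_route, ""))  # /python-seo/
print("/python-seo/".replace(absolute_route, ""))                                     # /python-seo/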

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scheme plus domain, e.g. https://www.danielherediamejias.com
absolute_route = input("Insert your absolute route: ")
internal_linking_opportunities = []
for iteration in list_urls:

    # Download the page and extract the text of every paragraph
    page = requests.get(iteration)
    print(iteration)
    soup = BeautifulSoup(page.text, 'html.parser')
    paragraphs = soup.find_all('p')
    paragraphs = [x.text for x in paragraphs]

    # Collect every href on the page to check later if the link already exists
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

    for x in list_keyword_url:
        for y in paragraphs:
            # Strip punctuation and pad with spaces so that only whole words
            # match ("logs" is no longer flagged inside "blogs")
            clean_text = y.lower()
            for punctuation_mark in [",", ".", ";", "?", "!"]:
                clean_text = clean_text.replace(punctuation_mark, "")

            if " " + x[1].lower() + " " in " " + clean_text + " " and iteration != x[0]:
                # Check whether the source page already links to the destination URL,
                # comparing relative paths so absolute and relative hrefs both count
                links_presence = False
                for z in links:
                    try:
                        if x[0].replace(absolute_route, "") == z.replace(absolute_route, ""):
                            links_presence = True
                    except AttributeError:
                        # Anchors without an href attribute return None
                        pass

                internal_linking_opportunities.append([x[1], y, iteration, x[0], str(links_presence), x[2]])


pd.DataFrame(internal_linking_opportunities, columns=["Keyword", "Text", "Source URL", "Target URL", "Link Presence", "Keyword Position"]).to_excel('<your_filename>.xlsx', header=True, index=False)
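
Finally, if you run this on a big website, some requests will eventually fail or hang and stop the whole loop. A minimal sketch of a safer download helper (fetch_page, the timeout and the delay are my own additions, not part of the original script) could look like this; in the loop you would call fetch_page(iteration) and skip the URL when it returns None:

import time
import requests

def fetch_page(url, timeout=10, delay=1):
    # Return the HTML of the page, or None if the request fails or times out
    time.sleep(delay)  # Small pause to be polite with the server
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException as error:
        print(url, "skipped:", error)
        return None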