Every SEO knows how important internal linking is for improving website crawlability and for making your most valuable pages accessible and prominent within the network that your website represents. Even if the link juice theory has lately been questioned more than ever in favor of theories more centered on user experience, an internal linking audit is still an important step that should not be missed when running an SEO audit or building a new website, although the approach, and the actions taken once such an audit is done, may need to be slightly different.

In today’s post I am going to audit the internal linking structure of my own website, https://www.danielherediamejias.com, with Python, NetworkX and export files from tools like Screaming Frog, Ahrefs and Semrush. For most of the analyses we are going to draw diagrams representing the internal linking structure of the website, taking a very similar approach to the crawl diagram that can be built in Screaming Frog.

However, the main difference between drawing these graphs with Screaming Frog and drawing them with Python (or any other programming language) is that we own the data and can customize the graphs as much as we wish. This enables us to run more sophisticated, ad hoc analyses for each website depending on the pain points or needs detected, and to integrate as many data sources as we like to get the full picture.

In this post I am going to show you how to:

  • Draw crawl diagrams with Python.
  • Change the size of the nodes based on the internal link count, the number of keywords from Semrush, or the backlinks and page authority from Ahrefs.
  • Change the node colors and include specific labels.
  • Analyze the relationship between specific sections.
  • Analyze contextual internal linking.

Does it sound good so far? So let’s get it started!

If you are interested in finding internal linking opportunities, you can also read this article where I explain how you can find potential internal links with a Semrush export and Python.

1.- Creating a simple diagram

To start with, we are going to export the “all_inlinks” report from Screaming Frog, import it with Pandas and convert the dataframe into a list:

import pandas as pd

df = pd.read_excel('all_inlinks.xlsx')
list_inlinks = df.values.tolist()

Now, in order to draw our graph, we need to create a list where each link is represented as a tuple, in this case the source page of the link and the destination page of the link. For that we will iterate over the list created previously and build a new list in the required format:

listwithtuples = []

for x in list_inlinks:
    if x[0] == "Hyperlink" and x[1].count("/") <= x[2].count("/") and "www.danielherediamejias.com" in x[2]:
        tuple_links = (x[1],x[2])
        listwithtuples.append(tuple_links)

With these conditional statements we discard external links and anything that is not a hyperlink. We also only append links coming from an upper directory or from the same directory depth, as we want to analyze the internal linking from top to bottom and see how the deepest pages are linked, as well as the linking among pages within the same directory.

Finally, before we draw our first graph, we can eliminate the duplicate links by using:

noduplicates = list(dict.fromkeys(listwithtuples))

Now we are ready to create our first graph! For that, we first need to install and import NetworkX and add the list with the internal links as edges:

import networkx as nx

G=nx.Graph()
G.add_edges_from(noduplicates)

Once we have added the edges, we need to plot the graph with Matplotlib. We will also use NumPy to work out a suitable distance between the nodes, which is passed to the spring layout as the k argument. In addition, we will use some Matplotlib functions such as plt.figure to set the size of the figure, plt.show() to display the figure in our notebook and plt.savefig to save the figure as an image if needed.

from matplotlib import pyplot as plt
import numpy as np

pos = nx.spring_layout(G, k=0.3*1/np.sqrt(len(G.nodes())), iterations=20)
plt.figure(3, figsize=(30, 30))
nx.draw(G,pos, with_labels=True)

plt.savefig("<yourfilename.png>")

plt.show()

The final output that I have gotten for my website is:

My website mainly comprises two sections: the blog posts and the trends catcher tool, which uses pytrends and tweepy to display the hottest trends for each country. One problem visible at first sight is that the hottest trends pages for each country are only linked from the International Trends Catcher lobby page, which could be creating a bottleneck and hindering the indexing of the regional trends pages.
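If you want to confirm this kind of bottleneck from the data rather than from the picture, a minimal sketch like the one below lists every page that is connected to a single other page only. It reuses the undirected graph G built above, so it counts any connection rather than strictly inlinks, and the variable names are just illustrative:

single_connection_pages = [node for node, degree in G.degree if degree == 1]
print(len(single_connection_pages), "pages are connected to only one other page")
for page in single_connection_pages[:10]:
    print(page, "-> connected to", list(G.neighbors(page))[0])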

2.- Making more insightful and visual graphs

After this first approach to NetworkX, we are going to add some elements to make these diagrams much more insightful at first sight. These elements are:

  • Differentiate pages by color depending on their indexability status.
  • Add the labels only for the main pages and set a larger font size for the labels.
  • Change the size of each node depending on the number of inlinks it receives.

So, let’s start with the colors depending on the indexability status! For that, we will first need to export the internal_html report from Screaming Frog and import it with Pandas:

df = pd.read_excel('internal_html.xlsx')
list_all = df.values.tolist()

After importing the internal_html report and converting it into a list, we will iterate over the list and the current nodes. If the indexability status is “Non-Indexable”, we will assign a red color to that node; if it is “Indexable”, we will assign a blue color; and if the node is not found in the internal_html report, as can happen with images or other types of pages, we will assign a yellow color. All these colors are appended to a list which will be used afterwards when drawing the diagram.

list_colors = []
for y in G.nodes:
    match = False
    for x in list_all:
        if y == x[0]:
            match = True
            if x[4] == "Non-Indexable":
                list_colors.append("red")
            else:
                list_colors.append("blue")
            break # stop after the first match so that exactly one color is appended per node

    if match == False:
        list_colors.append("yellow")

When it comes to changing the node size depending on the number of inlinks, we will just need to create a dictionary by using:

dictionary_degree = dict(G.degree)

This line of code returns a dictionary with the URLs and the number of connections each of them has (since the graph is undirected, this combines the links a page receives and the links it sends out). Finally, to assign the labels to, say, only the 5 most linked pages, we need to sort this dictionary and create a new dictionary specifically for the labels:

sort_dictionary_degree = dict(sorted(dictionary_degree.items(), key=lambda item: item[1], reverse = True))

counter = 0

for key, value in sort_dictionary_degree.items():
    if counter < 5:
        sort_dictionary_degree[key] = key
    else:
        sort_dictionary_degree[key] = ""
    
    counter = counter + 1
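Bear in mind that nx.Graph() is undirected, so the degree counts both the links a page sends and the links it receives. If you specifically want inbound links only, a small variation is to build a directed graph from the same list of tuples and use in_degree instead, something along these lines:

# Directed version: each edge goes from the source page to the destination
# page, so in_degree only counts the internal links a page receives
D = nx.DiGraph()
D.add_edges_from(noduplicates)
dictionary_indegree = dict(D.in_degree)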

Once we have the different elements which will be needed to make our diagram more visual and insightful, let’s put them into action:

pos = nx.spring_layout(G, k=0.3*1/np.sqrt(len(G.nodes())), iterations=20)
plt.figure(3, figsize=(30, 30))
dictionary_degree = dict(G.degree)


nx.draw(G,pos, with_labels=False, node_size = [10 + v * 300 for v in dictionary_degree.values()],node_color = list_colors, font_size = 15)
nx.draw_networkx_labels(G,pos,sort_dictionary_degree,font_size=25,font_color='r')

plt.savefig("<yourfilename.png>")

plt.show()

The output that I get from the previous piece of code is:

The 5 most linked pages are: https://www.danielherediamejias.com/international-trends-twitter-google/, https://www.danielherediamejias.com/python-scripts-seo/, https://www.danielherediamejias.com/scraping-the-travel-insights-tool-with-selenium/, https://www.danielherediamejias.com/gsc-crawl-report-python-selenium/ and https://www.danielherediamejias.com/website-categorization-python/. Nevertheless, the gap between https://www.danielherediamejias.com/international-trends-twitter-google/ and the rest of the pages is quite big: it has 42 links, while the other pages have around 20 each. That could help to mitigate the problem discovered in the first section, because although https://www.danielherediamejias.com/international-trends-twitter-google/ is the only page linking to the regional trends pages, it is also by far the most linked page.
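These counts can also be read directly from the degree dictionary; a quick sketch to print the five most connected pages along with their number of connections would be:

top_pages = sorted(dictionary_degree.items(), key=lambda item: item[1], reverse=True)[:5]
for url, degree in top_pages:
    print(degree, url)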

With the colors added, we can also see at first glance that most of the pages in the “outskirts” of the network are either non-indexable pages or other types of resources rather than HTML files.

3.- Contextual inlinks and understanding the relationship between sections

In this section we are basically going to draw a graph for the contextual inlinks and learn how to plot the relationship between two sections. To get the contextual inlinks, we can simply add a conditional statement while iterating over the all_inlinks list so that only links placed within the content are stored in the list of tuples. In my case I will also filter by absolute path, as I am more likely to insert links within the text using the absolute path:

listwithtuplescontent = []

for x in list_inlinks:
    if x[0] == "Hyperlink" and "www.danielherediamejias.com" in x[2] and x[13] == "Content" and x[11] == "Absolute":

        tuple_links = (x[1],x[2])
        listwithtuplescontent.append(tuple_links)

Now we eliminate the duplicates:

noduplicatescontent = list(dict.fromkeys(listwithtuplescontent))

Finally, we draw the diagram:

G=nx.Graph()
G.add_edges_from(noduplicatescontent)

pos = nx.spring_layout(G, k=0.3*1/np.sqrt(len(G.nodes())), iterations=20)
plt.figure(3, figsize=(30, 30))
dictionary_degree = dict(G.degree)


nx.draw(G,pos, with_labels=True, node_size = [10 + v * 300 for v in dictionary_degree.values()], font_size = 20)

plt.savefig("<yourfilename.png>")

plt.show()

The output which I have gotten is:

The only section which is internally well linked is the trends catcher. Regarding the blog posts, there is quite a lot of room for improvement: there are not many links, and considering that the topics are usually similar, there might be opportunities for good contextual links which could help to enhance the organic positioning of the articles and drive users to other articles, improving the user experience. In particular, we can see that some articles, like the one about getting the most out of the Page Speed Insights API, have no contextual links at all. That article should receive a few links to improve its organic performance as, from my point of view, it is an insightful article which might deserve to rank higher in the SERPs.
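If you want to surface these orphan articles programmatically, a quick sketch is to compare the HTML pages from the internal_html export (list_all, imported earlier) against the nodes of the contextual graph: any page that does not appear in the graph neither gives nor receives contextual links. The variable names below are just illustrative:

html_pages = [row[0] for row in list_all]
no_contextual_links = [page for page in html_pages if page not in G.nodes]
for page in no_contextual_links:
    print(page)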

This sort of cross-sectional internal links analysis can also be interesting if you run a blog to bring traffic to the website with the final goal of eventually redirecting that traffic to other areas of the site, as in the case of an e-commerce. Are you inserting enough links to send this traffic to a different area of the website? You can perform this analysis by using the right conditional statements when you create the list of tuples used to draw your diagram. For example:

listwithtuples = []

for x in list_inlinks:
    if x[0] == "Hyperlink" and "yourwebsitedomain" in x[2] and "/blog/" in x[1] and "/shop/" in x[2]:

        tuple_links = (x[1],x[2])
        listwithtuples.append(tuple_links)

4.- Incorporating other data sources: Semrush or Ahrefs

Finally, we are going to add other data sources such as Semrush or Ahrefs to our analysis in order to change the size of the nodes based on the number of keywords each page is ranking for in the SERPs, and on the page authority or the number of referring domains.

This is an interesting capability: in Screaming Frog you need access to the Ahrefs API to connect both tools, whereas with Python we can combine the two data sources using only the export files, which are available on a standard plan.

First, we are going to get two reports: from Semrush, the landing pages with their number of keywords; and from Ahrefs, the landing pages sorted by number of referring domains. Once we have these reports, we import them with Pandas and convert them into lists:

df = pd.read_csv('ahrefs best links.csv')
best_links = df.values.tolist()


df = pd.read_excel('Danielherediakeywords.xlsx')
list_kws = df.values.tolist()

Now, to assign the node size based on the number of backlinks, we have to iterate over this list and over the nodes of the full internal linking graph and create a dictionary with the number of referring domains for each page (note that in the previous section G only contained the contextual links, so we rebuild it first from noduplicates). If a page has no backlinks, we assign it the value 0. As we are using an index to pick the value, we could simply change the index from 4 to 1 to create the dictionary with the page authorities as the values for each page instead of the number of referring domains.

# We rebuild the full internal linking graph so that the node list matches
# the sizes we are about to compute (in the previous section G only
# contained the contextual links)
G=nx.Graph()
G.add_edges_from(noduplicates)

dictionary = {}
for y in G.nodes:
    found = False
    for x in best_links:
        if y == x[2]:
            found = True
            #dictionary[x[2]] = x[1] For the page authority
            dictionary[x[2]] = x[4]

    if found == False:
        dictionary[y] = 0
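As a side note, if you prefer to avoid the nested loops, the same dictionary can be built with a lookup created from the dataframe. This assumes, as in the loop above, that the URL sits in the third column of the Ahrefs export and the referring domains count in the fifth:

ahrefs_df = pd.read_csv('ahrefs best links.csv')
# URL -> referring domains lookup; pages without backlinks default to 0
lookup = dict(zip(ahrefs_df.iloc[:, 2], ahrefs_df.iloc[:, 4]))
dictionary = {node: lookup.get(node, 0) for node in G.nodes}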

If we would also like to show only the labels of the pages with the most backlinks in the diagram, we need to create a dictionary for the labels too:

sort_dictionary = dict(sorted(dictionary.items(), key=lambda item: item[1], reverse = True))
counter = 0

for key, value in sort_dictionary.items():
    if counter < 5:
        sort_dictionary[key] = key
    else:
        sort_dictionary[key] = ""
    
    counter = counter + 1

Finally, now we can draw the plot:

G=nx.Graph()
G.add_edges_from(noduplicates)

pos = nx.spring_layout(G, k=0.3*1/np.sqrt(len(G.nodes())), iterations=20)
plt.figure(3, figsize=(30, 30))

# Look the sizes up by node so that they always match the order of G.nodes
nx.draw(G,pos, with_labels=False, node_size = [10 + dictionary.get(node, 0) * 600 for node in G.nodes], font_size = 20)
nx.draw_networkx_labels(G,pos,sort_dictionary,font_size=25,font_color='r')

plt.savefig("yourimagename.png")

plt.show()

The output that I have gotten from running this piece of code is:

The pages with the most backlinks are https://www.danielherediamejias.com/pagespeed-insights-api-with-python/, https://www.danielherediamejias.com/, https://www.danielherediamejias.com/gsc-crawl-report-python-selenium/, https://www.danielherediamejias.com/scraping-on-instagram-with-instagram-scraper-and-python/ and https://www.danielherediamejias.com/facebook-scraping-and-sentiment-analysis-with-python/. However, https://www.danielherediamejias.com/pagespeed-insights-api-with-python/ is by far the page with the most backlinks, with 24. In contrast, as we saw before, its internal linking is very poor.
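To quantify this mismatch between external popularity and internal linking, a quick sketch can print both metrics side by side for the most backlinked pages, reusing the referring domains dictionary and the degree of the full graph:

internal_degree = dict(G.degree)
for url, ref_domains in sorted(dictionary.items(), key=lambda item: item[1], reverse=True)[:5]:
    print(url, "| referring domains:", ref_domains, "| internal connections:", internal_degree.get(url, 0))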

Finally, if we would like to size the nodes based on the number of keywords each landing page is ranking for, we only need to replicate the previous steps, this time with the export file where the landing pages are broken down by number of keywords.

dictionary = {}
for y in G.nodes:
    found = False
    for x in list_kws:
        if y == x[0]:
            found = True
            dictionary[x[0]] = x[2]

    if found == False:
        dictionary[y] = 0

sort_dictionary = dict(sorted(dictionary.items(), key=lambda item: item[1], reverse = True))
counter = 0

for key, value in sort_dictionary.items():
    if counter < 5:
        sort_dictionary[key] = key
    else:
        sort_dictionary[key] = ""
    
    counter = counter + 1

G=nx.Graph()
G.add_edges_from(noduplicates)

pos = nx.spring_layout(G, k=0.3*1/np.sqrt(len(G.nodes())), iterations=20)
plt.figure(3, figsize=(30, 30))

nx.draw(G,pos, with_labels=False, node_size = [10 + dictionary.get(node, 0) * 600 for node in G.nodes], font_size = 20)
nx.draw_networkx_labels(G,pos,sort_dictionary,font_size=25,font_color='r')

plt.savefig("yourimagename.png")

plt.show()

The output which I have gotten after running this piece of code is:

Indeed, the page that ranks for the most keywords, and the one that currently brings most of the traffic to my website, is the Instagram scraper article.

So that is all, folks. Thanks for reading this article and following along with the internal linking audit of my own website. After this exercise, I already have quite a few actionable points that I will look into shortly to improve the internal linking of my website!

FAQs section

Which libraries do you need?

You will need Pandas, NetworkX, Matplotlib and NumPy.

What will you learn in this post?

You will learn how to create plots with Python to analyze the internal linking structure of a website.

How long will it take?

The code is already available; what takes most of the time is the analysis itself.