In today’s post I am going to show you a very easy trick to create a txt snippet that you can use in your htaccess file to set pages as non-indexable based on their performance. The logic that we will use is:

  • We use Screaming Frog connected to the Google Search Console API to crawl our website and get the pages that do not show up in Google Search Console data.
  • We export those pages as an Excel file.
  • We run a very simple Python script to create a txt snippet that we can paste into our htaccess file to serve a noindex directive in the HTTP response (shown below), preventing Googlebot from indexing (and, in most cases, rendering) the underperforming pages.

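For reference, the noindex directive is served here as an HTTP response header rather than as a meta robots tag in the HTML, so the response for one of the excluded pages would include a header like this (a hypothetical response, just to illustrate the idea):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
X-Robots-Tag: noindex
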
Does this sound interesting? Then let’s get started!

1.- Getting the underperforming pages with Screaming Frog

The first thing that we need to do is to connect Screaming Frog to our Google Search Console account so that we can obtain a report with the URLs that are found in the crawl but not in GSC’s data. We can access this feature in the navigation menu under API Access –> Google Search Console:

After that, Screaming Frog will prompt us to connect to our Google Search Console account. We just need to click on New account and log in. Once we are logged in, we need to select the property that corresponds to the site that we are going to crawl.

Another important point is that we can extend the date range of the data in the Date Range tab if needed, so that we work with a bigger volume of data and avoid marking as non-indexable URLs that only lack impressions because of seasonal effects.

Once everything is set up, we just need to run the crawl and, when it is finished, export the report called “No GSC data”, which can be found in the sidebar under the Search Console section.

2.- Creating the TXT snippets with Python

This article from Yoast inspired me to create these txt snippets to be added to the htaccess file, as I thought that with Python we could very easily iterate over the list of pages without GSC data and write an exclusion rule for each of them. However, even if this article can serve as inspiration to automate the pruning task, I would recommend that SEOs without much technical knowledge get help from a developer, as the htaccess is a very sensitive file that can break your site if something is not inserted correctly, although the logic can still be used. If you would still like to take some risks, you can test the htaccess file with an htaccess validator, as I recommended in this article about page to page redirects with Python and htaccess.

Another important thing to mention when pruning your website based on underperforming pages is that some pages might not be performing well due to technical issues or other reasons, so before deindexing the URLs, take some time to look at the type of URLs that you are about to set as non-indexable and identify why they do not rank well. Pruning is mainly recommended for sites with lots of thin-content pages, so if your pages are not performing well due to thin content and there is no intention of extending their content, it might be a good decision to exclude them. However, if you believe that the content on your pages is of high quality, it is very likely that there are other issues, or it is just a matter of time until Google values that content if it has been published recently.

After this small disclaimer, let’s go over the code. First we import the document with the URLs without GSC data using Pandas, and we exclude all the URLs that do not have a 200 response code because they are already not indexable.

import pandas as pd

# Read the "No GSC data" export from Screaming Frog
file_name = 'search_console_no_gsc_data.xlsx'
df = pd.read_excel(file_name)

# Keep only the URLs that return a 200 status code; the rest are already not indexable
df_200 = df.loc[df['Status Code'] == 200]
# Each row becomes a list of column values; the URL (Address column) is the first value
list_200 = df_200.values.tolist()
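
Before building the snippet, it does not hurt to run a quick sanity check to see how many URLs will get the noindex header and to confirm that the URL really is the first value of each row (it corresponds to the Address column in a standard Screaming Frog export). A minimal check, assuming the variables above, could be:

# Number of URLs that will receive the noindex header
print(len(list_200))

# First URL of the list, to confirm that the Address column comes first
if list_200:
    print(list_200[0][0])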

Once the data is imported, we can just iterate over the URLs and create the text snippet with a for loop. We will also make use of urllib.parse to break down the URLs and get only the relative path, which is what we need to add to the htaccess file. The code to create the snippet for Apache servers is:

from urllib.parse import urlparse

# Build one FilesMatch block per URL, using the relative path without the leading slash
text = ""
for x in list_200:
    text = text + '<FilesMatch "' + urlparse(x[0]).path[1:] + '">\nHeader set X-Robots-Tag "noindex"\n</FilesMatch>\n'

The equivalent piece of code to create this txt snippet for Nginx servers is:

from urllib.parse import urlparse

# Build one exact-match location block per URL.
# Nginx matches locations against the full URI, so here we keep the leading slash.
text = ""
for x in list_200:
    text = text + '''
location = ''' + urlparse(x[0]).path + ''' {
    add_header  X-Robots-Tag "noindex";
}
'''
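
Keep in mind that Nginx does not use htaccess files, so this version of the snippet would need to be added inside the server block of your Nginx configuration instead. For a hypothetical path like /old-category/thin-page/, each generated block would look like this:

location = /old-category/thin-page/ {
    add_header  X-Robots-Tag "noindex";
}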

Finally, in both cases we can export the snippet with:

# Write the generated snippet to a txt file
with open("Pruning.txt", 'w') as file_:
    file_.write(text)

If everything goes well, you will export a txt file whose contents can be pasted into your htaccess file and should look like this:
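
For example, for two hypothetical URLs with the paths /old-category/thin-page/ and /tag/duplicate-tag/, the Apache version of the generated file would contain blocks like these:

<FilesMatch "old-category/thin-page/">
Header set X-Robots-Tag "noindex"
</FilesMatch>
<FilesMatch "tag/duplicate-tag/">
Header set X-Robots-Tag "noindex"
</FilesMatch>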

That is all, folks. I hope you found this article useful!