Since John Mueller confirmed that bold text has some SEO benefit, because it helps Google to understand the page better, there has been quite a lot of discussion about it. In this post I am going to show you how you can use a TF IDF model to find the main words from a group of pages and/or articles so that you can bold them.

But first of all, let’s clarify a bit what the TF IDF logic is and how the model works. In short, our TF IDF model will score every word of an article, giving a higher score to those terms that appear often in our page but rarely in the other articles. With this logic, if the sample of documents or articles is large enough, it should give a low score to stop words like articles and connectors and highlight the actual principal terms. If you would like to know more about TF IDF and its technicalities you can check this article that Koray Tuğberk recently posted.

So, considering how a TF IDF model works, what we are going to do in this article is scrape the content from most of my blog posts, create our own TF IDF model and obtain the main keywords to be bolded. It is important to mention that, as the “seed” articles will all be very specialized in one field (specifically SEO and Python), the model will not give a high score to terms such as Python. This is positive because it helps to differentiate the articles and highlight the main “gap” terms.

Creating our own TF IDF model is a technique that will work very well for already clustered and specialized pieces of content. Otherwise, for example if we had used a generic TF IDF model, it would highlight some terms that are uncommon in everyday language, like Python, but very common among my articles.
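To make this intuition more concrete, here is a minimal sketch on a toy corpus. It uses plain Python with whitespace tokenization instead of TextBlob, the three sample documents are invented for illustration, and the scoring follows the same formula that is used later in this post (term frequency multiplied by log(N / (1 + number of documents containing the word))).

```python
import math

# Hypothetical mini-corpus: every document mentions "python",
# but only the first one mentions "sitemap".
docs = [
    "python sitemap audit with python",
    "python redirects checker",
    "python keyword clustering",
]
tokenized = [doc.split() for doc in docs]

def tf(word, words):
    # Share of the document's tokens that are this word
    return words.count(word) / len(words)

def idf(word, corpus):
    # Words found in many documents get a low (even negative) value
    containing = sum(1 for words in corpus if word in words)
    return math.log(len(corpus) / (1 + containing))

def tfidf(word, words, corpus):
    return tf(word, words) * idf(word, corpus)

first = tokenized[0]
scores = {word: tfidf(word, first, tokenized) for word in set(first)}
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(word, round(score, 3))
```

Here python comes out last with a negative score because it appears in every document, while sitemap, audit and with tie at the top. With only three tiny documents a connector like "with" is not filtered out yet, which is why the sample of articles needs to be large enough.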

1.- Scraping the content from the pages

First, we will scrape the <p> content from the pages with cloudscraper and BeautifulSoup, process these texts with TextBlob and store the processed texts in a list. In the code below, you need to insert the pages that you would like to use for your model as a list:

import cloudscraper
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

list_pages = [<insert your pages in a list format>]

scraper = cloudscraper.create_scraper() 
list_content = []

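for x in list_pages:
    content = ""
    html = scraper.get(x)
    # Name the parser explicitly so that bs4 does not have to guess one
    soup = BeautifulSoup(html.text, "html.parser")
    for y in soup.find_all('p'):
        content = content + " " + y.text.lower()
    # Store the page text as a TextBlob so that .words is available later
    list_content.append(tb(content))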

2.- Obtaining the main terms with TF IDF

Now that we have the content, we can compute the term frequencies, train our model and obtain the main terms for each of the pages. I learnt how to use the TF IDF model with TextBlob from this article written by Steven Loria.

First we declare our functions:

import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
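One thing to be aware of is that n_containing rescans every document each time a word is scored, which can get slow as the number of pages grows. A possible alternative, sketched here under the assumption that each document is available as a plain list of tokens, is to count the document frequencies once up front. The names document_frequencies and tfidf_scores are made up for this sketch and are not part of TextBlob; the formula is the same as in the functions above.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Count, for every word, how many documents contain it at least once."""
    doc_freq = Counter()
    for words in corpus:
        doc_freq.update(set(words))  # set() so each document counts once per word
    return doc_freq

def tfidf_scores(words, corpus_size, doc_freq):
    """Return {word: score} for one document, given the precomputed counts."""
    counts = Counter(words)
    total = len(words)
    return {
        word: (count / total) * math.log(corpus_size / (1 + doc_freq[word]))
        for word, count in counts.items()
    }

# Toy input: lists of tokens standing in for blob.words
corpus = [["seo", "python", "sitemap"], ["seo", "python"], ["python", "redirects"]]
doc_freq = document_frequencies(corpus)
scores = tfidf_scores(corpus[0], len(corpus), doc_freq)
```

With the scraped content, the corpus could be built as [blob.words for blob in list_content], assuming that the list holds TextBlob objects.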

And now, we can iterate over the list with the page contents, obtain the top 5 terms for each page and store them in a list that we will use later to export to Excel:

list_words_scores = [["URL","Word","TF-IDF score"]]
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
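    for word, score in sorted_words[:5]:
        # list_content follows the order of list_pages, so i gives the URL
        list_words_scores.append([list_pages[i], word, score])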

3.- Exporting as an Excel file

Finally, we can export the results as an Excel file with Pandas:

import pandas as pd
df = pd.DataFrame(list_words_scores)
df.to_excel('<filename>.xlsx', header=False, index=False)

It will create an Excel file with three columns for the URL, the word and its TF IDF score (the higher the score, the more relevant the term is for that page).
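Once you have the main terms for a page, the bolding itself could be sketched as follows. This is only an illustration: the bold_terms helper and the sample paragraph are made up for this example, and it assumes that you are working on the plain text of a paragraph, since running it over raw HTML could also match words inside tags or attributes.

```python
import re

def bold_terms(text, terms):
    """Wrap whole-word, case-insensitive matches of each term in <strong> tags."""
    if not terms:
        return text
    # Longest terms first so that a longer term wins over a shorter one it contains
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Keep the original casing of each match and only add the tags around it
    return pattern.sub(lambda match: f"<strong>{match.group(0)}</strong>", text)

paragraph = "A sitemap helps crawlers. Submit your Sitemap in Search Console."
print(bold_terms(paragraph, ["sitemap", "crawlers"]))
# → A <strong>sitemap</strong> helps <strong>crawlers</strong>. Submit your <strong>Sitemap</strong> in Search Console.
```

This version bolds every occurrence. If you prefer to bold only the first mention of each term, you would need to handle the terms one by one and pass count=1 to the substitution.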

4.- Try the Google Colab notebook

You can try the Google Colab Notebook to find your most relevant terms over here! You will be prompted to provide your group of pages through a sitemap, and you will also need to grant Google Colab access to your Drive so that it can export the main terms as an Excel file to Drive.
