In today’s post I am going to show you how to categorize a set of websites with Python, the Google Cloud NLP API and the Google Translate API. I believe this process can be useful for SEOs to enhance their off-page strategy, as it enables them to explore link-building opportunities and analyze backlink profiles based on topicality. However, it can be helpful not only for SEOs but also for other digital marketing activities: thanks to it you can find affiliation opportunities, or analyze the set of websites where your display ads are appearing in order to find the best-performing types of websites where you would like to increase your bid.

The first time I thought about creating a Python script to automate website categorization, I considered using a sort of WhoIs API. I happened to find this API, which seemed to work quite well, but unfortunately it returned very unreliable results for non-English websites, so I needed to be a bit more creative. Nevertheless, note that if you intend to categorize English websites, this API should work quite well and it makes the process much easier.

For non-English websites, the logic we are going to follow to categorize them successfully is:

  1. Scrape some of their content tags with the Python module called Cloud Scraper. We use the Cloud Scraper module to try to avoid being blocked by Cloudflare.
  2. Translate the scraped content with the googletrans Python module, as the Google NLP API can only process content in English. Note: if you need to translate many websites and want a stable workflow, it is better to use the official Google Cloud Translation module.
  3. Reduce the content for each website to a maximum of 1,000 characters, which should be enough for the Google NLP API to determine what the website is about.
  4. Use the Google NLP API to find the assigned category and its confidence score. The closer the confidence score is to 100%, the more accurate the categorization will be.
  5. Finally, read the output returned by the Google NLP API.

So now that we know how we are going to approach the website categorization, let’s put it into practice!

1.- Scraping the content

As mentioned in the workflow explanation, we will first scrape the content of those websites with the Cloud Scraper module. Afterwards, we will use Beautiful Soup to extract the tags that interest us: metatitle, metadescription, paragraphs and headers. Finally, we will put all this content together, prioritizing the tags by their relative importance within the text, in this order:

  1. Metatitle
  2. Metadescription
  3. Headers
  4. Paragraphs

And we will limit the text to 1,000 characters, as that should be enough to determine the website typology while avoiding consuming more than one unit (1,000 characters) per call to the Google Cloud NLP API.

In Python, this is translated as follows:

import cloudscraper
from bs4 import BeautifulSoup

# Cloud Scraper session with a Googlebot-like user agent
scraper = cloudscraper.create_scraper()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

try:
    r = scraper.get(<your website>, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Metatitle and metadescription
    title = soup.find('title').text
    description = soup.find('meta', attrs={'name': 'description'})
    if description is not None and description.get("content"):
        description = description.get("content")
    else:
        description = ""

    # Headers and paragraphs, joined with ". " between elements
    h1_all = ". ".join(h.text for h in soup.find_all('h1'))
    h2_all = ". ".join(h.text for h in soup.find_all('h2'))
    h3_all = ". ".join(h.text for h in soup.find_all('h3'))
    paragraphs_all = ". ".join(p.text for p in soup.find_all('p'))

    # Put everything together, most important tags first,
    # and keep only the first 1,000 characters
    allthecontent = str(title) + " " + str(description) + " " + h1_all + " " + h2_all + " " + h3_all + " " + paragraphs_all
    allthecontent = allthecontent[0:1000]

except Exception as e:
    print(e)

2.- Translating the content

In this second stage we are going to translate the content. For that we will use the googletrans module, although, as suggested in the first paragraphs, I would recommend the official Cloud Translation module if you are after a more stable solution. As googletrans can get blocked if too many calls are made in a short period of time, it is also advisable to use the time module to make the script sleep when it is run iteratively to categorize websites in bulk (10 seconds between calls should be enough for a fairly large batch of websites).

The content translation code looks like:

import time

from googletrans import Translator

translator = Translator()

try:
    # Translate the scraped content into English (googletrans defaults to English)
    # and keep the first 1,000 characters
    translation = translator.translate(allthecontent).text
    translation = str(translation)[0:1000]
    # Pause between calls so googletrans does not block the script when run in bulk
    time.sleep(10)

except Exception as e:
    print(e)
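If you prefer the official Google Cloud Translation client mentioned above, a minimal sketch of this same step could look like the following. It assumes the google-cloud-translate package (Basic/v2 edition) is installed and that you reuse the same Google Cloud credentials file; the exact setup may vary with the library version:

import os

# Official Google Cloud Translation client (Basic / v2 edition)
from google.cloud import translate_v2 as translate

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = <path to your credentials file>

translate_client = translate.Client()

# Translate the scraped content into English; the result is a dict
# with the translated text under "translatedText"
result = translate_client.translate(allthecontent, target_language="en")
translation = result["translatedText"][0:1000]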

3.- Using Google NLP API to categorize the website

Finally, we are going to categorize the translated text from the website with the Google NLP API. I got to know about this API thanks to this article published by Greg Bernhardt, so if you want to take a closer look at everything you can get out of this API, I encourage you to give it a read. It is important to create your API key on the Google Cloud platform before getting started with the website categorization.

To categorize the website we will use the following piece of code:

import os
from google.cloud import language_v1
# Note: the enums module belongs to google-cloud-language 1.x (it was removed in 2.x)
from google.cloud.language_v1 import enums

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = <path to your credentials file>

try:
    # Use the translated text, capped at 1,000 characters
    text_content = str(translation)[0:1000]

    client = language_v1.LanguageServiceClient()

    # Build the document to classify: plain text, in English
    type_ = enums.Document.Type.PLAIN_TEXT
    document = {"content": text_content, "type": type_, "language": "en"}

    # Classify the text and print the top category with its confidence score
    response = client.classify_text(document)
    print(response.categories[0].name)
    print(str(round(response.categories[0].confidence * 100)) + "%")

except Exception as e:
    print(e)
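Since the whole point is to run these three steps iteratively over a list of domains (the bulk mode mentioned in the translation section), here is a minimal sketch of how the pieces could be tied together. The helpers scrape_content, translate_to_english and classify_website are hypothetical wrappers around the three snippets above, not functions from any library, and the URLs are placeholders:

import time

# Hypothetical wrappers around the three snippets above:
#   scrape_content(url) -> str (scraped tags, capped at 1,000 characters)
#   translate_to_english(text) -> str (translated text)
#   classify_website(text) -> (category, confidence)

websites = ["https://www.example.com", "https://www.example.org"]

for url in websites:
    try:
        content = scrape_content(url)
        translation = translate_to_english(content)
        category, confidence = classify_website(translation)
        print(url + ": " + category + " (" + str(round(confidence * 100)) + "%)")
    except Exception as e:
        print(url + ": " + str(e))
    # Pause between websites to avoid googletrans blocking the script
    time.sleep(10)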

That is all! If we put this into practice and run the script with some sample websites…

  • Marca.com: Spanish sports newspaper. Categorized with the NLP API as News/Sports News with 79% confidence.
  • Happypancake.se: Swedish dating website. Categorized with the NLP API as Online Communities/Dating & Personals with 99% confidence.
  • Mavcsoport.hu: Hungarian website dedicated to selling train tickets. Categorized with the NLP API as Travel/Bus & Rail with 60% confidence.
  • Ilcasalingodivoghera.it: Italian food website. Categorized with the NLP API as Food & Drink with 76% confidence.
  • Autoscout24.de: German vehicles website. Categorized with the NLP API as Autos & Vehicles/Vehicle Shopping/Used Vehicles with 96% confidence.

Pretty convincing and reliable results for different sectors and languages, aren’t they!?