Python can be a very useful tool to scrape and extract on-page data. In this post I am going to share with you the most common Python snippets to extract the most important information from a page from an SEO perspective.

1.- Making the web request

First, before parsing the HTML code of a page to obtain the data we are interested in, we need to make a request to the URL that we would like to scrape. The library that I usually use for this type of request is cloudscraper, which works in a very similar way to Requests but is much better at accessing websites that use Cloudflare without being banned. If you are interested in web scraping with Python, you can also read this article where I explain 6 tricks for basic web scraping with Python.

Once we access the URL with cloudscraper, we will use BeautifulSoup to parse the HTML code and obtain the SEO data.

import cloudscraper
from bs4 import BeautifulSoup

#Create a scraper object that can handle Cloudflare's challenges
scraper = cloudscraper.create_scraper()

#Request the page and parse its HTML code
html = scraper.get("<your_url>")
soup = BeautifulSoup(html.text, 'html.parser')
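
Before parsing, it can also be worth checking that the request actually succeeded. A minimal sketch, reusing the html response object from above:

#Warn if the page did not return a 200 status code
if html.status_code != 200:
    print("The request failed with status code: " + str(html.status_code))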

2.- Metas scraping

2.1.- Metatitle

The metatitle is one of the main metas, as all SEOs are already aware, since it is displayed on the search snippet and it can be optimized for specific keywords. The metatitle can be obtained with this piece of code:

metatitle = soup.find('title').get_text()

2.2.- Metadescription

Metadescriptions can be used to briefly explain what your page is about and they will also appear on the search snippet. The metadescription can be obtained with this piece of code:

metadescription = soup.find('meta',attrs={'name':'description'})["content"]
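
Note that soup.find() returns None when a tag is not present, so accessing ["content"] directly will raise an error on pages without a metadescription. A small sketch of a helper (the function name is just an example) to handle that case:

def get_meta_content(soup, name):
    #Return the content attribute of a meta tag, or an empty string if the tag is missing
    tag = soup.find('meta', attrs={'name': name})
    return tag["content"] if tag else ""

metadescription = get_meta_content(soup, "description")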

2.3.- Robots

The meta robots is a very important SEO tag, as it specifies whether the page can be indexed and gives some directives about how the page should be shown on the SERPs. Directives such as noindex, index, follow, nofollow, noarchive, nosnippet, notranslate, noimageindex or unavailable_after can be found here, among others. If you are not familiar with some of these robots directives, you can find them all with their explanations over here.

The snippet that we need to use to extract this data is as follows and it will return a list with all the directives from the content attribute, which are separated by commas:

robots_directives = soup.find('meta',attrs={'name':'robots'})["content"].split(",")
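
As a quick usage example (just a sketch), we can strip the whitespace from each directive and check whether the page is indexable:

robots_directives = [directive.strip() for directive in robots_directives]
#The page is indexable as long as no noindex directive is present
is_indexable = "noindex" not in robots_directives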

2.4.- Viewport

The meta viewport indicates how the page should be sized and scaled on different devices and it is mainly important for mobile friendliness purposes. The meta viewport can be extracted with this snippet:

viewport = soup.find('meta',attrs={'name':'viewport'})["content"]

2.5.- Charset

The meta charset indicates the character encoding to search engine bots. It is especially useful for pages which are not written in English and might contain special characters that do not exist in English, which is common when working on international SEO.

The meta charset can be found with this snippet:

charset = soup.find('meta',attrs={'charset':True})["charset"]

2.6.- HTML language

The HTML lang attribute is not a meta itself, but it can be an interesting element when working on international SEO, as it gives search engine bots some hints about which language (and, optionally, which region) the page is intended to target.

We can get this element with the following snippet:

html_language = soup.find('html')["lang"]

3.- Alternates and canonicals scraping

3.1.- Canonical

The rel="canonical" tag indicates to Google which page should be indexed. It can point to the page itself if that page is to be indexed, or to another page if that other page is meant to be indexed instead. It is important to mention that Google takes canonicals as recommendations, so it might disregard the canonical indication.

Canonical URLs can be obtained with this snippet:

canonical = soup.find('link',attrs={'rel':'canonical'})["href"]
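
As a simple usage example (a sketch, assuming the requested URL is stored in a variable), we can check whether the page is self-canonicalized or points somewhere else:

url = "<your_url>"
#True if the canonical points to the page itself
is_self_canonical = canonical == url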

3.2.- Hreflangs

Hreflangs are especially important when a website has different language versions, as they indicate to search engines which version should be indexed and shown in each country's SERPs.

The snippet that needs to be used to extract hreflangs is as follows and it will return a list with the provided page for each language and its language/country code:

list_hreflangs = [[a['href'], a["hreflang"]] for a in soup.find_all('link', href=True, hreflang=True)]
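
For example, we can print each language/country code next to its URL to review the hreflang set at a glance:

for url, hreflang in list_hreflangs:
    print(hreflang + ": " + url)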

3.3.- Mobile alternates

Mobile alternates are used when a page has a mobile version which is hosted on a different URL. This is mainly the case for some non-responsive websites which do not use dynamic serving and have a separate mobile subdomain.

The mobile alternates can be extracted with:

mobile_alternate = soup.find('link',attrs={'media':'only screen and (max-width: 640px)'})["href"]

4.- Schema mark-up scraping

4.1.- Quick schema mark-up overview

It is especially easy to extract and analyze schema mark-up if it has been injected in JSON-LD format. We can extract the whole script and then analyze it as if it were a Python dictionary. First, we need to find the script with BeautifulSoup and parse it with the json module.

import json

#Note that soup.find only returns the first JSON-LD script on the page
json_schema = soup.find('script',attrs={'type':'application/ld+json'})
json_file = json.loads(json_schema.get_text())

After this, we can iterate over the schema mark-up and see at first glance which types of mark-up are being used by that page. Note that this assumes the mark-up is grouped under an "@graph" node, which is quite common but not universal:

for x in json_file["@graph"]:
    print(x["@type"])

For instance, on a page whose mark-up is grouped under an "@graph" node, the output might look something like this (purely a hypothetical example):
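
Article
BreadcrumbList
WebSite
Organization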

4.2.- Breadcrumbs

If the page has a BreadcrumbList schema mark-up, we can use Python to extract the URLs provided in the mark-up and get an idea of its parent pages and how deep it sits in the internal structure:

#Find the BreadcrumbList node instead of relying on its position within "@graph"
breadcrumb_list = next(x for x in json_file["@graph"] if x["@type"] == "BreadcrumbList")
breadcrumb_urls = [[x["position"],x["item"]] if "item" in x else [x["position"],"Final URL"] for x in breadcrumb_list["itemListElement"]]
breadcrumb_depth = len(breadcrumb_urls)

5.- Content scraping

5.1.- Text

5.1.1.- Paragraphs

The following snippet will scrape the paragraph texts ("p") and return a list with all of them. In addition, we can calculate the number of characters in these texts:

paragraph = [a.get_text() for a in soup.find_all('p')]
#Text length
text_length = sum([len(a) for a in paragraph])
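
Similarly, if the word count is more useful to you than the character count, a quick sketch based on the same list:

#Word count based on the scraped paragraphs
word_count = sum([len(a.split()) for a in paragraph])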

5.1.2.- Headings

BeautifulSoup enables us to extract a specific type of heading, let's say for instance H1, but it also enables us to extract several types of tags at once by using the find_all method and passing a list with all the tags that we would like to extract. Translating this into code, it would be:

h1 = [a.get_text() for a in soup.find_all('h1')]
headers = soup.find_all(["h1","h2","h3","h4","h5","h6"])

#Cleaning the headers list to get the tag name and the text as different elements in a list
list_headers = [[x.name, x.get_text()] for x in headers]

With Jupyter Notebook we can render the HTML code and actually see what the headings hierarchy of that page looks like at first sight, as the font size will be smaller or larger depending on the heading type:

from IPython.display import display, HTML
for x in headers:
    display(HTML(str(x)))

For example, running this on my article about scraping the SERPs with Oxylabs API renders its headings with their native sizes, so the structure of the page can be seen at a glance.

5.2.- Images

Something that might be interesting is scraping the image URLs from a page to get an idea of how many images are being used, together with their alt texts to see if they are properly optimized.

With this piece of code we can get all the image URLs from that page and their alt texts in a list format:

#Using .get avoids errors when an image is missing the src or alt attribute
images = [[a.get("src",""),a.get("alt","")] for a in soup.find_all('img')]
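
From this list we can, for example, quickly flag the images that are missing their alt text (a small sketch based on the list above):

images_missing_alt = [image[0] for image in images if image[1] == ""]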

5.3.- Links

We can also extract the links and obtain two lists that will contain the internal and external links with their anchor texts and whether they are follow or nofollow. Remember to replace <your_domain> with your own domain:

internal_links = [[a.get_text(), a["href"], "nofollow"] if "nofollow" in str(a) else [a.get_text(), a["href"], "follow"] for a in soup.find_all('a', href=True) if "<your_domain>" in a["href"] or a["href"].startswith("/")]
external_links = [[a.get_text(), a["href"], "nofollow"] if "nofollow" in str(a) else [a.get_text(), a["href"], "follow"] for a in soup.find_all('a', href=True) if "<your_domain>" not in a["href"] and not a["href"].startswith("/")]

#To get the number of links
number_internal_links = len(internal_links)
number_external_links = len(external_links)
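
As an additional sketch, we could also count how many of the internal links are nofollowed:

number_nofollow_internal_links = len([link for link in internal_links if link[2] == "nofollow"])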

On the other hand, we can also differentiate our links depending on where they are nested. For instance, if the links are nested under a paragraph or a heading tag, we can assume that they are contextual links, whereas if they are nested under div or span tags, we can assume that they are not. Bear in mind that, since paragraphs are usually nested inside divs, some links may show up in both lists.

contextual_links = [[a.get_text(), a["href"], "nofollow"] if "nofollow" in str(a) else [a.get_text(), a["href"], "follow"] for x in soup.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6"]) for a in x.find_all('a', href=True)]
div_links = [[a.get_text(), a["href"], "nofollow"] if "nofollow" in str(a) else [a.get_text(), a["href"], "follow"] for x in soup.find_all(["div","span"]) for a in x.find_all('a', href=True)]

6.- Open Graph scraping

Last but not least, we can scrape the Open Graph tags and create a list that will contain each Open Graph property and its content:

open_graph = [[a["property"].replace("og:",""),a["content"]] for a in soup.select("meta[property^=og]")]

For example, the resulting list might look something like this (purely a hypothetical example):
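
[['locale', 'en_US'], ['type', 'article'], ['title', 'Example page title'], ['url', 'https://www.example.com/']]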

That is all, folks! I hope that you find this guide helpful, and if you happen to think of an element that I might have neglected, just let me know and I will add it to the guide!