Wikipedia, that holy source of wisdom and information, presents many good opportunities to boost the SEO performance of your websites. In today’s post we are going to use the Wikipedia API with Python for SEO to automate the following processes:

  1. Finding entities.
  2. Finding linkbuilding opportunities.
  3. Getting inspiration for content creation.

Are you curious about how we can use Wikipedia for SEO with Python? Let’s get started and I’ll show you how!

1.- Getting familiar with the Wikipedia API

To start with, we are going to get familiar with this API and go through the main methods that we will use and apply to SEO in the next sections.

The first step is installing the Wikipedia library:

pip install wikipedia

And importing it in our notebook to start using it:

import wikipedia

Now that we have already imported the library, we can start using the main methods. I have summarized the methods that we are going to use for SEO purposes in the following list:

  1. wikipedia.set_lang("yourlanguage"): first we need to set the Wikipedia language version that we would like to access.
  2. wikipedia.search("yourquery"): this method will return a list with the suggested Wikipedia pages that are related to the entered query.
  3. wikipedia.summary("wikipediasuggestion"): this will get the summary of a page. It is important to use one of the suggested pages from the list returned by wikipedia.search("yourquery"); if your argument does not match any existing page, the API will not be able to retrieve any data.
  4. wikipedia.page("wikipediasuggestion"): this will get the content of a page. Again, the argument needs to be an existing page. We can specify the piece of content that we would like to access:
    1. Whole HTML: with wikipedia.page("wikipediasuggestion").html()
    2. Content: with wikipedia.page("wikipediasuggestion").content
    3. References: with wikipedia.page("wikipediasuggestion").references
    4. Links: with wikipedia.page("wikipediasuggestion").links
    5. URL: with wikipedia.page("wikipediasuggestion").url

Even though we can specify the piece of content that we would like to get with wikipedia.page, this function is quite limited, so whenever we need a specific piece of content we are going to retrieve the whole HTML code and parse it with BeautifulSoup.
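Before moving on, here is a minimal sketch that puts these methods together; "yourquery" is just a placeholder and using the first suggestion is only illustrative:

import wikipedia

# Select the Wikipedia language version you want to work with
wikipedia.set_lang("en")

# Get a list of suggested pages for a query
suggestions = wikipedia.search("yourquery")

# Use one of the suggested titles to retrieve its summary and page data
print(wikipedia.summary(suggestions[0]))
page = wikipedia.page(suggestions[0])
print(page.url)
print(page.links[0:10])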

2.- Finding Search Entities

Finding search entities is very easy. Basically, the only thing you need to do is to use the search method with your query as an argument. This will return a list of suggested entities from which you can choose the one that best matches your intent.

Let’s run an example with the query: “Spurs”.

suggestions = wikipedia.search("Spurs")

This will return a list of suggested entities. For the query Spurs, the list contains two basketball teams, a football team, the term “spur”, and so on.

If you are interested in finding entities, you can also use the Google Knowledge API and the Google Spreadsheet that I created, which makes it easier.
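If you want to explore that option, below is a minimal sketch of an entity lookup against Google's Knowledge Graph Search API with the Requests library; api_key is a placeholder and assumes you have created your own key in the Google Cloud console:

import requests

# Minimal sketch: look up entities for a query in the Knowledge Graph Search API
api_key = "YOUR_API_KEY"
response = requests.get(
    "https://kgsearch.googleapis.com/v1/entities:search",
    params={"query": "Spurs", "key": api_key, "limit": 5, "languages": "en"}
)
for element in response.json().get("itemListElement", []):
    result = element.get("result", {})
    print(result.get("name"), "-", result.get("description"))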

3.- Finding linkbuilding opportunities

In order to find linkbuilding opportunities we will mainly use two tactics:

  • Second tier links: we will scrape all the links from a page and store them in a list. The pages that are linked from the Wikipedia page could be valuable, and we could indirectly benefit from Wikipedia’s link authority if we manage to place a backlink pointing to our website on any of these linked pages.
  • Direct links from Wikipedia: we will check the status code of the extracted links and, if any of them returns a 404 response code, we could create a page with the required information and ask Wikipedia to link to our page instead.

So, let’s first create the list with the links. For that, we will retrieve the whole HTML code, parse it with BeautifulSoup and store the anchor texts and the targeted pages in a list.

from bs4 import BeautifulSoup

# Retrieve the full HTML of the page and parse it
html_page = wikipedia.page("yourquery").html()
soup = BeautifulSoup(html_page, "lxml")

# Two lists: anchor text + URL, and URLs only
list_links = []
list_links_2 = []
for link in soup.find_all('a', href=True):
    # Keep only absolute URLs pointing outside Wikipedia
    if "http" in link['href'] and "wikipedia.org" not in link['href']:
        list_links.append([link.text, link['href']])
        list_links_2.append(link['href'])

As we are interested only in the outlinks, we use a conditional statement that excludes internal links. We create two lists: the first one with the anchor texts and the links, and the second one with only the links. Later on, we will use the second list to separate the contextual links from the reference links.

references = wikipedia.page("yourquery").references

# The first reference URL marks where the contextual links end and the reference links start
list_no_references = list_links[0:list_links_2.index(references[0])]
list_references = list_links[list_links_2.index(references[0]):len(list_links)]

This piece of code splits the initial list into two lists: list_no_references will contain the contextual links and their anchor texts, and list_references will contain the reference links with their anchor texts.

Now that we have the list with the links, we can use a tool like URL Profiler to extract email addresses and/or contact pages, or we can even use a simple Python scraper that tries to find email addresses with a regular expression. This is very similar to what we did in the Instagram scraping article to extract email addresses from Instagram profiles. The final intention of extracting these email addresses or contact pages is to get in touch with the webmasters and try to place some backlinks pointing to our website.

We will not consider PDF files because they are usually very long and it takes the parser a long time to process the entire content.

import re
import requests

list_emails = []
for x in list_no_references:
    # Skip PDF files, which are slow to download and parse
    if ".pdf" not in x[1]:
        try:
            html = requests.get(x[1])
            soup = BeautifulSoup(html.text, "lxml").get_text()
            # Look for email addresses in the visible text
            emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", soup, re.I)
            list_emails.append([x[1], emails])
        except Exception as e:
            print(e)
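After the loop finishes, a quick way to keep only the linked pages where the regular expression actually matched something could be:

# Keep only the pages where at least one email address was found
pages_with_emails = [x for x in list_emails if x[1]]
print(pages_with_emails)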

Similarly to the previous process, we can also use the Requests library to retrieve the response codes of the linked pages and see if any of them returns a 404 response code, with the purpose of asking Wikipedia to replace that link with a link targeting our website.

list_response_codes = []
for x in list_no_references:
    try:
        # Request each linked page and store its status code
        url = requests.get(x[1])
        list_response_codes.append([x[1], url.status_code])
    except Exception as e:
        print(e)
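From that list, we can easily isolate the links that are returning a 404 response code, for example:

# Keep only the links returning a 404 response code
broken_links = [x for x in list_response_codes if x[1] == 404]
print(broken_links)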

In case we find a page which is returning a 404 response code, we can use the Wayback Machine to see if that page happens to be archived, so that we can replicate the content that used to be on it and claim the backlink from Wikipedia.
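One way to check this programmatically is the Wayback Machine availability endpoint; the snippet below is just a minimal sketch that assumes the broken_links list built above:

# Minimal sketch: check the Wayback Machine for archived snapshots of the broken links
for x in broken_links:
    wayback = requests.get("https://archive.org/wayback/available", params={"url": x[0]})
    snapshot = wayback.json().get("archived_snapshots", {}).get("closest")
    if snapshot:
        print(x[0], "->", snapshot.get("url"))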

4.- Getting inspiration for content creation

Finally, let’s say for instance that you would like to write an article about the basketball team San Antonio Spurs. Then, you might be interested in getting to know the main highlights from the Wikipedia page and the most used terms on it, to make sure you do not forget to mention something that might be relevant for the users who are looking for information about San Antonio Spurs.

In order to scrape the table of contents, we will use the whole HTML code and select the table of contents spans with BeautifulSoup by using the class “toctext”.

html_page = wikipedia.page("San Antonio Spurs").html()
soup = BeautifulSoup(html_page, "lxml")
# The section titles of the table of contents use the class "toctext"
content_table = soup.findAll("span", {"class": "toctext"})

content_table_clean = []
for x in content_table:
    content_table_clean.append(x.text)

If we would like to find the most used two-word terms, we can use a for loop that iterates over the whole content, builds a dictionary with the terms and their counts, and sorts them. This is something I explained in a previous article I published about easy and useful tricks for SEO with Python. In addition, we use a stopwords list to prevent generic terms from being added to the dictionary.

count_words2 = dict()

content = wikipedia.page("San Antonio Spurs").content
words = content.split(" ")

stoplist = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
 
counter = 0
for word in words:
    # The first word has no preceding word to pair with, so skip it
    if counter != 0:
        # Discard the bigram if either of its words is a stopword
        stop_presence = False
        for stopword in stoplist:
            if old_word.lower() == stopword or word.lower() == stopword:
                stop_presence = True

        if stop_presence == False:
            # Count the occurrence of the two-word term
            if old_word + " " + word in count_words2:
                count_words2[old_word + " " + word] += 1
            else:
                count_words2[old_word + " " + word] = 1

    old_word = word
    counter = counter + 1

# Sort the two-word terms by frequency and keep the 20 most used ones
sorted_count = sorted(count_words2.items(), key=lambda kv: kv[1], reverse=True)[0:20]

This will return a list with the 20 most used two-word terms on the page.

That’s all, folks! I hope you found this article interesting and helpful. Get in touch with me if you would like to share your feedback or if any step is unclear.

FAQs section

Which libraries do you need?

You will need wikipedia, BeautifulSoup (bs4), requests and re.

What will you learn in this post?

You will learn how to use the Wikipedia API and Python to find entities, find linkbuilding opportunities and get inspiration for content creation.

How long will it take?

Only around 5-10 minutes.