In today’s post I am going to show you how you can use the Copyscape API with Python to detect duplicate or plagiarized content on your website, mainly to:

  • Make sure that nobody is plagiarizing the content on your website for their own benefit.
  • Make sure that the content on your website is original and unique, which helps your SEO performance. I have especially seen this problem affect e-commerce sites that use the product cards provided by their suppliers, while their competitors use the very same product cards.

Basically, what we are going to do is read an Excel file where I have saved several URLs from an e-commerce site that are likely to have duplicate content, use the Copyscape API to retrieve the pages from other websites that share the same content, and save all of those pages in an Excel file together with their text snippets, the number of matched words and the matching percentage.

Does this sound interesting? Then let’s get started!

1.- Getting familiar with Copyscape

Copyscape is a tool for spotting external duplicate content, which is very useful for SEO and legal matters. It has a freemium version that lets you manually check a page and get up to 10 results that might contain duplicate content.

If you do not have many URLs to check, the freemium version can be a good solution. However, if you need to check a lot of URLs, you might want to automate the process with the API. Unfortunately, the API is not free, but I think its price is quite competitive, and you can cap the amount of money you are willing to spend on each page (in my case I set the limit at 0.30 dollars per page).

In fact, before getting started with the Copyscape API I strongly recommend that you go to the user settings, cap the spend per page and add those domains that you do not want to be reported as duplicate content.

In order to get your API username and password, you will need to go to the “Premium API” page and scroll down a little to the section called “Your API key”. Remember that this API is not free, so you will also need to make a payment before being able to play with it.
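
By the way, rather than hard-coding the username and password in the script, you can load them from environment variables; here is a minimal sketch (the variable names COPYSCAPE_USER and COPYSCAPE_KEY are my own choice and assume you have exported them in your shell beforehand):

import os

# Assumes COPYSCAPE_USER and COPYSCAPE_KEY were exported in your shell first
username = os.environ["COPYSCAPE_USER"]
password = os.environ["COPYSCAPE_KEY"]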

2.- Reading the Excel file

As I mentioned before, I have an Excel file with the URLs that I need to check, stored in the first column.

So first I will need to import this Excel file into my notebook. For that, I am going to use pandas and transform the DataFrame into a list:

import pandas as pd

# Read the Excel file and turn the DataFrame into a list of rows
df = pd.read_excel('<your-excel-file>.xlsx')
list_urls = df.values.tolist()
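
Note that df.values.tolist() returns one list per row, which is why each URL will later be accessed as iteration[0]. With a single-column sheet, list_urls looks roughly like this (the URLs here are made-up examples):

list_urls = [
    ['https://www.example.com/product-1'],
    ['https://www.example.com/product-2'],
]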

3.- Making the request to Copyscape API

In order to use the Copyscape API and read the XML response that it returns, we will need urllib and BeautifulSoup. Making the request is very easy: we just have to use https://www.copyscape.com/api/ as the endpoint and pass the username, the password and the URL as parameters.

I noticed that sometimes the API was not able to return the matching percentage for a result, which breaks the parsing, so I decided to make use of “try” and “except”: if this happens, an “Error” message is appended to the list where the rest of the data is stored.
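
For reference, the XML that the API returns looks roughly like this (a trimmed, illustrative example built from the fields the code below reads; the URLs and figures are made up, and real responses contain more fields):

<response>
    <result>
        <url>https://www.example.com/copied-page</url>
        <title>Example page title</title>
        <textsnippet>...a fragment of the matching text...</textsnippet>
        <minwordsmatched>180</minwordsmatched>
        <viewurl>https://www.copyscape.com/view-example</viewurl>
        <percentmatched>42</percentmatched>
    </result>
</response>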

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup

username = "<your-username>"
password = "<your-password>"

for iteration in list_urls:
    # URL-encode the page address before passing it as the q parameter
    page = urlopen("https://www.copyscape.com/api/?u=" + username + "&k=" + password + "&o=csearch&c=10&q=" + quote(iteration[0], safe=""))
    soup = BeautifulSoup(page, 'lxml')

    list_duplicates = []
    for result in soup.find_all("result"):
        # Collect the fields for each reported duplicate
        row = [result.find("url").text, result.find("title").text, result.find("textsnippet").text, result.find("minwordsmatched").text, result.find("viewurl").text]
        try:
            row.append(result.find("percentmatched").text)
        except AttributeError:
            row.append("Error")
        list_duplicates.append(row)
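
Copyscape also reports request-level problems (for example, running out of credit) through an error element in the response, as described in its API documentation. It can be worth checking for it right after parsing the response; a minimal sketch, meant to sit at the top of the loop:

# Sketch: right after building soup, skip this URL if the API reported an error
error = soup.find("error")
if error is not None:
    print("Copyscape error for " + iteration[0] + ": " + error.text)
    continue  # move on to the next URL in list_urls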

4.- Saving the data in the Excel file

Finally, we will save the data in the initial Excel file from which the URLs were extracted. This piece of code has to live inside the loop where the request for each URL is made.

# Append the results as a new sheet in the workbook we read the URLs from.
# This uses mode='a' (pandas >= 1.2); the old writer.book / writer.save()
# pattern no longer works in recent pandas versions.
with pd.ExcelWriter("<your-excel-file>.xlsx", engine='openpyxl', mode='a') as writer:
    df_results = pd.DataFrame(list_duplicates, columns=['Result for the URL: ' + iteration[0], 'title', 'textsnippet', 'minwordsmatched', 'viewurl', 'percentmatched'])
    # Excel caps sheet names at 31 characters, so the slug is truncated
    df_results.to_excel(writer, sheet_name=iteration[0].split("/")[-1][:31], index=False)

This will create a tab for each URL, named after the final slug of the URL. If a URL happens to end with a slash, the last element of the split will be empty, so you would need to modify the code a little and substitute the argument sheet_name=iteration[0].split("/")[-1][:31] with sheet_name=iteration[0].split("/")[-2][:31].
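
If you prefer not to edit that index by hand, a small helper function (my own addition, not part of the original snippet) covers both cases by stripping any trailing slash first:

def sheet_name_for(url):
    # Drop a trailing slash if present, keep the last path segment,
    # and respect Excel's 31-character sheet-name limit
    return url.rstrip("/").split("/")[-1][:31]

You would then call df_results.to_excel(writer, sheet_name=sheet_name_for(iteration[0]), index=False).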

The final output in the Excel file will contain the URLs with the duplicate content, their titles, the text snippets, the number of matched words and the matching percentage in separate columns.

5.- Putting the code together

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup
import pandas as pd

username = "<your-username>"
password = "<your-password>"

# Read the URLs to be checked from the first column of the Excel file
df = pd.read_excel('<your-excel-file>.xlsx')
list_urls = df.values.tolist()

for iteration in list_urls:
    # URL-encode the page address before passing it as the q parameter
    page = urlopen("https://www.copyscape.com/api/?u=" + username + "&k=" + password + "&o=csearch&c=10&q=" + quote(iteration[0], safe=""))
    soup = BeautifulSoup(page, 'lxml')

    list_duplicates = []
    for result in soup.find_all("result"):
        # Collect the fields for each reported duplicate
        row = [result.find("url").text, result.find("title").text, result.find("textsnippet").text, result.find("minwordsmatched").text, result.find("viewurl").text]
        try:
            row.append(result.find("percentmatched").text)
        except AttributeError:
            row.append("Error")
        list_duplicates.append(row)

    # Append the results as a new sheet (mode='a' requires pandas >= 1.2)
    with pd.ExcelWriter("<your-excel-file>.xlsx", engine='openpyxl', mode='a') as writer:
        df_results = pd.DataFrame(list_duplicates, columns=['Result for the URL: ' + iteration[0], 'title', 'textsnippet', 'minwordsmatched', 'viewurl', 'percentmatched'])
        # Excel caps sheet names at 31 characters
        df_results.to_excel(writer, sheet_name=iteration[0].split("/")[-1][:31], index=False)
    

Which Python libraries will I need?

You will need pandas, urllib, openpyxl and BeautifulSoup.

Which Copyscape plan will I need?

To use the API you will need to make a payment and use the premium version. Unfortunately, the API is not available in the freemium version.

How long will it take me?

The code is already available in this post, so replicating it should be very easy.