Scraping the Google Play Store can be a useful exercise. It can help you find new affiliate opportunities in apps that rank well for keywords related to your business and have lots of installs, or set a benchmark if you have your own app and want to start working on ASO, since you will need an idea of how many reviews and installs the best-performing competitors have in order to rank in the first positions for your targeted keywords.

In today’s post I’ll show you how to use Python, and specifically Selenium, to scrape the Google Play Store and end up with an Excel file like the one below, containing the apps that show up for a search, their URLs, number of stars, number of reviews, installs and email addresses.

1.- How does the Google Play Store search URL work?

Before getting started with the coding, we need to understand how the search URL works so we can make the HTTP request with Selenium.

The base URL is https://play.google.com/store/search and we can use four parameters to filter by:

  • Search “keyword”: we use the parameter q.
  • Type of search: we use the parameter c. The value of this parameter must be apps if we want a listing of apps.
  • Language: we use the parameter hl.
  • Country: we use the parameter gl.

In my example I scraped the list of apps shown in Spain, in Spanish, for the search casino, using the URL: https://play.google.com/store/search?q=casino&c=apps&hl=es&gl=es.
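If you would rather build these search URLs programmatically, here is a minimal sketch using Python’s standard library (build_search_url is just an illustrative helper name, not part of any Play Store API):

from urllib.parse import urlencode

def build_search_url(keyword, language="es", country="es"):
    # Assemble the four parameters described above into a query string
    params = {"q": keyword, "c": "apps", "hl": language, "gl": country}
    return "https://play.google.com/store/search?" + urlencode(params)

print(build_search_url("casino"))
# https://play.google.com/store/search?q=casino&c=apps&hl=es&gl=es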

2.- Using Selenium to scrape the list of apps

Now that we know how the search URLs are built, we can start scraping some app listings. To start with, we will need to install and import Selenium, the web driver manager and time.
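If you do not have the packages yet, both are available on PyPI:

pip install selenium webdriver-manager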

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

And after this we can build our first web driver and make our first HTTP request:

driver = webdriver.Chrome(ChromeDriverManager().install())  # downloads and wires up a matching ChromeDriver
driver.get('https://play.google.com/store/search?q=casino&c=apps&hl=es&gl=es')
time.sleep(10)  # give the results page time to load
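As a side note, Chrome can also run headless so no browser window pops up; a minimal sketch using standard Selenium options (everything else stays the same):

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)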

The most challenging step I found when scraping the Google Play Store is scrolling down to the bottom of the page to load all the apps that turn up for the query. Fortunately, I managed to solve this thanks to a piece of code I found on a Stack Overflow page, although I needed to increase the scroll pause time, as 0.5 seconds was not enough for the Google Play Store to fully load the next 50 results.

SCROLL_PAUSE_TIME = 5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(SCROLL_PAUSE_TIME)

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
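As a defensive variant, if you are worried about the page growing indefinitely, you can cap the number of scroll iterations; the limit of 30 below is an arbitrary assumption you can tune:

MAX_SCROLLS = 30  # arbitrary safety limit, adjust as needed
last_height = driver.execute_script("return document.body.scrollHeight")
for _ in range(MAX_SCROLLS):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height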

After loading the entire list of apps, we can start scraping. First, we will create a list with all the links pointing to the app description pages. These are easy to distinguish because they contain “details?id” in their URLs. After gathering all the links, we use the dict.fromkeys trick to remove duplicates while keeping the original order.

links_games = []
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    if "details?id" in elem.get_attribute("href"):
        links_games.append(elem.get_attribute("href"))

links_games = list(dict.fromkeys(links_games))
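At this point it can be worth printing how many unique app pages were collected as a quick sanity check:

print(len(links_games), "unique app pages collected")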

3.- Scraping the metrics for each app page

Now that we have the list of app pages appearing for the query, we need to figure out how to scrape the fields we are interested in. In my case, I decided to scrape five metrics: the app name, the number of stars, the number of reviews, the number of installs and the developer’s email address. Depending on the element, we will use either the class name or the tag name to extract this data.

3.1.- Extracting the app name

Extracting the app name is pretty straightforward, as it is marked up with an h1 tag, so we only have to select the element by its tag name.

header1 = driver.find_element_by_tag_name("h1")  # header1.text holds the app name

3.2.- Extracting the number of stars

To select the number of stars element, we will use its class name, BHMmbe.

star = driver.find_element_by_class_name("BHMmbe")

3.3.- Extracting the number of reviews

We will use the same tactic as for the number of stars and select this element by its class name, EymY4b.

comments = driver.find_element_by_class_name("EymY4b")

3.4.- Extracting the number of installs and email address

This extraction might be the trickiest one, as every additional information field uses the same class name, so I needed to find a workaround.

First we extract the content of each additional information field. Because we select by class name and both the div and the span inside it share that class, every value comes back twice; we therefore keep only the even indexes with the condition x % 2 == 0 to drop the duplicates.

others = driver.find_elements_by_class_name("htlgb")
list_others = []
for x in range(len(others)):
    if x % 2 == 0:  # keep only the even indexes to skip the duplicated values
        list_others.append(others[x].text)

The next step is creating a list with the additional information titles, whose length will match the length of the list we have just created. To build this list we will use the titles’ class name, BgcNfc.

titles = driver.find_elements_by_class_name("BgcNfc")

Finally, we will iterate over the list of additional information titles and use conditional statements to store the data we are interested in. As my browser is in Spanish, the installs field is titled “Descargas” (“Downloads”), so when a title equals “Descargas” I use its index to store the number of installs from the matching position in list_others. The same logic applies to the email address, which sits in the “Desarrollador” (“Developer”) section: we split that section’s content into lines, iterate over them with a for loop and store the one containing “@”.

# "iteration" is the app URL; it comes from the main loop shown in section 3.5
# The Spanish locale uses a decimal comma for the stars, so swap it for a dot
list_elements = [iteration, header1.text, float(star.text.replace(",", ".")), comments.text.split()[0]]
for x in range(len(titles)):
    if titles[x].text == "Descargas":
        list_elements.append(list_others[x])
    if titles[x].text == "Desarrollador":
        for y in list_others[x].split("\n"):
            if "@" in y:
                list_elements.append(y)
                break

3.5.- Putting everything together

If we put the previous pieces of code together…

list_all_elements = []
for iteration in links_games:
    try:
        driver.get(iteration)
        print(iteration)
        time.sleep(3)

        header1 = driver.find_element_by_tag_name("h1")
        star = driver.find_element_by_class_name("BHMmbe")

        others = driver.find_elements_by_class_name("htlgb")
        list_others = []
        for x in range(len(others)):
            if x % 2 == 0:
                list_others.append(others[x].text)

        titles = driver.find_elements_by_class_name("BgcNfc")
        comments = driver.find_element_by_class_name("EymY4b")

        list_elements = [iteration, header1.text, float(star.text.replace(",", ".")), comments.text.split()[0]]
        for x in range(len(titles)):
            if titles[x].text == "Descargas":
                list_elements.append(list_others[x])
            if titles[x].text == "Desarrollador":
                for y in list_others[x].split("\n"):
                    if "@" in y:
                        list_elements.append(y)
                        break

        list_all_elements.append(list_elements)
    except Exception as e:
        print(e)

This piece of code will return a list called “list_all_elements” containing the scraped data for every single app from the initial listing.
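Once the loop finishes, it is also good practice to close the browser:

driver.quit()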

4.- Exporting to Excel

Finally, we can export this data to an Excel file using pandas. This will create an Excel file like the one displayed at the beginning of the article.

import pandas as pd

df = pd.DataFrame(list_all_elements, columns=['URL', 'Name', 'Stars', 'Comments', 'Installs', 'Email Address'])
df.to_excel('scraping_playstore.xlsx', header=True, index=False)
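Note that to_excel relies on an Excel writer engine such as openpyxl being installed (pip install openpyxl). If a plain text file is enough for your use case, the same DataFrame can be written to CSV instead:

df.to_csv('scraping_playstore.csv', index=False)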

That is all, folks! I hope you found this post interesting. If you have any questions or would like to give me some feedback, do not hesitate to get in touch!