How to Scrape a Site with Infinite Scrolling - Python + Selenium

In this post, you'll learn how to scrape a website with infinite scrolling in Python with Selenium. You can use a headless or headful browser automation tool such as Selenium or Puppeteer. Here, I'll provide two examples using Selenium in Python.

Install Selenium

First, make sure you have Selenium installed:

pip install selenium

More info can be found here: selenium
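
To verify the installation, you can print the installed version (a quick sanity check, nothing site-specific):

python -c "import selenium; print(selenium.__version__)"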

You need to download the appropriate web driver for your browser (e.g., ChromeDriver for Chrome) and make sure it's in your system's PATH.

You can find the drivers and details here: ChromeDriver
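
If the driver binary is not on your PATH, you can point Selenium at it explicitly, and you can also switch Chrome to headless mode. A minimal sketch, where "/path/to/chromedriver" is a placeholder path (note that Selenium 4.6+ can also download a matching driver automatically via Selenium Manager):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

# "/path/to/chromedriver" is a placeholder - use your actual driver location
service = Service(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=options)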

Example 1 - Tag existence - full page - Infinite Scrolling

Here is a Python script using Selenium to scrape a site with infinite scrolling:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the browser
driver = webdriver.Chrome()

# Replace this URL with the actual URL of the site you want to scrape
url = "https://mtricht.github.io/wikiscroll/"
driver.get(url)

try:
    # Scroll to the bottom of the page to load more content
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait until the footer is present as an indicator that the last
        # content has loaded; you might need a different element on your site.
        # A TimeoutException here also ends the loop.
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "footer")))
        break  # the footer appeared, so all content is loaded

except Exception as e:
    print(e)
finally:
    # After loading all content, you can scrape the data
    # Here, we're just printing the page source as an example
    print(driver.page_source)

    # Close the browser
    driver.quit()
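
Instead of dumping the whole page source, you can extract specific elements once everything is loaded (before calling driver.quit()). A minimal sketch, assuming the items of interest are li tags - adjust the locator for your site:

# Collect the text of every list item on the fully loaded page
for item in driver.find_elements(By.TAG_NAME, "li"):
    print(item.text)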

You can scroll to the bottom by an exact pixel offset or by scrollHeight (a step-wise sketch follows the list):

  • driver.execute_script("window.scrollTo(0, 1000)")
  • driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
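
Scrolling by a fixed offset is handy when a site only triggers loading on smaller steps. A sketch that steps down the page 1000 pixels at a time until the bottom stops moving - the step size and pause are assumptions to tune for your site:

import time

position = 0
while True:
    position += 1000  # assumed scroll step in pixels; tune per site
    driver.execute_script(f"window.scrollTo(0, {position});")
    time.sleep(1)  # assumed pause so new content can load
    if position >= driver.execute_script("return document.body.scrollHeight"):
        break  # we passed the bottom and no new content appeared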

You can wait for a given tag by name, class, or ID, or exit the loop by setting a flag:

from selenium.common.exceptions import TimeoutException

flag = True
try:
    while flag:
        driver.execute_script("window.scrollTo(0, 1000)")

        # "some-tag" is a placeholder - use a tag that appears when loading is done
        if WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "some-tag"))):
            flag = False
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "your-class-name")))
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "footer")))
except TimeoutException:
    pass  # the expected element never appeared within the timeout

Example 2 - Tag search by XPath - Infinite Scrolling

You can define a custom scrolling function where you can adjust parameters like wait time, scroll height, etc.

The example below lists a number of topics/keywords and iterates over each of them. It then scrolls the page until all content is loaded:

import time
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def scroll(driver, timeout):
    scroll_pause_time = timeout

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height

data = []

for cat in ['cat1', 'cat2']:
    # Create a new instance of the Firefox driver
    driver = webdriver.Firefox()

    # open the target URL
    driver.get('https://mtricht.github.io/wikiscroll/')
    # use "scroll" function to scroll the page every N seconds
    scroll(driver, 7)
    
    ul = driver.find_element(By.XPATH, "/html/body/div/div/div/div[2]/div/section/ul")
    for item in (ul.find_elements(By.TAG_NAME, "li")):
        name, rank, link, text = '', '', '', ''
        try:
            name = item.find_element(By.XPATH, "div/div[1]/div[2]/h5").text
            rank = item.find_element(By.XPATH, "div/div[2]/div").text
            link = item.find_element(By.XPATH, "div/div[2]/p").text
            text = item.find_element(By.XPATH, "div/div[1]/p").text
        except NoSuchElementException:
            try:
                name = item.find_element(By.XPATH, "a/div[1]/div[2]/h5").text
                rank = item.find_element(By.XPATH, "a/div[2]/div").text
                link = item.find_element(By.XPATH, "a/div[2]/p").text
                text = item.find_element(By.XPATH, "a/div[1]/p").text
            except NoSuchElementException:
                pass  # neither layout matched; skip this item
        if name != '':
            data.append({'cat': cat, 'name': name, 'rank': rank, 'link': link, 'text': text})

    # Close the browser before moving to the next category
    driver.quit()

Element content is extracted in two ways:

  • by tag name - ul.find_elements(By.TAG_NAME, "li")
  • by XPath - item.find_element(By.XPATH, "a/div[2]/div").text

Finally, we can load the data into a DataFrame:

df = pd.DataFrame(data)
df
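
From here, the results can be persisted, for example to CSV (the file name is just an example):

df.to_csv('infinite_scroll_data.csv', index=False)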

Resources