How to Scrape a Site with Infinite Scrolling - Python + Selenium
In this post, you'll learn how to scrape a website with infinite scrolling using Python and Selenium. You can use a headless or headful browser automation tool such as Selenium or Puppeteer. Here, I'll provide two examples using Selenium in Python.
Install Selenium
First, make sure you have Selenium installed:
pip install selenium
More info can be found here: selenium
You need to download the appropriate web driver for your browser (e.g., ChromeDriver for Chrome) and make sure it's on your system's PATH. Note that Selenium 4.6+ ships with Selenium Manager, which can download a matching driver for you automatically.
You can find the drivers and details here: ChromeDriver
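If you'd rather not rely on PATH, you can point Selenium at the driver binary explicitly. A minimal sketch, assuming a hypothetical driver location:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path; replace it with the actual location of your chromedriver
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)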
Example 1 - Tag existence - full page - Infinite Scrolling
Here is a Python script using Selenium to scrape a site with infinite scrolling:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Set up the browser
driver = webdriver.Chrome()

# Replace this URL with the actual URL of the site you want to scrape
url = "https://mtricht.github.io/wikiscroll/"
driver.get(url)

try:
    # Scroll to the bottom of the page repeatedly to load more content
    while True:
        # Adjust the wait time based on your site's loading speed
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Here, we're checking for the footer as an indicator that all
            # content has loaded; you might need a different element
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "footer"))
            )
            break
        except TimeoutException:
            # The indicator isn't there yet, so keep scrolling
            continue
except Exception as e:
    print(e)
finally:
    # After loading all content, you can scrape the data
    # Here, we're just printing the page source as an example
    print(driver.page_source)
    # Close the browser
    driver.quit()
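Instead of just printing the page source, you can extract specific elements before driver.quit() runs in the finally block. A quick sketch (the li tag is only an assumption about the page's markup):

items = driver.find_elements(By.TAG_NAME, "li")  # assumed item tag; adjust to your page
for item in items:
    print(item.text)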
You can scroll to the bottom by an exact pixel offset or by scrollHeight:
driver.execute_script("window.scrollTo(0, 1000)")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
You can wait for a given tag by name, class, or ID, and control the scrolling loop with a flag:

flag = True
try:
    while flag:
        driver.execute_script("window.scrollTo(0, 1000)")
        # "your-tag-name" is a placeholder; use the tag that signals the end of content
        if WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "your-tag-name"))):
            flag = False
except Exception as e:
    print(e)

The same wait works with a class name or an ID:

WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "your-class-name")))
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "footer")))
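If the element never appears within the timeout, WebDriverWait raises a TimeoutException, which you can catch to detect that no more content is coming. A small sketch, reusing the placeholder class name from above:

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "your-class-name")))
except TimeoutException:
    # The element never appeared; treat this as the end of the feed
    pass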
Example 2 - Tag search by XPath - Infinite Scrolling
You can define a custom function for scrolling and adjust parameters such as the pause time and scroll height.
The example below lists a number of topics/keywords and iterates over each one. It then keeps scrolling until all page content is loaded:
import time
import pandas as pd
from seleniumwire import webdriver  # plain selenium's webdriver also works here
from selenium.webdriver.common.by import By

def scroll(driver, timeout):
    scroll_pause_time = timeout
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(scroll_pause_time)
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # The heights are the same, so no new content was loaded
            break
        last_height = new_height

data = []
for cat in ['cat1', 'cat2']:
    # Create a new instance of the Firefox driver
    driver = webdriver.Firefox()
    # Open the target URL
    driver.get('https://mtricht.github.io/wikiscroll/')
    # Use the "scroll" function to scroll the page every N seconds
    scroll(driver, 7)
    ul = driver.find_element(By.XPATH, "/html/body/div/div/div/div[2]/div/section/ul")
    for item in ul.find_elements(By.TAG_NAME, "li"):
        name, rank, link, text = '', '', '', ''
        try:
            name = item.find_element(By.XPATH, "div/div[1]/div[2]/h5").text
            rank = item.find_element(By.XPATH, "div/div[2]/div").text
            link = item.find_element(By.XPATH, "div/div[2]/p").text
            text = item.find_element(By.XPATH, "div/div[1]/p").text
        except Exception:
            # Some items use a different markup, so retry with alternative XPaths
            try:
                name = item.find_element(By.XPATH, "a/div[1]/div[2]/h5").text
                rank = item.find_element(By.XPATH, "a/div[2]/div").text
                link = item.find_element(By.XPATH, "a/div[2]/p").text
                text = item.find_element(By.XPATH, "a/div[1]/p").text
            except Exception:
                # Skip items that match neither layout
                pass
        if name != '':
            data.append({'cat': cat, 'name': name, 'rank': rank, 'link': link, 'text': text})
    driver.quit()
The data for the DataFrame is collected by extracting element content in two ways:
- by tag name:
ul.find_elements(By.TAG_NAME, "li")
- by XPath:
item.find_element(By.XPATH, "a/div[2]/div").text
Finally we can load the data into a DataFrame:
df = pd.DataFrame(data)
df
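To persist the results, you can also write the DataFrame to a file, e.g. a CSV (the file name here is just an example):

df.to_csv('results.csv', index=False)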