In this short guide, we will learn how to extract all text content from a webpage using Selenium in Python. Whether you're web scraping, testing web applications, or extracting data for analysis, getting visible text from pages is a fundamental Selenium operation.
Here you can find the short answer:
(1) Get all visible text
text = driver.find_element(By.TAG_NAME, 'body').text
(2) Get text from specific element
text = driver.find_element(By.CLASS_NAME, 'content').text
(3) Get inner HTML
html = driver.find_element(By.TAG_NAME, 'body').get_attribute('innerHTML')
So let's see multiple methods to extract text from webpages using Selenium.
1: Get All Visible Text from Entire Page
The simplest method to get all visible text from a webpage is accessing the body element:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.python.org')
page_text = driver.find_element(By.TAG_NAME, 'body').text
print(f"Total characters: {len(page_text)}")
print(f"\nFirst 500 characters:\n{page_text[:500]}")
driver.quit()
Output Result:
Total characters: 4523
First 500 characters:
Python
Python is a programming language that lets you work quickly and integrate systems more effectively.
Learn More
Get Started
Whether you're new to programming or an experienced developer, it's easy to learn and use Python.
Start with our Beginner's Guide
Download
Python source code and installers are available for Windows, Linux, macOS, and other platforms.
Latest: Python 3.12.1
Docs
Documentation for Python's standard library, along with tutorials and guides.
Browse Documentation
Jobs
Looking for work or looking to hire? The Python Job...
Key features:
- Returns only visible text (hidden elements excluded)
- Preserves line breaks between elements
- No HTML tags included
- Fast execution for most pages
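Because .text preserves line breaks, the returned string can be split into individual lines for further processing. The following is a minimal sketch, assuming the page text has already been read into a string. The helper name text_to_lines and the sample string are illustrative and not part of Selenium:

```python
def text_to_lines(page_text):
    """Split page text into stripped, non-empty lines."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


# Sample string standing in for driver.find_element(By.TAG_NAME, 'body').text
sample = "Python\n\n  Learn More \nGet Started\n"
lines = text_to_lines(sample)
print(lines)
```

With a real driver, the same helper could be called on the value of page_text from the example above.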
2: Get Text from Specific Elements
Extract text from specific sections like headers, paragraphs, or divs:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://news.ycombinator.com')
titles = driver.find_elements(By.CLASS_NAME, 'titleline')
print(f"Found {len(titles)} article titles:\n")
for idx, title in enumerate(titles[:10], 1):
    print(f"{idx}. {title.text}")
driver.quit()
Output Result:
Found 30 article titles:
1. Show HN: AI-powered code review tool for Python
2. Understanding Database Indexing in PostgreSQL
3. Building Scalable APIs with FastAPI
4. Machine Learning Best Practices for Production
5. Why Rust is the Future of Systems Programming
6. Docker vs Kubernetes: When to Use Each
7. Advanced Python Decorators Explained
8. Microservices Architecture Patterns
9. OAuth 2.0 Authentication Guide
10. GraphQL vs REST: A Practical Comparison
3: Find Text on Page (Search for Specific Text)
Search for specific text on a page and verify its presence:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.github.com')
page_text = driver.find_element(By.TAG_NAME, 'body').text
search_terms = ['developers', 'open source', 'repositories', 'collaboration']
print("Searching for keywords on GitHub homepage:\n")
for term in search_terms:
    if term.lower() in page_text.lower():
        print(f"✓ Found: '{term}'")
        occurrences = page_text.lower().count(term.lower())
        print(f" Appears {occurrences} time(s)\n")
    else:
        print(f"✗ Not found: '{term}'\n")
driver.quit()
Output Result:
Searching for keywords on GitHub homepage:
✓ Found: 'developers'
Appears 8 time(s)
✓ Found: 'open source'
Appears 5 time(s)
✓ Found: 'repositories'
Appears 12 time(s)
✓ Found: 'collaboration'
Appears 3 time(s)
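The substring check above also counts matches inside longer words (for example, 'developer' inside 'redeveloper'). If whole-word matches are wanted, a regular expression with word boundaries is one option. This is a sketch under that assumption. The function name count_whole_words and the sample text are hypothetical:

```python
import re


def count_whole_words(text, terms):
    """Count case-insensitive whole-word (or whole-phrase) matches per term."""
    counts = {}
    for term in terms:
        # \b anchors the match at word boundaries; re.escape guards special characters
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Sample string standing in for the page text read from the body element
sample = "Open source developers love open source. Redeveloper is not a match."
counts = count_whole_words(sample, ["developers", "open source", "developer"])
print(counts)
```

Here the plain substring count for 'developer' would be 2, while the whole-word count is 0.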
4: Get Text with XPath Selector
Use XPath for precise text extraction from complex page structures:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.wikipedia.org')
heading = driver.find_element(By.XPATH, '//h1[@class="central-textlogo"]').text
print(f"Main heading: {heading}")
language_links = driver.find_elements(By.XPATH, '//div[@class="central-featured-lang"]//strong')
print(f"\nTop {len(language_links)} languages:")
for idx, lang in enumerate(language_links, 1):
    print(f"{idx}. {lang.text}")
driver.quit()
Output Result:
Main heading: WIKIPEDIA
Top 10 languages:
1. English
2. 日本語
3. Español
4. Deutsch
5. Русский
6. Français
7. Italiano
8. 中文
9. Português
10. Polski
5: Get Inner HTML vs Text Content
Understand the difference between text and innerHTML:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://example.com')
element = driver.find_element(By.TAG_NAME, 'body')
text_content = element.text
inner_html = element.get_attribute('innerHTML')
outer_html = element.get_attribute('outerHTML')
print(f"Text content length: {len(text_content)} characters")
print(f"Inner HTML length: {len(inner_html)} characters")
print(f"Outer HTML length: {len(outer_html)} characters")
print(f"\nText content (visible only):\n{text_content[:200]}")
print(f"\nInner HTML (includes tags):\n{inner_html[:200]}")
driver.quit()
Output Result:
Text content length: 234 characters
Inner HTML length: 1456 characters
Outer HTML length: 1468 characters
Text content (visible only):
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
Inner HTML (includes tags):
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
When to use each:
- .text: Human-readable content, visible text only
- innerHTML: HTML structure with tags, includes hidden elements
- outerHTML: Complete element including wrapper tag
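If only the innerHTML string is available, the tags can be stripped with Python's standard html.parser module to get an approximation of the text. This is a rough sketch and not a replacement for .text: the parser knows nothing about CSS, so text from hidden elements is still included. The class and function names are illustrative:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML fragment, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Sample fragment standing in for element.get_attribute('innerHTML')
sample_html = "<div><h1>Example Domain</h1><style>h1{color:red}</style><p>More information...</p></div>"
extracted = html_to_text(sample_html)
print(extracted)
```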
6: Extract Text from Multiple Pages
Scrape text from multiple pages efficiently:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
urls = [
    'https://www.python.org',
    'https://www.github.com',
    'https://stackoverflow.com'
]
results = {}
for url in urls:
    driver.get(url)
    time.sleep(2)
    page_text = driver.find_element(By.TAG_NAME, 'body').text
    results[url] = {
        'text_length': len(page_text),
        'word_count': len(page_text.split()),
        'preview': page_text[:100]
    }
driver.quit()
print("Text extraction summary:\n")
for url, data in results.items():
    print(f"URL: {url}")
    print(f" Characters: {data['text_length']:,}")
    print(f" Words: {data['word_count']:,}")
    print(f" Preview: {data['preview']}...\n")
Output Result:
Text extraction summary:
URL: https://www.python.org
Characters: 4,523
Words: 678
Preview: Python Python is a programming language that lets you work quickly and integrate systems...
URL: https://www.github.com
Characters: 8,934
Words: 1,245
Preview: GitHub Where the world builds software Millions of developers and companies build, ship...
URL: https://stackoverflow.com
Characters: 12,456
Words: 1,892
Preview: Stack Overflow - Where Developers Learn, Share, & Build Careers Every developer has a...
7: Handle Dynamic Content (Wait for Text)
Wait for text to appear on pages with JavaScript-loaded content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://www.example.com')
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
    )
    text = element.text
    print(f"Dynamic content loaded:\n{text}")
except Exception as e:
    print(f"Element not found: {e}")
driver.quit()
Output Result:
Dynamic content loaded:
Welcome to our website! This content was loaded dynamically after page load.
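presence_of_element_located returns as soon as the element exists in the DOM, which can be before its text has been filled in. WebDriverWait.until accepts any callable that takes the driver, so a custom condition can wait for non-empty text instead. The sketch below assumes that behavior. element_has_text, FakeElement and FakeDriver are hypothetical names, and the fake classes exist only so the example runs without a browser:

```python
def element_has_text(by, value):
    """Build a wait condition that succeeds once the element has non-empty text."""
    def _condition(driver):
        text = driver.find_element(by, value).text.strip()
        # A falsy return value tells WebDriverWait to keep polling
        return text if text else False
    return _condition


# Stand-in objects that mimic the two driver calls used above
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, texts):
        self._texts = iter(texts)

    def find_element(self, by, value):
        return FakeElement(next(self._texts))


fake = FakeDriver(["", "   ", "Loaded!"])
condition = element_has_text("class name", "dynamic-content")
results = [condition(fake) for _ in range(3)]
print(results)
```

With a real driver, the condition would be passed to the wait: WebDriverWait(driver, 10).until(element_has_text(By.CLASS_NAME, 'dynamic-content')).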
8: Get Text Excluding Hidden Elements
Filter out hidden elements to get only truly visible text:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.example.com')
all_elements = driver.find_elements(By.XPATH, '//*')
visible_text = []
for element in all_elements:
    if element.is_displayed() and element.text.strip():
        text = element.text.strip()
        if text not in visible_text:
            visible_text.append(text)
print(f"Unique visible text blocks: {len(visible_text)}\n")
for idx, text in enumerate(visible_text[:10], 1):
    print(f"{idx}. {text[:80]}...")
driver.quit()
Output Result:
Unique visible text blocks: 45
1. Example Domain...
2. This domain is for use in illustrative examples in documents. You may use...
3. More information......
4. IANA Services...
5. Domain Names...
9: Extract Text and Save to File
Save extracted text to a file for later analysis:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.python.org')
page_text = driver.find_element(By.TAG_NAME, 'body').text
output_file = 'python_org_text.txt'
with open(output_file, 'w', encoding='utf-8') as f:
    f.write(f"URL: {driver.current_url}\n")
    f.write(f"Title: {driver.title}\n")
    f.write(f"{'='*80}\n\n")
    f.write(page_text)
print(f"✓ Text saved to {output_file}")
print(f" File size: {len(page_text):,} characters")
driver.quit()
Output Result:
✓ Text saved to python_org_text.txt
File size: 4,523 characters
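When saving text from several pages, the output file name can be derived from the URL instead of being hard-coded. This is one possible sketch using only the standard library. The helper url_to_filename and its naming scheme are assumptions for illustration:

```python
import re
from urllib.parse import urlparse


def url_to_filename(url, suffix=".txt"):
    """Turn a URL's host and path into a filesystem-friendly file name."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace every run of characters other than letters and digits with "_"
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (safe or "page") + suffix


name_home = url_to_filename("https://www.python.org")
name_page = url_to_filename("https://example.com/a/b.html?x=1")
print(name_home)
print(name_page)
```

The query string is ignored here, so two URLs that differ only in their parameters would map to the same file name.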
Troubleshooting
Problem: Empty string returned
Solution: Wait for page to load completely:
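from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# Poll for up to 10 seconds until the body contains some text
WebDriverWait(driver, 10).until(lambda d: d.find_element(By.TAG_NAME, 'body').text != '')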
Problem: Text contains extra whitespace
Solution: Clean text with string methods:
text = driver.find_element(By.TAG_NAME, 'body').text
clean_text = ' '.join(text.split())
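The one-liner above also removes every line break, which flattens the page into a single line. If the line structure should be kept, the whitespace can be cleaned line by line. This is a small sketch, and the function name normalize_whitespace is illustrative:

```python
def normalize_whitespace(text):
    """Collapse spaces and tabs inside each line and drop blank lines."""
    cleaned = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)


# Sample string standing in for the text read from the page
sample = "  Hello   world \n\n\n  Next\tline  "
normalized = normalize_whitespace(sample)
print(repr(normalized))
```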
Problem: Special characters display incorrectly
Solution: Specify UTF-8 encoding:
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(page_text)
Problem: StaleElementReferenceException
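Solution: Re-locate the element immediately before reading its text, so that the reference is fresh:
# Look the element up again instead of reusing an old reference
element = driver.find_element(By.ID, 'content')
text = element.text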
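On pages that re-render often, the element can go stale between the lookup and the read, so a small retry loop is sometimes used. The sketch below is generic: read_with_retry is a hypothetical helper, and FakeStaleError stands in for Selenium's StaleElementReferenceException so the example runs without a browser:

```python
def read_with_retry(read, retry_on, attempts=3):
    """Call read() and retry when one of the retry_on exceptions is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return read()
        except retry_on:
            # Give up and re-raise after the last allowed attempt
            if attempt == attempts:
                raise


# Stand-in exception and reader that fails twice before succeeding
class FakeStaleError(Exception):
    pass


calls = {"n": 0}


def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeStaleError("stale element")
    return "content text"


value = read_with_retry(flaky_read, FakeStaleError)
print(value, calls["n"])
```

With a real driver, the call could look like read_with_retry(lambda: driver.find_element(By.ID, 'content').text, StaleElementReferenceException), with the exception imported from selenium.common.exceptions.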
Resources
Selenium Python vs Java
To get the full HTML source of the page, use:
- python
driver.page_source
- java / groovy
driver.getPageSource();
You can get only the contents of the body element with innerHTML. Note that this returns HTML markup, not just the visible text:
- python
from selenium.webdriver.common.by import By
element = driver.find_element(By.TAG_NAME, "body")
element.get_attribute('innerHTML')
- java / groovy
element.getAttribute("innerHTML");
The code above works in most cases but may fail for some drivers (such as HtmlUnitDriver). The following Java alternative produces similar output and works more widely:
WebElement element = driver.findElement(By.id("foo"));
String contents = (String)((JavascriptExecutor)driver).executeScript("return arguments[0].innerHTML;", element);
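In Python, the same JavaScript-based approach might look like the sketch below, which uses the driver's execute_script method. The helper name inner_html_via_js is illustrative, and FakeDriver is a stand-in so the example runs without a browser:

```python
def inner_html_via_js(driver, element):
    """Read an element's innerHTML by running JavaScript in the page."""
    return driver.execute_script("return arguments[0].innerHTML;", element)


# Stand-in driver that records the call instead of talking to a browser
class FakeDriver:
    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))
        return "<p>Hello</p>"


fake = FakeDriver()
element = object()  # placeholder for a WebElement
html = inner_html_via_js(fake, element)
print(html)
```

With a real driver, element would come from a call such as driver.find_element(By.ID, "foo").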
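Full example for Python:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# Path to a local chromedriver binary, passed through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver_linux64/chromedriver'))
driver.maximize_window()
driver.get("https://www.google.com/ncr")
print(driver.find_element(By.TAG_NAME, "body").text)
driver.quit()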
result:
Gmail
Images
Sign in
Google offered in: french
A privacy reminder from Google
REMIND ME LATER
REVIEW NOW
France
PrivacyTermsSettings
AdvertisingBusinessAbout
Note that if you don't provide the path to your chromedriver executable, you may get an error like:
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home