Learn the most effective ways to extract clean text from HTML files or strings in Python, removing tags, scripts, and handling entities.
Sample HTML
html = """
<html>
<body>
<h1>Title</h1>
<p>Hello, world! This is a <a href="#">link</a>.</p>
<script>alert('ignore me');</script>
<style>.hidden { display: none; }</style>
<p>Another paragraph with &amp; entity 'quote'.</p>
</body>
</html>
"""
1. Using BeautifulSoup (Recommended)
BeautifulSoup is the most robust and popular method for parsing HTML and extracting text.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Remove script and style elements
for tag in soup(["script", "style"]):
    tag.extract()
# Get text with customizable separator
text = soup.get_text(separator=' ', strip=True)
print(text)
Output:
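Title Hello, world! This is a link . Another paragraph with & entity 'quote'.
Notes:
- Handles malformed HTML gracefully.
- Automatically decodes entities (e.g., &amp; → &).
- The separator is inserted between every text node, including those inside inline tags, which is why a space appears before the period after "link".
- Use separator='\n' to separate blocks with newlines; text inside inline tags such as <a> then also lands on its own line.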
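When paragraph separation matters but inline tags such as <a> should stay on the same line as the surrounding text, one option is to extract text block by block instead of relying on a single separator. The sketch below is an illustration only: html_to_lines and BLOCK_TAGS are hypothetical names, not part of BeautifulSoup, and the tag list is an assumption to adjust for your documents.

```python
from bs4 import BeautifulSoup

# Assumed set of block-level tags; extend it for your own documents.
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p", "li"]


def html_to_lines(markup: str) -> str:
    """Return one line of text per block-level element (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")

    # Drop non-visible content before collecting text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    lines = []
    for block in soup.find_all(BLOCK_TAGS):
        # No separator here, so text around inline tags is not split;
        # split()/join collapses runs of whitespace into single spaces.
        line = " ".join(block.get_text().split())
        if line:
            lines.append(line)
    return "\n".join(lines)
```

Nested block elements (for example a <p> inside an <li>) would be reported twice by this sketch, so it fits flat documents best.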
2. Using BeautifulSoup with stripped_strings
A cleaner one-liner alternative:
from bs4 import BeautifulSoup
text = ' '.join(BeautifulSoup(html, "html.parser").stripped_strings)
print(text)
Output:
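Title Hello, world! This is a link . Another paragraph with & entity 'quote'.
Notes:
- stripped_strings removes extra whitespace automatically.
- Join with a space to avoid words running together. Because every text node is joined this way, punctuation that follows an inline tag ends up preceded by a space ("link .").
- This one-liner does not remove <script> or <style> explicitly. Recent Beautiful Soup 4 releases leave their contents out of the generated strings; on older releases, remove those tags first as in method 1.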
3. Using html2text
Converts HTML to readable, Markdown-formatted text. It skips script and style contents, and with ignore_links = True it keeps the link text but drops the URLs.
import html2text
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
text = h.handle(html)
print(text)
Output:
# Title

Hello, world! This is a link.

Another paragraph with & entity 'quote'.
Notes:
- Great for entity handling and ignoring unwanted content.
- Outputs Markdown; strip formatting if plain text is needed.
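- Note: html2text is GPL-licensed. The GPL does not forbid commercial use, but its copyleft terms apply if you distribute software that includes it, so check it against your project's license.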
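If plain text is needed, the Markdown markers can be removed after conversion. Here is a minimal sketch, assuming the converted text contains only ATX headings (# ...) and **bold** spans. strip_markdown is a hypothetical helper, not part of html2text, and it operates on the Markdown string, not on HTML.

```python
import re


def strip_markdown(md: str) -> str:
    """Remove a few common Markdown markers (hypothetical helper)."""
    # Heading markers at the start of a line: "# Title" -> "Title".
    text = re.sub(r"^#{1,6}[ \t]+", "", md, flags=re.MULTILINE)
    # Bold spans: "**word**" -> "word".
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Other constructs (lists, emphasis, block quotes) are left untouched here; add patterns only for the markers that actually occur in your output.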
Performance Notes
- BeautifulSoup is the go-to choice: reliable, actively maintained, and handles real-world HTML well.
- For very large files, consider streaming parsers or tools like trafilatura for web content extraction.
- Avoid regex for HTML parsing: it's error-prone with nested or malformed tags.
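As one illustration of the streaming approach, the standard library's html.parser.HTMLParser accepts input in pieces through feed(), so the whole document never has to be parsed into a tree. The sketch below is an assumption-laden example, not a drop-in replacement for BeautifulSoup: TextExtractor, extract_text_streaming, and the BLOCK tag set are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed block-level tags whose end marks a line break.
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data is called.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Normalize whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text_streaming(pieces):
    """Feed an iterable of HTML string pieces to the parser one at a time."""
    parser = TextExtractor()
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return parser.text()
```

For a file, the pieces could come from reading fixed-size blocks in a loop. The collected text still grows in memory here; writing each line out as it completes would be the next step for truly large inputs.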
These methods produce text similar to copying from a browser. Choose based on your needs for formatting and licensing.