Learn the most effective ways to extract clean text from HTML files or strings in Python: stripping tags, dropping script and style content, and decoding entities.

Sample HTML

html = """
<html>
    <body>
        <h1>Title</h1>
        <p>Hello, world! This is a <a href="#">link</a>.</p>
        <script>alert('ignore me');</script>
        <style>.hidden { display: none; }</style>
        <p>Another paragraph with &amp; entity &#39;quote&#39;.</p>
    </body>
</html>
"""

1. Using BeautifulSoup with get_text()

BeautifulSoup is a widely used, forgiving HTML parser and a solid default for extracting text.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements together with their contents
for tag in soup(["script", "style"]):
    tag.decompose()

# Get text with customizable separator
text = soup.get_text(separator=' ', strip=True)

print(text)

Output:
Title Hello, world! This is a link . Another paragraph with & entity 'quote'.

Notes:

  • Handles malformed HTML gracefully.
  • Automatically decodes entities (e.g., &amp; becomes &).
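  • The separator is inserted between every text node, so an inline tag such as <a> can leave a space before the punctuation that follows it.
  • Use separator='\n' to put each text node on its own line; this also breaks lines at inline tags, not only between paragraphs.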
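
If you want one line per paragraph or heading rather than one line per text node, one option is to call get_text() on each block element separately. The following is a minimal sketch, not part of the original example: the helper name blocks_to_lines and the BLOCK_TAGS list are assumptions chosen for illustration, and nested block tags (for example a <p> inside an <li>) would be emitted twice.

```python
from bs4 import BeautifulSoup

# Hypothetical list of tags treated as "one line each"; extend as needed.
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p", "li"]


def blocks_to_lines(html: str) -> str:
    """Return one line of text per block-level element."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop script and style elements so their contents never reach the output.
    for tag in soup(["script", "style"]):
        tag.decompose()

    lines = []
    for block in soup.find_all(BLOCK_TAGS):
        # get_text() without a separator keeps inline tags glued to their
        # neighbours; split()/join() collapses runs of whitespace.
        line = " ".join(block.get_text().split())
        if line:
            lines.append(line)
    return "\n".join(lines)


sample = "<h1>Title</h1><p>Hello, <a href='#'>link</a>.</p><p>A &amp; B</p>"
print(blocks_to_lines(sample))
```

Because each block is extracted without a separator, the link text stays attached to the period that follows it.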

2. Using BeautifulSoup with stripped_strings

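A compact one-liner alternative. Recent BeautifulSoup 4 releases leave <script> and <style> contents out of stripped_strings; with older versions, remove those tags first as in the previous example.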

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(html, "html.parser").stripped_strings)

print(text)

Output:
Title Hello, world! This is a link . Another paragraph with & entity 'quote'.

Notes:

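  • stripped_strings strips leading and trailing whitespace from each text node and skips nodes that are only whitespace.
  • Join with a space to avoid words running together; the trade-off is a space before punctuation that directly follows an inline tag.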
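
If the space before punctuation matters, a small post-processing step can tidy it up. This is a rough sketch using only the standard library; the function name tidy_punctuation is made up for illustration, and the pattern is deliberately simple, so it would also alter text where a space before punctuation is intentional.

```python
import re


def tidy_punctuation(text: str) -> str:
    """Remove whitespace that a space-join left before closing punctuation."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


joined = "Title Hello, world! This is a link . Another paragraph."
print(tidy_punctuation(joined))
# Title Hello, world! This is a link. Another paragraph.
```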

3. Using html2text

Converts HTML to readable, Markdown-flavored text and skips script and style content. With ignore_links enabled, link text is kept and the URLs are dropped.

import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

text = h.handle(html)

print(text)

Output:

# Title

Hello, world! This is a link.

Another paragraph with & entity 'quote'.

Notes:

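  • Decodes entities and skips script and style content without extra code.
  • Outputs Markdown; strip formatting if plain text is needed.
  • Note: html2text is distributed under the GPL. Commercial use is allowed, but check that its copyleft terms fit your project before bundling it.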
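
To strip the most common Markdown formatting from the result, a couple of regular expressions are often enough. The sketch below uses only the standard library and runs on a hard-coded string that stands in for html2text output. The helper name strip_basic_markdown is hypothetical, and it covers only heading markers and double-marker bold, not every Markdown construct.

```python
import re


def strip_basic_markdown(md: str) -> str:
    """Remove heading markers and bold markers from Markdown text."""
    # Leading "#", "##", ... at the start of a line (ATX headings).
    text = re.sub(r"^ {0,3}#{1,6}[ \t]+", "", md, flags=re.MULTILINE)
    # **bold** or __bold__ -> bold
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


markdown_text = "# Title\n\nHello, **world**!\n\n"
print(strip_basic_markdown(markdown_text))
```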

Performance Notes

  • BeautifulSoup is the go-to choice: reliable, actively maintained, and handles real-world HTML well.
  • For very large files, consider streaming parsers or tools like trafilatura for web content extraction.
  • Avoid regex for HTML parsing — it's error-prone with nested or malformed tags.
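
As one illustration of the streaming approach, the standard library's html.parser.HTMLParser accepts input in chunks through feed(), so the whole document never has to be held in memory as a single string. The sketch below is an assumption-laden example rather than a drop-in replacement for BeautifulSoup: the class name StreamingTextExtractor, the extract_text helper, and the BLOCKS set are invented for illustration, and the extracted text itself is still accumulated in memory.

```python
from html.parser import HTMLParser


class StreamingTextExtractor(HTMLParser):
    """Accumulate visible text from HTML fed in arbitrary chunks."""

    SKIPPED = {"script", "style"}  # contents of these tags are dropped
    BLOCKS = {"p", "div", "br", "li", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}  # treated as word breaks

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &#39;.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self._parts.append(" ")

    def handle_data(self, data):
        # Text may arrive split at chunk boundaries, so keep it raw here
        # and normalise whitespace only once, in text().
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        return " ".join("".join(self._parts).split())


def extract_text(chunks):
    """Feed an iterable of string chunks and return the collected text."""
    parser = StreamingTextExtractor()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.text()
```

With a file, the chunks could come from repeated calls to read() with a fixed size; the parser buffers incomplete tags and entities until the rest arrives.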

These methods produce text similar to copying from a browser. Choose based on your needs for formatting and licensing.
