How to Extract Text from HTML in Python

Learn three practical ways to extract clean text from HTML files or strings in Python: removing tags, dropping script and style content, and decoding entities.

Sample HTML

html = """
<html>
    <body>
        <h1>Title</h1>
        <p>Hello, world! This is a <a href="#">link</a>.</p>
        <script>alert('ignore me');</script>
        <style>.hidden { display: none; }</style>
        <p>Another paragraph with &amp; entity &#39;quote&#39;.</p>
    </body>
</html>
"""

1. Using BeautifulSoup (Recommended)

BeautifulSoup (the beautifulsoup4 package) is a robust, widely used library for parsing HTML and extracting text.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements so their contents never reach the output
for tag in soup(["script", "style"]):
    tag.decompose()

# Get text with customizable separator
text = soup.get_text(separator=' ', strip=True)

print(text)

Output:
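Title Hello, world! This is a link . Another paragraph with & entity 'quote'.

Notes:

Handles malformed HTML gracefully.
Automatically decodes entities (e.g., &amp; → & and &#39; → ').
The separator is inserted between every text fragment, including around inline tags such as <a>, which is why the output reads "link ." rather than "link.".
Use separator='\n' to put each text fragment on its own line.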
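
Because separator='\n' also breaks lines at inline tags, paragraph-level separation takes one more step. The following is a minimal sketch, assuming the text of interest lives in <h1> and <p> elements (an assumption about this sample, not a general rule); it extracts each block element separately and joins the results with newlines.

```python
from bs4 import BeautifulSoup

html = """<h1>Title</h1>
<p>Hello, world! This is a <a href="#">link</a>.</p>
<p>Another paragraph with &amp; entity &#39;quote&#39;.</p>"""

soup = BeautifulSoup(html, "html.parser")

# One line per text fragment: the inline <a> splits its paragraph in three.
per_fragment = soup.get_text(separator="\n", strip=True)

# One line per block element: inline tags stay inside their paragraph.
blocks = soup.find_all(["h1", "p"])
per_block = "\n".join(block.get_text().strip() for block in blocks)

print(per_block)
```

Adjust the tag list passed to find_all to match the block elements your documents actually use.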

2. Using BeautifulSoup with stripped_strings

A cleaner one-liner alternative:

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(html, "html.parser").stripped_strings)

print(text)

Output:
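Title Hello, world! This is a link . Another paragraph with & entity 'quote'.

Notes:

stripped_strings yields each text fragment with leading and trailing whitespace removed and skips whitespace-only fragments.
Join with a space to avoid words running together; this also leaves a space before punctuation that follows an inline tag.
Recent BeautifulSoup releases leave <script> and <style> contents out of stripped_strings; on older releases, remove those tags first as in method 1.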
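
To see where the space before the period comes from, it can help to inspect the fragments directly. The sketch below lists what stripped_strings yields and shows one possible alternative: collapsing whitespace in plain get_text() output. That alternative relies on the source HTML having whitespace between block elements, which holds for this sample but is an assumption in general.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <h1>Title</h1>
        <p>Hello, world! This is a <a href="#">link</a>.</p>
        <script>alert('ignore me');</script>
        <style>.hidden { display: none; }</style>
        <p>Another paragraph with &amp; entity &#39;quote&#39;.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop script/style explicitly so the result does not depend on the bs4 version.
for tag in soup(["script", "style"]):
    tag.decompose()

# Each fragment is a separate string; "link" and "." are two of them.
fragments = list(soup.stripped_strings)
print(fragments)

# Alternative: concatenate fragments as-is, then collapse whitespace runs.
normalized = " ".join(soup.get_text().split())
print(normalized)
```

With minified HTML that has no whitespace between tags, the second approach can glue adjacent blocks together, so the separator-based join remains the safer default.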

3. Using html2text

html2text converts HTML to readable, Markdown-formatted text. It skips script and style content, and the options below drop link targets and images.

import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

text = h.handle(html)

print(text)

Output:

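# Title

Hello, world! This is a link.

Another paragraph with & entity 'quote'.

Notes:

Decodes entities and skips script and style content without extra code.
Outputs Markdown (note the # prefix on the heading); strip that formatting if plain text is needed.
html2text is GPL-licensed: check that its copyleft terms fit your project before redistributing software that includes it.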

Performance Notes

BeautifulSoup is the go-to choice: reliable, actively maintained, and handles real-world HTML well.
For very large files, consider streaming parsers or tools like trafilatura for web content extraction.
Avoid regex for HTML parsing: it is error-prone with nested or malformed tags.
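
As an illustration of the streaming idea, here is a minimal sketch built on the standard library's html.parser.HTMLParser, which accepts input incrementally through feed(). The class name TextExtractor and the function extract_text are hypothetical helpers written for this example, not part of any library. For contrast, the last lines apply a naive tag-stripping regex to the same input.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._parts).split())


def extract_text(lines):
    """Feed an iterable of lines (for example, an open file) to the parser."""
    parser = TextExtractor()
    for line in lines:
        parser.feed(line)
    parser.close()
    return parser.text()


html = """
<html>
    <body>
        <h1>Title</h1>
        <p>Hello, world! This is a <a href="#">link</a>.</p>
        <script>alert('ignore me');</script>
        <style>.hidden { display: none; }</style>
        <p>Another paragraph with &amp; entity &#39;quote&#39;.</p>
    </body>
</html>
"""

# Simulate reading a file line by line instead of loading it whole.
streamed = extract_text(html.splitlines(keepends=True))
print(streamed)

# For contrast: a naive regex keeps script code and leaves entities encoded.
naive = re.sub(r"<[^>]+>", "", html)
print("alert(" in naive, "&amp;" in naive)
```

The sketch reads its input incrementally but still accumulates the extracted text in memory; for very large documents, handle_data could write each piece to an output file instead.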

These methods produce text similar to copying from a browser. Choose based on your needs for formatting and licensing.
