How to scrape a page with Python Requests and BeautifulSoup
This post shows how to scrape tags from a web page using Python:
- the requests library will fetch the HTML content
- BeautifulSoup will parse and extract content.
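The parsing step can be tried without any network access. Below is a minimal sketch, assuming a made-up HTML string stands in for the content that requests would download. The variable names and the sample markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests.get(url).text would return
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p>First</p><p>Second</p></body></html>"
)

# Build a parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# Extract the <title> text and the text of every <p> tag
title = soup.title.text
paragraphs = [p.text for p in soup.find_all("p")]

print(title)
print(paragraphs)
```

Swapping the inline string for the body of a real response is the only change needed to run the same extraction on a live page.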
Install
Install the requests and beautifulsoup4 libraries if you haven't already:
pip install requests beautifulsoup4
Example 1 - BeautifulSoup extract the title and h1 headings
import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

# Text of the <title> tag
page_title = soup.title.text
print(page_title)

# Text of every <h1> tag on the page
headings = [h1.text for h1 in soup.find_all('h1')]
print(headings)
result:
Wikipedia, the free encyclopedia
['Main Page']
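Not every page contains the tag you are looking for. BeautifulSoup's find() returns None when nothing matches, so reading .text from the result raises an AttributeError. The sketch below shows one way to guard against that; the helper name first_heading and its default parameter are hypothetical, not part of either library.

```python
from bs4 import BeautifulSoup

def first_heading(html, default=None):
    """Return the text of the first <h1>, or `default` if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")  # find() returns None when no tag matches
    if h1 is None:
        return default
    # get_text(strip=True) drops leading and trailing whitespace
    return h1.get_text(strip=True)

print(first_heading("<body><h1> Main Page </h1></body>"))
print(first_heading("<body><p>No heading here</p></body>", default="(no h1)"))
```

The first call prints the heading text and the second prints the fallback value instead of crashing.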
Example 2 - BeautifulSoup extract all h2 tags
import requests
from bs4 import BeautifulSoup
def scrape_h2_tags(url):
    # Make a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all <h2> tags
        h2_tags = soup.find_all('h2')

        # Print the text content of each <h2> tag
        for h2_tag in h2_tags:
            print(h2_tag.text)
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# Example usage
url_to_scrape = "https://en.wikipedia.org/wiki/Main_Page"
scrape_h2_tags(url_to_scrape)
result:
From today's featured article
Did you know ...
In the news
On this day
From today's featured list
Today's featured picture
Other areas of Wikipedia
Wikipedia's sister projects
Wikipedia languages
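The function in Example 2 prints its results, which makes them hard to reuse. A possible variation is sketched below: it returns the headings as a list and keeps parsing separate from fetching, so the parsing part can be checked on a plain string. The function names extract_h2_texts and fetch_h2_texts, the 10-second timeout, and the sample markup are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup

def extract_h2_texts(html):
    """Return the stripped text of every <h2> tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

def fetch_h2_texts(url, timeout=10):
    """Download `url` and return its <h2> texts (timeout value is an assumption)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of checking the code by hand
    response.raise_for_status()
    return extract_h2_texts(response.text)

# Offline check of the parsing step with made-up markup
sample = "<h2>In the news</h2><p>...</p><h2> On this day </h2>"
print(extract_h2_texts(sample))
```

Calling fetch_h2_texts("https://en.wikipedia.org/wiki/Main_Page") would run the same extraction against the live page used in Example 2.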