Extracting links from a web page is a common task in web scraping, data analysis, and automation. Python provides several libraries to accomplish this efficiently.
Below, we’ll explore different methods using:
- BeautifulSoup
- requests
- urllib
along with explanations and code examples. This post is inspired by the need to extract all country links from this wiki page: Lists of cities by country.
Let's start with the link extraction:
1: Using BeautifulSoup and requests
The most popular approach combines BeautifulSoup (for parsing HTML) and requests (for fetching web pages).
Install Required Libraries
pip install beautifulsoup4 requests
Code
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

print(len(links))
The result is 964 links, with output such as:
#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
Explanation
- soup.find_all('a') finds all <a> tags (hyperlinks). You can customize it to match a specific class or apply other filters (see the example below).
- link.get('href') extracts the href attribute (the actual URL).
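For example, find_all accepts a class_ argument to keep only anchors carrying a given CSS class. A minimal sketch, assuming a hypothetical class name 'external' (adjust it to whatever the target page actually uses):
external_links = soup.find_all('a', class_='external', href=True)
for link in external_links:
    print(link['href'])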
2: Using urllib (Python Standard Library)
If you prefer not to use the requests library, Python's built-in urllib can fetch the page instead (parsing is still handled by BeautifulSoup).
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print(len(links))
for link in links:
    print(link.get('href'))
The result is exactly the same as in the previous step; only the fetching changes, while the parsing is still done by BeautifulSoup.
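If you want to avoid third-party packages entirely, the standard library's html.parser module can also collect href attributes. This is a minimal sketch (not part of the original approach), using HTMLParser instead of BeautifulSoup:
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # collect the href attribute of every <a> tag encountered
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
parser = LinkCollector()
parser.feed(urlopen(url).read().decode('utf-8'))
print(len(parser.links))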
3: Extracting Only Valid HTTP/HTTPS Links
Often, you'll want to filter out mailto:, javascript:, or internal anchors (#).
Python Code
from bs4 import BeautifulSoup
import requests
import re

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a', href=True):
    href = link['href']
    if re.match(r'^https?://', href):
        print(href)
The result is reduced to only 42 valid HTTP/HTTPS links.
Explanation
- The href=True filter ensures only tags with an href attribute are processed.
- re.match(r'^https?://', href) checks that the URL starts with http:// or https://, i.e. that it is an absolute web URL.
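An alternative to the regex check is urllib.parse.urlparse, which inspects the URL scheme directly. A small sketch of that variant:
from urllib.parse import urlparse

if urlparse(href).scheme in ('http', 'https'):
    print(href)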
4: Extract Links (with Title) to Pandas DataFrame
We can extract all links from a page, filter them by a condition, and load the result into a Pandas DataFrame:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    text = link.get('title')
    if re.match(r'/wiki/.*', href):
        links.append({'text': text, 'href': href})

df_links = pd.DataFrame(links)
df_links
The resulting DataFrame has 839 links:
| | text | href |
|---|---|---|
| 0 | Visit the main page [z] | /wiki/Main_Page |
| 1 | Guides to browsing Wikipedia | /wiki/Wikipedia:Contents |
| 2 | Articles related to current events | /wiki/Portal:Current_events |
| 3 | Visit a randomly selected article [x] | /wiki/Special:Random |
| 4 | Learn about Wikipedia and how it works | /wiki/Wikipedia:About |
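If the table should be kept around for later analysis, the DataFrame can be written to disk, for example as CSV (the filename below is just an example):
df_links.to_csv('wiki_links.csv', index=False)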
Advanced Filtering
We can perform advanced filtering to keep only the country links:
mask = (df_links['text'] != '\n\n') & (df_links['text'] != '') & (df_links['href'] != '#')
df_links[mask].drop_duplicates()
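We can also filter on the href pattern itself. A sketch, assuming the country pages follow a /wiki/List_of_cities... naming convention (verify the actual pattern on the page before relying on it):
country_mask = df_links['href'].str.startswith('/wiki/List_of_cities')
df_countries = df_links[country_mask].drop_duplicates()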
Common Issues & Solutions
- Relative vs. Absolute URLs
  - Some links may be relative (e.g., /about instead of https://example.com/about).
  - Fix: Use urllib.parse.urljoin() to convert them:
from urllib.parse import urljoin
absolute_url = urljoin(url, href)
- Dynamic Content (JavaScript-Rendered Links)
  - If the page loads content via JavaScript, requests won't capture it.
  - Solution: Use selenium or requests-html for dynamic pages (see the sketch below).
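A minimal selenium sketch for a JavaScript-heavy page (assumes a Chrome driver is available; not part of the original post):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/Lists_of_cities_by_country")
for a in driver.find_elements(By.TAG_NAME, 'a'):
    href = a.get_attribute('href')
    if href:
        print(href)
driver.quit()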
- Rate Limiting / Bot Detection
  - Some sites block scrapers. Use headers (e.g., User-Agent) or delays:
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
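When fetching several pages, requests can also be spaced out with a small delay. A sketch with a hypothetical list of URLs and an arbitrary one-second pause:
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
urls = [  # hypothetical list of pages to crawl
    "https://en.wikipedia.org/wiki/Lists_of_cities_by_country",
]
for page_url in urls:
    response = requests.get(page_url, headers=headers)
    time.sleep(1)  # pause between requests to avoid hammering the server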
Conclusion
Extracting links in Python is straightforward with BeautifulSoup and requests. For advanced use cases (e.g., dynamic pages or filtering), combine additional tools like selenium or regex. Always respect robots.txt and website terms of service when scraping.