Extracting links from a web page is a common task in web scraping, data analysis, and automation. Python provides several libraries to accomplish this efficiently.
Below, we’ll explore different methods using:
- BeautifulSoup
- requests
- urllib
along with explanations and code examples. This post is inspired by the need to extract all country links from this wiki page: Lists of cities by country.
Let's start with the link extraction:
1: Using BeautifulSoup and requests
The most popular approach combines BeautifulSoup (for parsing HTML) and requests (for fetching web pages).
Install Required Libraries
pip install beautifulsoup4 requests
Code
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
print(len(links))
The result is 964 links, and the output starts with:
#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
Explanation
- soup.find_all('a') - finds all <a> tags (hyperlinks). You can customize it to extract a certain class or do additional filtering, as sketched below.
- link.get('href') - extracts the href attribute (the actual URL).
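For example, to skip the navigation and footer links and search only the article body, you can narrow the parsing to a specific container first. A minimal sketch, assuming Wikipedia keeps its article body in the div with id mw-content-text (check the page source if the markup changes):

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Limit the search to the article body instead of the whole page
content = soup.find('div', id='mw-content-text')
if content:
    for link in content.find_all('a', href=True):
        print(link['href'])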
2: Using urllib (Python Standard Library)
If you prefer avoiding third-party libraries, Python’s built-in urllib works too.
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')
print(len(links))
for link in links:
    print(link.get('href'))
The result is exactly the same as in the previous step; the parsing is still done by BeautifulSoup, only the page is fetched with urllib instead of requests. A standard-library-only variant is sketched below.
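If you want to avoid third-party libraries completely, the standard library can also do the parsing. A minimal sketch using html.parser.HTMLParser instead of BeautifulSoup (an alternative approach, not part of the original example):

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects the href attribute of every <a> tag it encounters
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
parser = LinkCollector()
parser.feed(urlopen(url).read().decode('utf-8'))
print(len(parser.links))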
3: Extracting Only Valid HTTP/HTTPS Links
Often, you’ll want to filter out mailto:, javascript:, or internal anchors (#).
Python Code
from bs4 import BeautifulSoup
import requests
import re
url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a', href=True):
    href = link['href']
    if re.match(r'^https?://', href):
        print(href)
The result is reduced to only 42 valid links.
Explanation
- The href=True filter ensures only tags with an href attribute are processed.
- re.match(r'^https?://', href) checks for valid web URLs.
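If you would rather keep relative links such as /wiki/... instead of discarding them, one option is to resolve them to absolute URLs first and then apply the same scheme check. A short sketch that reuses the url and soup objects from the code above (the count will be higher than 42, since relative links are now included):

from urllib.parse import urljoin

# Resolve relative links (e.g. /wiki/Main_Page) against the page URL,
# then keep only http/https results; mailto: and javascript: links are
# dropped, and bare #anchors are skipped explicitly because they would
# otherwise resolve to the page URL itself
absolute_links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('#'):
        continue
    full_url = urljoin(url, href)
    if full_url.startswith(('http://', 'https://')):
        absolute_links.append(full_url)

print(len(absolute_links))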
4: Extract Links (with Title) to Pandas DataFrame
We can extract all links from a page, filter them by a condition, and load the result into a Pandas DataFrame:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    text = link.get('title')
    if re.match(r'/wiki/.*', href):
        links.append({'text': text, 'href': href})
df_links = pd.DataFrame(links)
df_links
The resulting DataFrame has 839 links:
| | text | href |
|---|---|---|
| 0 | Visit the main page [z] | /wiki/Main_Page |
| 1 | Guides to browsing Wikipedia | /wiki/Wikipedia:Contents |
| 2 | Articles related to current events | /wiki/Portal:Current_events |
| 3 | Visit a randomly selected article [x] | /wiki/Special:Random |
| 4 | Learn about Wikipedia and how it works | /wiki/Wikipedia:About |
Advanced Filtering
We can perform advanced filtering to get only the country links by:
mask = (df_links['text'] != '\n\n') & (df_links['text'] != '') & (df_links['href'] != '#')
df_links[mask].drop_duplicates()
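Once filtered, the DataFrame can be saved like any other pandas object. A small usage sketch, reusing the mask from above (the file name links.csv is only an example):

# Keep the filtered, de-duplicated links and write them to a CSV file
df_countries = df_links[mask].drop_duplicates()
df_countries.to_csv('links.csv', index=False)
print(df_countries.shape)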
Common Issues & Solutions
- Relative vs. Absolute URLs
  - Some links may be relative (e.g., /about instead of https://example.com/about).
  - Fix: Use urllib.parse.urljoin() to convert them:
from urllib.parse import urljoin
absolute_url = urljoin(url, href)
- Dynamic Content (JavaScript-Rendered Links)
  - If the page loads content via JavaScript, requests won't capture it.
  - Solution: Use selenium or requests-html for dynamic pages (see the sketch after this list).
- Rate Limiting / Bot Detection
  - Some sites block scrapers. Use headers (e.g., User-Agent) or delays:
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
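For the JavaScript-rendered case mentioned above, the page has to be loaded in a real browser before parsing. A minimal selenium sketch, assuming selenium 4+ with a Chrome driver available; the Wikipedia page used in this post does not need it, so this is only an illustration:

from selenium import webdriver
from bs4 import BeautifulSoup

# Render the page in a headless browser so JavaScript runs,
# then hand the resulting HTML to BeautifulSoup as before
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://en.wikipedia.org/wiki/Lists_of_cities_by_country")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('a', href=True):
        print(link['href'])
finally:
    driver.quit()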
Conclusion
Extracting links in Python is straightforward with BeautifulSoup and requests. For advanced use cases (e.g., dynamic pages or filtering), combine additional tools like selenium or regex. Always respect robots.txt and website terms of service when scraping.
For more details, refer to the official BeautifulSoup and requests documentation.