Extracting links from a web page is a common task in web scraping, data analysis, and automation. Python provides several libraries to accomplish this efficiently.

Below, we’ll explore different methods using:

  • BeautifulSoup
  • requests
  • urllib

along with explanations and code examples. This post is inspired by the need to extract all country links from this wiki page: Lists of cities by country.

Let's start with the link extraction:

1: Using BeautifulSoup and requests

The most popular approach combines BeautifulSoup (for parsing HTML) and requests (for fetching web pages).

Install Required Libraries

pip install beautifulsoup4 requests

Code

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)                        # fetch the page
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML

links = soup.find_all('a')       # every <a> tag on the page
for link in links:
    print(link.get('href'))      # the href attribute holds the URL

print(len(links))

The result is 964 links, and the output begins with:

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents

Explanation

  • soup.find_all('a') - finds all <a> tags (hyperlinks). You can customize it to match a specific class or apply other filtering (see the sketch below).
  • link.get('href') - extracts the href attribute (the actual URL).
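
For instance, continuing from the soup object above, find_all can be narrowed by class or by a predicate on href; the class name 'external' here is only an example, not something this page necessarily uses:

# keep only anchors carrying a specific CSS class
external_links = soup.find_all('a', class_='external')

# or filter with a custom predicate on the href attribute
http_links = soup.find_all('a', href=lambda h: h and h.startswith('http'))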

2: Using urllib (Python Standard Library)

If you prefer to avoid requests, Python’s built-in urllib can fetch the page instead; parsing below is still done with BeautifulSoup.

Code

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
html = urlopen(url).read()       # fetch the raw bytes with the standard library
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print(len(links))
for link in links:
    print(link.get('href'))

The result is exactly the same as in the previous step; the parsing is again done by BeautifulSoup.
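
If you want to avoid BeautifulSoup as well and stay entirely within the standard library, a minimal sketch using html.parser looks like this:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect href values from every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
html = urlopen(url).read().decode('utf-8')

collector = LinkCollector()
collector.feed(html)
print(len(collector.links))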

3: Filtering Out Unwanted Links

Often, you’ll want to filter out mailto:, javascript:, or internal anchors (#).

Code

from bs4 import BeautifulSoup
import requests
import re

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a', href=True):
    href = link['href']
    if re.match(r'^https?://', href):
        print(href)

The result is reduced to only 42 valid (absolute) links.

Explanation

  • The href=True filter ensures only tags with an href attribute are processed.
  • re.match(r'^https?://', href) keeps only absolute web URLs.
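
If instead you want to drop just the mailto:, javascript:, and bare-anchor (#) links mentioned earlier while keeping relative paths, an exclusion-based sketch (continuing from the soup object above) works too:

for link in soup.find_all('a', href=True):
    href = link['href']
    # skip mail links, inline JavaScript, and same-page anchors
    if href.startswith(('mailto:', 'javascript:', '#')):
        continue
    print(href)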

4: Loading Links into a Pandas DataFrame

We can extract all links from a page, filter them by a condition, and load the result into a Pandas DataFrame:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = []

for link in soup.find_all('a', href=True):
    href = link['href']
    text = link.get('title')  # the title attribute serves as the link text here

    if re.match(r'/wiki/.*', href):                # keep only internal wiki links
        links.append({'text': text, 'href': href})

df_links = pd.DataFrame(links)
df_links

The resulting DataFrame has 839 links:

                                     text                         href
0                 Visit the main page [z]              /wiki/Main_Page
1            Guides to browsing Wikipedia     /wiki/Wikipedia:Contents
2      Articles related to current events  /wiki/Portal:Current_events
3   Visit a randomly selected article [x]         /wiki/Special:Random
4  Learn about Wikipedia and how it works        /wiki/Wikipedia:About
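
Because every matched href is a relative /wiki/ path, a column of absolute URLs can be derived in one vectorized step; the column name url is just a choice:

from urllib.parse import urljoin

# resolve each relative path against the page URL
df_links['url'] = df_links['href'].apply(lambda h: urljoin(url, h))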

Advanced Filtering

We can perform advanced filtering to get only the country links:

mask = (df_links['text'] != '\n\n') & (df_links['text'] != '') & (df_links['href'] != '#')
df_links[mask].drop_duplicates()
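
To keep the cleaned result, exporting it to CSV is a one-liner (the filename is illustrative):

# persist the de-duplicated country links
df_links[mask].drop_duplicates().to_csv('country_links.csv', index=False)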

Common Issues & Solutions

  1. Relative vs. Absolute URLs
    • Some links may be relative (e.g., /about instead of https://example.com/about).
    • Fix: Use urllib.parse.urljoin() to convert them:

      from urllib.parse import urljoin
      absolute_url = urljoin(url, href)

  2. Dynamic Content (JavaScript-Rendered Links)
    • If the page loads content via JavaScript, requests won’t capture it.
    • Solution: Use selenium or requests-html for dynamic pages.

  3. Rate Limiting/Bot Detection
    • Some sites block scrapers. Use headers (e.g., User-Agent) or delays (a combined sketch follows this list):

      headers = {'User-Agent': 'Mozilla/5.0'}
      response = requests.get(url, headers=headers)
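
Putting fixes 1 and 3 together, here is a minimal sketch that fetches the page with a custom User-Agent, resolves every href against the page URL with urljoin, and pauses politely afterwards. The header string and the one-second delay are illustrative choices, not required values.

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative User-Agent string

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

absolute_links = []
for link in soup.find_all('a', href=True):
    # urljoin resolves relative paths and leaves absolute URLs untouched
    absolute_links.append(urljoin(url, link['href']))

print(absolute_links[:5])
time.sleep(1)  # polite delay before any follow-up request (value is arbitrary)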

Conclusion

Extracting links in Python is straightforward with BeautifulSoup and requests. For advanced use cases (e.g., dynamic pages or filtering), combine additional tools like selenium or regex. Always respect robots.txt and website terms of service when scraping.
