Extracting links from a web page is a common task in web scraping, data analysis, and automation. Python provides several libraries to accomplish this efficiently.
Below, we’ll explore different methods using:
- BeautifulSoup
- requests
- urllib
along with explanations and code examples. This post is inspired by the need to extract all country links from this wiki page: Lists of cities by country.
Let's start with the link extraction:
1: Using BeautifulSoup and requests
The most popular approach combines BeautifulSoup (for parsing HTML) and requests (for fetching web pages).
Install Required Libraries
pip install beautifulsoup4 requests
Code
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

print(len(links))
The result is 964 links, with output such as:
#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
Explanation
- soup.find_all('a') finds all <a> tags (hyperlinks). You can customize it to match a specific class or apply other filters (see the example below).
- link.get('href') extracts the href attribute (the actual URL).
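For example, find_all accepts a class_ argument to keep only anchors carrying a given CSS class. A minimal sketch, assuming a hypothetical class name 'external' (adjust it to whatever the target page actually uses):
external_links = soup.find_all('a', class_='external', href=True)
for link in external_links:
    print(link['href'])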
2: Using urllib (Python Standard Library)
If you prefer not to use the requests library, Python's built-in urllib can fetch the page instead (parsing is still handled by BeautifulSoup).
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print(len(links))
for link in links:
    print(link.get('href'))
The result is exactly the same as in the previous step; only the fetching changes, while the parsing is still done by BeautifulSoup.
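If you want to avoid third-party packages entirely, the standard library's html.parser module can also collect href attributes. This is a minimal sketch (not part of the original approach), using HTMLParser instead of BeautifulSoup:
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # collect the href attribute of every <a> tag encountered
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
parser = LinkCollector()
parser.feed(urlopen(url).read().decode('utf-8'))
print(len(parser.links))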
3: Extracting Only Valid HTTP/HTTPS Links
Often, you'll want to filter out mailto:, javascript:, or internal anchors (#).
Python Code
from bs4 import BeautifulSoup
import requests
import re

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a', href=True):
    href = link['href']
    if re.match(r'^https?://', href):
        print(href)
The result is reduced to only 42 valid HTTP/HTTPS links.
Explanation
- The href=True filter ensures only tags with an href attribute are processed.
- re.match(r'^https?://', href) checks that the URL starts with http:// or https://, i.e. that it is an absolute web URL.
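An alternative to the regex check is urllib.parse.urlparse, which inspects the URL scheme directly. A small sketch of that variant:
from urllib.parse import urlparse

if urlparse(href).scheme in ('http', 'https'):
    print(href)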
4: Extract Links (with Title) to Pandas DataFrame
We can extract all links from a page, filter them by a condition, and load the result into a Pandas DataFrame:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://en.wikipedia.org/wiki/Lists_of_cities_by_country"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    text = link.get('title')
    if re.match(r'/wiki/.*', href):
        links.append({'text': text, 'href': href})

df_links = pd.DataFrame(links)
df_links
The resulting DataFrame has 839 links:
| | text | href |
|---|---|---|
| 0 | Visit the main page [z] | /wiki/Main_Page |
| 1 | Guides to browsing Wikipedia | /wiki/Wikipedia:Contents |
| 2 | Articles related to current events | /wiki/Portal:Current_events |
| 3 | Visit a randomly selected article [x] | /wiki/Special:Random |
| 4 | Learn about Wikipedia and how it works | /wiki/Wikipedia:About |
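If the table should be kept around for later analysis, the DataFrame can be written to disk, for example as CSV (the filename below is just an example):
df_links.to_csv('wiki_links.csv', index=False)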
Advanced Filtering
We can perform advanced filtering to keep only the country links:
mask = (df_links['text'] != '\n\n') & (df_links['text'] != '') & (df_links['href'] != '#')
df_links[mask].drop_duplicates()
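We can also filter on the href pattern itself. A sketch, assuming the country pages follow a /wiki/List_of_cities... naming convention (verify the actual pattern on the page before relying on it):
country_mask = df_links['href'].str.startswith('/wiki/List_of_cities')
df_countries = df_links[country_mask].drop_duplicates()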
Common Issues & Solutions
- Relative vs. Absolute URLs
  - Some links may be relative (e.g., /about instead of https://example.com/about).
  - Fix: Use urllib.parse.urljoin() to convert them:
from urllib.parse import urljoin
absolute_url = urljoin(url, href)
- Dynamic Content (JavaScript-Rendered Links)
  - If the page loads content via JavaScript, requests won't capture it.
  - Solution: Use selenium or requests-html for dynamic pages (see the sketch below).
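A minimal selenium sketch for a JavaScript-heavy page (assumes a Chrome driver is available; not part of the original post):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/Lists_of_cities_by_country")
for a in driver.find_elements(By.TAG_NAME, 'a'):
    href = a.get_attribute('href')
    if href:
        print(href)
driver.quit()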
- Rate Limiting / Bot Detection
  - Some sites block scrapers. Use headers (e.g., User-Agent) or delays:
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
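When fetching several pages, requests can also be spaced out with a small delay. A sketch with a hypothetical list of URLs and an arbitrary one-second pause:
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
urls = [  # hypothetical list of pages to crawl
    "https://en.wikipedia.org/wiki/Lists_of_cities_by_country",
]
for page_url in urls:
    response = requests.get(page_url, headers=headers)
    time.sleep(1)  # pause between requests to avoid hammering the server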
Conclusion
Extracting links in Python is straightforward with BeautifulSoup and requests. For advanced use cases (e.g., dynamic pages or filtering), combine additional tools like selenium or regex. Always respect robots.txt and website terms of service when scraping.