In this short guide, you'll see how to extract all links from a website using Python.

Here you can find the short answer (in each snippet, url holds the address of the page you want to scrape):

(1) Using BeautifulSoup

from bs4 import BeautifulSoup
import requests

html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]

(2) Using requests-html

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)
links = r.html.absolute_links

(3) Using Selenium for dynamic pages

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)
links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]

So let's see several useful examples of how to extract all links from a website with Python.

Let's start with the most popular method - using BeautifulSoup to parse HTML and extract all hyperlinks:

from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = [a.get('href') for a in soup.find_all('a', href=True)]

print(f"Found {len(links)} links")
print(links[:5])

result will be:

Found 87 links
['#content', '#python-network', '/', '/psf-landing/', '/about/']

This method works perfectly for static websites where all content loads immediately. BeautifulSoup is fast, lightweight, and handles most HTML parsing needs.
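
In practice it's a good idea to guard the request against timeouts and HTTP errors before parsing. Here is a minimal sketch (the get_links helper and the 10-second timeout are just illustrative choices, not part of the examples above):

import requests
from bs4 import BeautifulSoup

def get_links(url):
    # hypothetical helper: return all href values from a page,
    # or an empty list if the request fails
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise an error for 4xx/5xx status codes
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return []
    soup = BeautifulSoup(response.content, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

print(len(get_links('https://www.python.org')))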

To get only absolute URLs (full URLs that include the domain), resolve each link against the base URL and skip fragment-only anchors such as '#content':

from urllib.parse import urljoin

base_url = 'https://www.python.org'
absolute_links = [urljoin(base_url, link) for link in links
                  if link.startswith('http') or link.startswith('/')]

print(absolute_links[:3])

result:

['https://www.python.org/', 'https://www.python.org/psf-landing/', 'https://www.python.org/about/']
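
The filter above drops fragment-only anchors like '#content' entirely. If you'd rather keep every link but strip the '#...' part and remove duplicates, urldefrag from the standard library can help; a small sketch reusing the links list from above:

from urllib.parse import urljoin, urldefrag

base_url = 'https://www.python.org'

clean_links = []
for link in links:
    absolute = urljoin(base_url, link)           # resolve relative paths
    absolute, _fragment = urldefrag(absolute)    # strip the '#...' fragment
    if absolute and absolute not in clean_links:
        clean_links.append(absolute)             # keep unique links, in order

print(clean_links[:5])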

What if you want to extract only external links (links pointing to other domains)? You can filter based on the domain:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

url = 'https://www.github.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

base_domain = urlparse(url).netloc
external_links = []

for a in soup.find_all('a', href=True):
    link = a['href']
    if link.startswith('http'):
        link_domain = urlparse(link).netloc
        if link_domain != base_domain:
            external_links.append(link)

print(f"Found {len(set(external_links))} unique external links")
print(list(set(external_links))[:3])

result:

Found 12 unique external links
['https://docs.github.com', 'https://skills.github.com', 'https://support.github.com']
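
Conversely, to keep only internal links (links that stay on the same domain), you can flip the comparison; a short sketch reusing the soup and base_domain from the example above (relative links have an empty domain, so they count as internal):

internal_links = []

for a in soup.find_all('a', href=True):
    link = a['href']
    link_domain = urlparse(link).netloc
    if link_domain in ('', base_domain):  # empty netloc means a relative link
        internal_links.append(link)

print(f"Found {len(set(internal_links))} unique internal links")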

For websites that load content dynamically with JavaScript (like single-page applications), BeautifulSoup won't capture all links. Use Selenium instead:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome without opening a browser window

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.amazon.com')

links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]
links = [link for link in links if link]  # drop None values from anchors without an href

print(f"Total links found: {len(links)}")
print(links[:5])

driver.quit()

result:

Total links found: 234
['https://www.amazon.com/gp/help/customer/display.html', 'https://www.amazon.com/ap/signin', 'https://www.amazon.com/gp/cart/view.html', 'https://www.amazon.com/prime', 'https://www.amazon.com/bestsellers']

Selenium is essential for modern websites built with React, Vue, or Angular where content loads after the initial page load.
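
Because such pages may keep inserting links after the initial load, it usually helps to wait explicitly until anchor elements are present before collecting them. A minimal sketch using Selenium's explicit waits (the python.org URL and the 10-second timeout are just placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.python.org')

# wait up to 10 seconds for at least one <a> element to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'a'))
)

links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]
links = [link for link in links if link]

print(f"Links after waiting: {len(links)}")
driver.quit()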

Sometimes you need more than just the URL - you might want the link text, title attribute, or CSS classes:

from bs4 import BeautifulSoup
import requests

url = 'https://news.ycombinator.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

link_data = []
for a in soup.find_all('a', href=True):
    link_data.append({
        'url': a.get('href'),
        'text': a.get_text(strip=True),
        'title': a.get('title', ''),
        'class': ' '.join(a.get('class', []))
    })

print(f"Extracted {len(link_data)} links with metadata")
print(link_data[:3])

result:

Extracted 156 links with metadata
[{'url': 'https://news.ycombinator.com', 'text': 'Hacker News', 'title': '', 'class': ''}, 
 {'url': 'newest', 'text': 'new', 'title': '', 'class': ''}, 
 {'url': 'front', 'text': 'past', 'title': '', 'class': ''}]

This approach is useful for content analysis, SEO audits, or building web crawlers that need context about each link.
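
For example, once the metadata is collected you can keep only links that have visible anchor text, or count how often each CSS class appears on the page; a short sketch built on the link_data list above:

from collections import Counter

# links that actually have visible anchor text
named_links = [item for item in link_data if item['text']]

# how often each CSS class combination appears on anchor tags
class_counts = Counter(item['class'] for item in link_data if item['class'])

print(f"{len(named_links)} links have visible text")
print(class_counts.most_common(3))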

Finally, let's save all extracted links to a CSV file for further analysis:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.reddit.com/r/python'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for a in soup.find_all('a', href=True):
    links.append({
        'url': a['href'],
        'text': a.get_text(strip=True)[:50]
    })

df = pd.DataFrame(links)
df.to_csv('extracted_links.csv', index=False)

print(f"Saved {len(df)} links to CSV file")
print(df.head())

result:

Saved 287 links to CSV file
                    url    text
0            /r/Python/  Python
1  /r/Python/wiki/index    Wiki
2            /r/Python/   Rules
3         /r/Python/hot     Hot
4         /r/Python/new     New

This creates a structured dataset perfect for spreadsheet analysis, data visualization, or further processing with pandas.
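
From here you can load the CSV back into pandas at any time and run quick checks, for example counting how many of the saved links are absolute versus relative (a small sketch, assuming the extracted_links.csv file created above):

import pandas as pd

df = pd.read_csv('extracted_links.csv')

# classify each saved link as absolute (starts with 'http') or relative
df['is_absolute'] = df['url'].astype(str).str.startswith('http')

print(df['is_absolute'].value_counts())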

Resources