In this short guide, you'll see how to extract all links from a website using Python.
Here you can find the short answer:
(1) Using BeautifulSoup
from bs4 import BeautifulSoup
import requests
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
(2) Using requests-html
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(url)
links = r.html.absolute_links
(3) Using Selenium for dynamic pages
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]
Now let's go through several useful examples of how to extract all links from websites with Python.
1: Extract links using BeautifulSoup
Let's start with the most popular method - using BeautifulSoup to parse HTML and extract all hyperlinks:
from bs4 import BeautifulSoup
import requests
url = 'https://www.python.org'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(f"Found {len(links)} links")
print(links[:5])
result will be:
Found 87 links
['#content', '#python-network', '/', '/psf-landing/', '/about/']
This method works well for static websites where all of the content is present in the initial HTML response. BeautifulSoup is fast, lightweight, and handles most HTML parsing needs.
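If you only care about anchor tags on a large page, you can also tell BeautifulSoup to skip everything else while parsing. Below is a minimal sketch using bs4's SoupStrainer (the URL and the early raise_for_status() check are just one reasonable setup, not part of the example above):
from bs4 import BeautifulSoup, SoupStrainer
import requests
response = requests.get('https://www.python.org')
response.raise_for_status()  # stop early on HTTP errors
# parse only <a> tags instead of the whole document
only_anchors = SoupStrainer('a')
soup = BeautifulSoup(response.content, 'html.parser', parse_only=only_anchors)
links = [a['href'] for a in soup.find_all('a', href=True)]
print(f"Found {len(links)} links")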
To get only absolute URLs (full URLs with domain), you can filter the results:
from urllib.parse import urljoin
base_url = 'https://www.python.org'
absolute_links = [urljoin(base_url, link) for link in links if link.startswith('http') or link.startswith('/')]
print(absolute_links[:3])
result:
['https://www.python.org/', 'https://www.python.org/psf-landing/', 'https://www.python.org/about/']
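If you want to normalize every href (including relative paths such as 'about/') and skip non-HTTP links, a slightly more general sketch looks like this; the filter rules here are one reasonable choice, not the only one:
from urllib.parse import urljoin, urlparse
base_url = 'https://www.python.org'
absolute_links = []
for link in links:
    full_url = urljoin(base_url, link)                # resolves relative paths like 'about/'
    if urlparse(full_url).scheme not in ('http', 'https'):
        continue                                      # skips mailto:, javascript:, tel: links
    absolute_links.append(full_url.split('#', 1)[0])  # drop any fragment part
print(len(absolute_links), absolute_links[:3])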
2: Extract unique external links
What if you want to extract only external links (links pointing to other domains)? You can filter based on the domain:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
url = 'https://www.github.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
base_domain = urlparse(url).netloc
external_links = []
for a in soup.find_all('a', href=True):
    link = a['href']
    if link.startswith('http'):
        link_domain = urlparse(link).netloc
        if link_domain != base_domain:
            external_links.append(link)
print(f"Found {len(set(external_links))} unique external links")
print(list(set(external_links))[:3])
result:
Found 12 unique external links
['https://docs.github.com', 'https://skills.github.com', 'https://support.github.com']
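To see which domains a page links out to most often, you can group the external links by domain. A small sketch using collections.Counter, reusing the external_links list from above:
from collections import Counter
from urllib.parse import urlparse
# count how many external links point to each domain
domain_counts = Counter(urlparse(link).netloc for link in external_links)
for domain, count in domain_counts.most_common(5):
    print(domain, count)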
3: Extract links from dynamic websites using Selenium
For websites that load content dynamically with JavaScript (like single-page applications), BeautifulSoup won't capture all links. Use Selenium instead:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.amazon.com')
links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]
links = [link for link in links if link]
print(f"Total links found: {len(links)}")
print(links[:5])
driver.quit()
result:
Total links found: 234
['https://www.amazon.com/gp/help/customer/display.html', 'https://www.amazon.com/ap/signin', 'https://www.amazon.com/gp/cart/view.html', 'https://www.amazon.com/prime', 'https://www.amazon.com/bestsellers']
Selenium is essential for modern websites built with React, Vue, or Angular where content loads after the initial page load.
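On heavily dynamic pages the anchor tags may not exist yet at the moment driver.get() returns, so it is safer to wait for them explicitly. A minimal sketch using Selenium's WebDriverWait (the 10-second timeout and the URL are arbitrary choices):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://www.python.org')
# wait until at least one <a> tag is present before extracting links
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))
links = [elem.get_attribute('href') for elem in driver.find_elements(By.TAG_NAME, 'a')]
driver.quit()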
4: Extract links with additional metadata
Sometimes you need more than just the URL - you might want the link text, title attribute, or CSS classes:
from bs4 import BeautifulSoup
import requests
url = 'https://news.ycombinator.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
link_data = []
for a in soup.find_all('a', href=True):
    link_data.append({
        'url': a.get('href'),
        'text': a.get_text(strip=True),
        'title': a.get('title', ''),
        'class': ' '.join(a.get('class', []))
    })
print(f"Extracted {len(link_data)} links with metadata")
print(link_data[:3])
result:
Extracted 156 links with metadata
[{'url': 'https://news.ycombinator.com', 'text': 'Hacker News', 'title': '', 'class': ''},
{'url': 'newest', 'text': 'new', 'title': '', 'class': ''},
{'url': 'front', 'text': 'past', 'title': '', 'class': ''}]
This approach is useful for content analysis, SEO audits, or building web crawlers that need context about each link.
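For example, for a simple SEO-style audit you can check which of the extracted URLs respond with an error status. A rough sketch using HEAD requests and the link_data list from above (only absolute URLs are checked, error handling is kept minimal, and note that some servers reject HEAD requests):
import requests
broken = []
for item in link_data:
    url = item['url']
    if not url.startswith('http'):
        continue  # skip relative links and fragments
    try:
        status = requests.head(url, allow_redirects=True, timeout=5).status_code
    except requests.RequestException:
        status = None
    if status is None or status >= 400:
        broken.append((url, status))
print(f"{len(broken)} potentially broken links")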
5: Save extracted links to CSV file
Finally, let's save all extracted links to a CSV file for further analysis:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.reddit.com/r/python'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
links = []
for a in soup.find_all('a', href=True):
    links.append({
        'url': a['href'],
        'text': a.get_text(strip=True)[:50]
    })
df = pd.DataFrame(links)
df.to_csv('extracted_links.csv', index=False)
print(f"Saved {len(df)} links to CSV file")
print(df.head())
result:
Saved 287 links to CSV file
                    url    text
0            /r/Python/  Python
1  /r/Python/wiki/index    Wiki
2            /r/Python/   Rules
3         /r/Python/hot     Hot
4         /r/Python/new     New
This creates a structured dataset perfect for spreadsheet analysis, data visualization, or further processing with pandas.
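If you prefer not to depend on pandas, the same file can be written with Python's built-in csv module. A small sketch reusing the links list of dictionaries built above:
import csv
with open('extracted_links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['url', 'text'])
    writer.writeheader()   # write the header row
    writer.writerows(links)  # one row per extracted link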