In this short guide, we will learn how to merge multiple DataFrames in Pandas using different techniques like merge(), concat(), and join(). Whether you're combining customer data from different sources, merging sales records, or joining employee information, Pandas provides powerful methods to combine DataFrames efficiently.
Understanding when to use merge vs concat vs join is crucial for data analysis, ETL pipelines, and data science projects. Each method serves different use cases depending on how your data is structured.
1. Merging Multiple DataFrames Using merge()
The merge() function works like SQL joins and is perfect when you need to combine DataFrames based on common columns. It supports inner, outer, left, and right joins for flexible data combination.
import pandas as pd
df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Emma', 'James', 'Oliver']
})
df2 = pd.DataFrame({
'id': [3, 4, 5],
'age': [25, 30, 35]
})
df3 = pd.DataFrame({
'id': [5, 6, 7],
'city': ['New York', 'Los Angeles', 'Chicago']
})
result = pd.merge(pd.merge(df1, df2, on='id', how='outer'), df3, on='id', how='outer')
print(result)
Output Result:
| | id | name | age | city |
|---|---|---|---|---|
| 0 | 1 | Emma | NaN | NaN |
| 1 | 2 | James | NaN | NaN |
| 2 | 3 | Oliver | 25.0 | NaN |
| 3 | 4 | NaN | 30.0 | NaN |
| 4 | 5 | NaN | 35.0 | New York |
| 5 | 6 | NaN | NaN | Los Angeles |
| 6 | 7 | NaN | NaN | Chicago |
How it works: The outer join keeps all rows from both DataFrames, filling missing values with NaN. The on='id' parameter specifies the key column for matching. This is essential when working with relational data or database-style joins.
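If you only want to keep the ids from df1, the same chain works with left joins. A minimal variation on the code above, continuing with df1, df2, and df3 as defined there:
# Left joins keep every row from df1 and only the matching rows from df2 and df3
left_result = pd.merge(pd.merge(df1, df2, on='id', how='left'), df3, on='id', how='left')
print(left_result)
# Only ids 1, 2, 3 remain; age is filled for id 3 only, and city stays NaN throughout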
2. Concatenating Multiple DataFrames with concat()
The concat() function is ideal for stacking DataFrames vertically (rows) or horizontally (columns). Unlike merge, it doesn't need a key column to match on; it simply stacks the DataFrames and aligns them by column name (or by index when axis=1).
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Sarah', 'Michael'],
'Sales': [50000, 60000]
})
df2 = pd.DataFrame({
'Name': ['David', 'Lisa'],
'Revenue': [75000, 85000]
})
df3 = pd.DataFrame({
'Name': ['John', 'Anna'],
'Profit': [12000, 15000]
})
result = pd.concat([df1, df2, df3], axis=0, join='outer', ignore_index=True)
print(result)
Output Result:
| | Name | Sales | Revenue | Profit |
|---|---|---|---|---|
| 0 | Sarah | 50000.0 | NaN | NaN |
| 1 | Michael | 60000.0 | NaN | NaN |
| 2 | David | NaN | 75000.0 | NaN |
| 3 | Lisa | NaN | 85000.0 | NaN |
| 4 | John | NaN | NaN | 12000.0 |
| 5 | Anna | NaN | NaN | 15000.0 |
Key parameters explained:
- axis=0: Concatenates vertically (stacks rows). Use axis=1 for horizontal concatenation (see the sketch below)
- join='outer': Includes all columns from all DataFrames, filling missing with NaN
- ignore_index=True: Creates a new sequential index instead of keeping original indices
When to use: Perfect for combining monthly reports, quarterly data, or appending new records to existing datasets.
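For horizontal concatenation, here is a minimal sketch continuing with df1 and df2 from above; the set_index('Name') calls are only there to give both frames a shared, meaningful index and to avoid duplicate Name columns:
# axis=1 aligns rows on the index and places the columns side by side
wide = pd.concat([df1.set_index('Name'), df2.set_index('Name')], axis=1)
print(wide)
# Sarah and Michael only have Sales, David and Lisa only have Revenue,
# so the unmatched column is filled with NaN for each row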
3. Joining DataFrames Based on Index with join()
The join() method combines DataFrames based on their index values rather than column names. When your DataFrames already share a meaningful index, index-based joins are typically faster than column-based merges, and join() gives the most concise syntax for side-by-side combinations.
import pandas as pd
sales_df = pd.DataFrame({
'Q1_Sales': [100000, 150000, 200000],
'Q2_Sales': [120000, 160000, 210000]
}, index=['Product_A', 'Product_B', 'Product_C'])
costs_df = pd.DataFrame({
'Q1_Costs': [70000, 90000, 130000],
'Q2_Costs': [75000, 95000, 135000]
}, index=['Product_A', 'Product_B', 'Product_C'])
profit_df = pd.DataFrame({
'Q1_Profit': [30000, 60000, 70000],
'Q2_Profit': [45000, 65000, 75000]
}, index=['Product_A', 'Product_B', 'Product_C'])
result = sales_df.join([costs_df, profit_df])
print(result)
Output Result:
| | Q1_Sales | Q2_Sales | Q1_Costs | Q2_Costs | Q1_Profit | Q2_Profit |
|---|---|---|---|---|---|---|
| Product_A | 100000 | 120000 | 70000 | 75000 | 30000 | 45000 |
| Product_B | 150000 | 160000 | 90000 | 95000 | 60000 | 65000 |
| Product_C | 200000 | 210000 | 130000 | 135000 | 70000 | 75000 |
Why use join(): When your DataFrames are already indexed properly, join gives the cleanest syntax and takes advantage of index alignment. It's commonly used in time series analysis, financial data, and panel data where indices represent dates or identifiers.
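join() defaults to how='left'. As a small sketch of what that means, the extra_df frame below is hypothetical (not part of the example above) and is deliberately missing Product_C:
# Default left join keeps every row of sales_df; how='inner' keeps only shared index labels
extra_df = pd.DataFrame({'Returns': [5000, 8000]}, index=['Product_A', 'Product_B'])
print(sales_df.join(extra_df))               # Product_C gets NaN in Returns
print(sales_df.join(extra_df, how='inner'))  # Product_C is dropped entirely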
4. Real-World Example: Combining Customer Data
Here's a practical example combining customer information from multiple sources using all three methods to compare their behavior.
import pandas as pd
customers = pd.DataFrame({
'customer_id': [101, 102, 103],
'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown']
})
orders = pd.DataFrame({
'customer_id': [101, 102, 104],
'total_spent': [5000, 7500, 3000]
})
loyalty = pd.DataFrame({
'customer_id': [101, 103, 105],
'points': [2500, 1800, 900]
})
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
merged_data = pd.merge(merged_data, loyalty, on='customer_id', how='inner')
print("Merged Result (Inner Join - Only Matching Customers):")
print(merged_data)
print()
all_customers = pd.merge(customers, orders, on='customer_id', how='outer')
all_customers = pd.merge(all_customers, loyalty, on='customer_id', how='outer')
print("Merged Result (Outer Join - All Customers):")
print(all_customers)
Output Result:

Merged Result (Inner Join - Only Matching Customers):

| | customer_id | name | total_spent | points |
|---|---|---|---|---|
| 0 | 101 | Alice Johnson | 5000 | 2500 |

Merged Result (Outer Join - All Customers):

| | customer_id | name | total_spent | points |
|---|---|---|---|---|
| 0 | 101 | Alice Johnson | 5000.0 | 2500.0 |
| 1 | 102 | Bob Smith | 7500.0 | NaN |
| 2 | 103 | Charlie Brown | NaN | 1800.0 |
| 3 | 104 | NaN | 3000.0 | NaN |
| 4 | 105 | NaN | NaN | 900.0 |
Real-world insight: The inner join shows only customers present in all datasets (complete records), while the outer join reveals data gaps - customers without orders or loyalty points, which is valuable for data quality analysis.
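For this kind of data-quality check, merge() also accepts indicator=True, which adds a _merge column recording where each row came from ('both', 'left_only', or 'right_only'). A short sketch using the customers and orders frames from this example:
# indicator=True flags the origin of every row in the merged result
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(audit[audit['_merge'] != 'both'])  # customers without orders, and orders without a known customer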
Common Use Cases
CRM Data Integration: Merging customer profiles, purchase history, and support tickets
customers.merge(purchases, on='customer_id').merge(tickets, on='customer_id')
Financial Analysis: Combining quarterly revenue, expenses, and profit reports
pd.concat([q1_data, q2_data, q3_data, q4_data], ignore_index=True)
E-commerce Analytics: Joining product catalogs with inventory and sales data
products.join([inventory, sales_stats])
ETL Pipelines: Merging data from multiple database tables or API sources
Time Series Analysis: Concatenating daily, weekly, or monthly datasets
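As a sketch of the time series case, the daily frames below are hypothetical; the point is that concat() preserves a DatetimeIndex, so the combined result can be sorted and resampled afterwards:
import pandas as pd
# Hypothetical daily sales frames with a DatetimeIndex
day1 = pd.DataFrame({'sales': [100, 120]}, index=pd.to_datetime(['2024-01-01', '2024-01-02']))
day2 = pd.DataFrame({'sales': [90, 110]}, index=pd.to_datetime(['2024-01-03', '2024-01-04']))
combined = pd.concat([day1, day2]).sort_index()  # one continuous daily series
print(combined.resample('W').sum())              # e.g. roll the daily data up to weekly totals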
Join Types Quick Reference
| Join Type | Description | Use When |
|---|---|---|
| inner | Only matching rows | Need complete records |
| outer | All rows from both | Want to see all data |
| left | All from left, matching from right | Preserve main dataset |
| right | All from right, matching from left | Preserve secondary dataset |
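To see the four join types side by side, here is a small sketch reusing the customers (ids 101-103) and orders (ids 101, 102, 104) frames from the real-world example above:
# Compare how many rows each join type returns
for how in ['inner', 'outer', 'left', 'right']:
    joined = pd.merge(customers, orders, on='customer_id', how=how)
    print(f"{how}: {len(joined)} rows -> ids {sorted(joined['customer_id'])}")
# inner: 2 rows, outer: 4 rows, left: 3 rows, right: 3 rows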
Method Comparison
| Method | Best For | Key Parameter | Speed |
|---|---|---|---|
| merge() | Column-based joins | on='column' | Medium |
| concat() | Stacking DataFrames | axis=0 or 1 | Fast |
| join() | Index-based joins | Index matching | Fast (index-based) |
Choosing the Right Method
- Have common columns? → Use merge() with appropriate join type
- Stacking similar DataFrames? → Use concat() with axis parameter
- Already indexed data? → Use join() for best performance
- Multiple DataFrames to merge? → Use reduce() with merge() or multiple concat()
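A minimal sketch of the reduce() pattern mentioned above, continuing with df1, df2, and df3 from the first example:
from functools import reduce
# Fold an outer merge on 'id' across any number of DataFrames
frames = [df1, df2, df3]
merged = reduce(lambda left, right: pd.merge(left, right, on='id', how='outer'), frames)
print(merged)  # same result as the nested pd.merge() calls in section 1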
For most database-style operations where you need to match records by ID or key, merge() is your best choice: it's the most flexible method, and an outer join handles missing data gracefully by keeping every record from both sides.