In this article, we'll see how to read/unzip file(s) from zip or tar.gz with Python. We will describe the extraction of single or multiple files from the archive.
If you are interested in parallel extraction from an archive, then you can check: Python Parallel Processing Multiple Zipped JSON Files Into Pandas DataFrame
Step 1: Get info from Zip Or Tar.gz Archive with Python
First, we can check the contents of the zip file with this code snippet:
from zipfile import ZipFile
archive = 'file.zip'  # named "archive" so it does not clash with the zipfile module name
z = ZipFile(archive)
z.infolist()
The result:
<ZipInfo filename='text1.txt' filemode='-rw-rw-r--' external_attr=0x8020 file_size=0>]
From this we can find the two filenames and their sizes:
- pandas-dataframe-background-color-based-condition-value-python.png
- text1.txt
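The same kind of listing can be sketched for tar and tar.gz archives with the tarfile module. The helper name `list_tar_members` is hypothetical and only serves this illustration; it assumes the archive is readable and returns the name and size of every regular file.

```python
import tarfile


def list_tar_members(archive_path):
    """Return (name, size) pairs for every regular file in a tar or tar.gz archive."""
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        # Skip directories and links, keep only regular files.
        return [(member.name, member.size) for member in tar.getmembers() if member.isfile()]
```

Calling `list_tar_members('file.tar.gz')` would give a list of tuples that plays the same role as `infolist()` does for zip files.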
Step 2: List and Read all files from Archive with Python
Next, we can collect the names of all files in the archive into a list:
from zipfile import ZipFile
archive = 'file.zip'
zip_file = ZipFile(archive)
[text_file.filename for text_file in zip_file.infolist() ]
Result:
['pandas-dataframe-background-color-based-condition-value-python.png',
'text1.txt']
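If you would like to filter them (for example, only the .json ones) and read the files as Pandas DataFrames, you can do:
import pandas as pd
from zipfile import ZipFile
archive = 'file.zip'
zip_file = ZipFile(archive)
# Map each .json member name to a DataFrame read directly from the archive.
# pd.read_json assumes each file holds JSON that pandas can parse into a table.
dfs = {text_file.filename: pd.read_json(zip_file.open(text_file.filename)) for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
dfs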
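Reading does not have to go through Pandas. As a minimal sketch, the contents of plain-text members can be loaded into a dictionary without extracting anything to disk. The function name `read_text_members` and its `suffix` and `encoding` parameters are assumptions made for this example, and the files are assumed to be UTF-8 encoded.

```python
from zipfile import ZipFile


def read_text_members(archive_path, suffix=".txt", encoding="utf-8"):
    """Return {member name: decoded text} for members whose name ends with suffix."""
    with ZipFile(archive_path) as zip_file:
        return {
            # ZipFile.read returns the raw bytes of one member.
            name: zip_file.read(name).decode(encoding)
            for name in zip_file.namelist()
            if name.endswith(suffix)
        }
```

For example, `read_text_members('file.zip')` would return the text of every `.txt` file in the archive, keyed by filename.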
Step 3: Extract files from zip archive With Python
The zipfile module can be used to extract files from a zip archive in Python. Basic usage is shown below:
import zipfile
archive = 'file.zip'
directory_to_extract_to = 'extracted'  # assumed output folder; change as needed
with zipfile.ZipFile(archive, 'r') as zip_file:
    zip_file.extractall(directory_to_extract_to)
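`extractall` also accepts a `members` argument, so only part of the archive needs to be unpacked. The sketch below assumes we want every member with a given suffix; `extract_matching` is a hypothetical helper name, not part of zipfile.

```python
import zipfile


def extract_matching(archive_path, target_dir, suffix):
    """Extract only members whose name ends with suffix; return their names."""
    with zipfile.ZipFile(archive_path, "r") as zip_file:
        wanted = [name for name in zip_file.namelist() if name.endswith(suffix)]
        # members restricts extraction to the selected names.
        zip_file.extractall(target_dir, members=wanted)
    return wanted
```

A call such as `extract_matching('file.zip', 'extracted', '.json')` would leave all other files inside the archive untouched.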
Step 4: Extract files from Tar/Tar.gz With Python
For Tar/Tar.gz files we can use the code below to extract the files. It uses the tarfile module and distinguishes between the two types in order to use the proper extraction mode:
import tarfile
archive = 'file.tar.gz'
if archive.endswith("tar.gz"):
    tar = tarfile.open(archive, "r:gz")
elif archive.endswith("tar"):
    tar = tarfile.open(archive, "r:")
else:
    raise ValueError("Expected a .tar or .tar.gz file")
tar.extractall()
tar.close()
Note: All files from the archive will be extracted into the current working directory of the script.
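If the current working directory is not where the files should land, `extractall` takes a `path` argument. The following is a sketch under that assumption; `extract_tar` is a hypothetical helper, and the mode `"r:*"` is used so that tarfile detects the compression itself instead of checking the file extension.

```python
import tarfile
from pathlib import Path


def extract_tar(archive_path, target_dir):
    """Extract a tar or tar.gz archive into target_dir; return the top-level names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The with-block closes the archive even if extraction fails.
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=target)
    return sorted(entry.name for entry in target.iterdir())
```

For archives from untrusted sources, newer Python versions also offer a `filter` argument on `extractall`; it is left out here to keep the sketch close to the code above.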
Step 5: Extract single file from Archive
If you would like to get just a single file from the archive, you can use the method zipObject.extract(fileName, 'temp_py'). Basic usage is shown below:
import zipfile
archive = 'file.zip'
with zipfile.ZipFile(archive, 'r') as zip_file:
    zip_file.extract('text1.txt', '.')
In this example we extract the file 'text1.txt' into the current working directory. If you would like to change the output directory, then you can change the second parameter, '.'.
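A single file can be pulled out of a tar or tar.gz archive in a similar way. As a rough sketch, the hypothetical helper `read_tar_member` below uses `tarfile.extractfile` to return the bytes of one member without writing anything to disk; it assumes the member is a regular file.

```python
import tarfile


def read_tar_member(archive_path, member_name):
    """Return the raw bytes of one regular file inside a tar or tar.gz archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        # extractfile raises KeyError for unknown names and
        # returns None for members that are not regular files or links.
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        return member.read()
```

To write the file to disk instead, `tar.extract(member_name, path)` mirrors the `zip_file.extract` call shown above.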
Conclusion
In this tutorial, we covered how to extract single or multiple files from an archive with Python, using two different Python modules: zipfile and tarfile.
You've also learned how to list and get info from archived files.