Python: Parallel Processing Multiple Zipped JSON Files Into a Pandas DataFrame
In this short guide, we'll explore how to read multiple JSON files from an archive in parallel and load them into a Pandas DataFrame.
A similar use case for CSV files is shown here: Parallel Processing Zip Archive CSV Files With Python and Pandas
The full code, followed by a step-by-step explanation:
from multiprocessing import Pool
from zipfile import ZipFile
import tarfile

import pandas as pd

def process_archive(json_file):
    try:
        # zip_file is the archive handle opened below; for a tar archive
        # use zip_file.extractfile(json_file) instead of zip_file.open(...)
        df_temp = pd.read_json(zip_file.open(json_file), lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception as exc:
        print(json_file, exc)

zip_file = 'data/41.zip'
try:
    zip_file = ZipFile(zip_file)
    zip_files = {member.filename for member in zip_file.infolist() if member.filename.endswith('.json')}
except Exception:
    # not a ZIP - fall back to a gzipped tar; note that TarInfo exposes .name, not .filename
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {member.name for member in zip_file.getmembers() if member.name.endswith('.json')}

# on Windows/macOS wrap the two lines below in: if __name__ == '__main__':
p = Pool(6)
p.map(process_archive, zip_files)
In this example we start a pool of 6 worker processes:
p = Pool(6)
p.map(process_archive, zip_files)
which apply the following function to each archive member:
def process_archive(json_file):
    try:
        df_temp = pd.read_json(zip_file.open(json_file), lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception as exc:
        print(json_file, exc)
The function reads each member as JSON lines and appends the rows to the output file 'data/all_files.csv'. Be aware that appending to a single file from several processes at once can interleave rows or raise errors at higher levels of parallelism.
Reading the .zip or .tar.gz archive is done by:
try:
    zip_file = ZipFile(zip_file)
    zip_files = {member.filename for member in zip_file.infolist() if member.filename.endswith('.json')}
except Exception:
    # not a ZIP - fall back to a gzipped tar; note that TarInfo exposes .name, not .filename
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {member.name for member in zip_file.getmembers() if member.name.endswith('.json')}
Note that only .json files from the archive will be processed.