In this short guide, we'll explore how to read multiple JSON files from an archive and load them into a Pandas DataFrame.
A similar use case for CSV files is shown here: Parallel Processing Zip Archive CSV Files With Python and Pandas
The full code, with the explanation below:
from multiprocessing import Pool
from zipfile import ZipFile, BadZipFile
import pandas as pd
import tarfile

def process_archive(json_file):
    try:
        # zip members are opened with ZipFile.open(), tar members with TarFile.extractfile()
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_handle, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')

zip_file = 'data/41.zip'
try:
    zip_file = ZipFile(zip_file)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except BadZipFile:
    # not a zip file - fall back to tar.gz; tar members use .name, not .filename
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {text_file.name for text_file in zip_file.getmembers() if text_file.name.endswith('.json')}

p = Pool(6)  # 6 worker processes share the work
p.map(process_archive, zip_files)
In this example we define 6 parallel worker processes:
p = Pool(6)
p.map(process_archive, zip_files)
which will each run the method:
def process_archive(json_file):
    try:
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_handle, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')
The method reads the JSON lines from each archive member and appends them to the output file 'data/all_files.csv'. With a higher degree of parallelism, errors may be raised, because all worker processes append to the same CSV file and their writes can interleave.
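If those interleaved writes are a concern, a safer variant is to let each worker return its DataFrame and leave all writing to the parent process. This is a sketch, not the original method: read_member is a name introduced here, and collecting everything with pd.concat assumes the combined data fits in memory.

import os

def read_member(json_file):
    # workers only read from the archive; nothing writes concurrently
    try:
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        return pd.read_json(file_handle, lines=True)
    except Exception:
        print(json_file + '\n')
        return pd.DataFrame()  # empty frame keeps pd.concat working

p = Pool(os.cpu_count())  # size the pool from the machine instead of hard-coding 6
frames = p.map(read_member, zip_files)
pd.concat(frames).to_csv('data/all_files.csv', header=False)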
Reading the zip or tar.gz files is done by:
try:
    zip_file = ZipFile(zip_file)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except BadZipFile:
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {text_file.name for text_file in zip_file.getmembers() if text_file.name.endswith('.json')}
If ZipFile raises BadZipFile, the file is not a zip archive and we fall back to tar.gz. Mind the different attribute names: zip members expose their path as ZipInfo.filename, tar members as TarInfo.name.
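A quick way to confirm which branch was taken and what will be processed (a hypothetical check, not part of the original script):

print(type(zip_file).__name__)  # 'ZipFile' for zip, 'TarFile' for tar.gz
print(sorted(zip_files)[:5])    # a sample of the .json member names found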
Note that only .json files will be processed from the archive.
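Since every worker appended rows with header=False, the combined CSV has no header row; tell Pandas that when reading it back. A minimal sketch, assuming the default index column that to_csv writes:

import pandas as pd

# header=None - the file was written without a header row
# index_col=0 - to_csv also wrote the DataFrame index as the first column
df = pd.read_csv('data/all_files.csv', header=None, index_col=0)
print(df.shape)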