Python Parallel Processing Multiple Zipped JSON Files Into Pandas DataFrame

In this short guide, we'll explore how to read multiple JSON files from an archive and load them into a Pandas DataFrame.

A similar use case for CSV files is covered here: Parallel Processing Zip Archive CSV Files With Python and Pandas

The full code, followed by the explanation:

from multiprocessing import Pool
from zipfile import ZipFile
import tarfile

import pandas as pd


def process_archive(json_file):
    try:
        # ZipFile members are opened with .open(); TarFile uses .extractfile()
        if isinstance(zip_file, ZipFile):
            file_obj = zip_file.open(json_file)
        else:
            file_obj = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_obj, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')


archive_path = 'data/41.zip'

try:
    # try to open the archive as a zip file
    zip_file = ZipFile(archive_path)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except Exception:
    # fall back to a tar.gz archive
    zip_file = tarfile.open(archive_path, "r:gz")
    zip_files = {member.name for member in zip_file.getmembers() if member.name.endswith('.json')}


# the workers inherit the open archive handle via fork (the default start method on Linux)
p = Pool(6)
p.map(process_archive, zip_files)

In this example we define a pool of 6 worker processes:

p = Pool(6)
p.map(process_archive, zip_files)

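The pool size is a tuning knob rather than a fixed value. A common starting point is the number of CPU cores; a minimal sketch using multiprocessing.cpu_count() (swapping this in for the hard-coded 6 is an assumption, not part of the original code):

from multiprocessing import Pool, cpu_count

# size the pool to the machine instead of hard-coding 6
p = Pool(cpu_count())
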
Each worker runs the process_archive method:

def process_archive(json_file):
    try:
        # ZipFile members are opened with .open(); TarFile uses .extractfile()
        if isinstance(zip_file, ZipFile):
            file_obj = zip_file.open(json_file)
        else:
            file_obj = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_obj, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')

The method reads the JSON lines from a single archive member and appends them to the output file 'data/all_files.csv'. With a higher degree of parallelism, errors might be raised, because multiple processes append to the same file at the same time.
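
If you do hit such errors, a safer pattern is to return the DataFrames from the workers and write the CSV once in the parent process, so only a single process ever touches the output file. A minimal sketch, assuming the archive was opened as a ZipFile and that zip_file and zip_files are the same globals as above (read_member is a hypothetical helper):

def read_member(json_file):
    # return the DataFrame instead of writing it from the worker
    return pd.read_json(zip_file.open(json_file), lines=True)

p = Pool(6)
dfs = p.map(read_member, zip_files)
# single writer: no concurrent appends to the CSV
pd.concat(dfs).to_csv('data/all_files.csv', header=False)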

Reading the zip or tar.gz archive is done by:

try:
    # try to open the archive as a zip file
    zip_file = ZipFile(archive_path)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except Exception:
    # fall back to a tar.gz archive
    zip_file = tarfile.open(archive_path, "r:gz")
    zip_files = {member.name for member in zip_file.getmembers() if member.name.endswith('.json')}

Note that only .json files will be processed from the archive.
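
To verify the result, the combined CSV can be loaded back into a single DataFrame. Since the rows were appended with header=False, there is no header row in the file (the path matches the output used above):

import pandas as pd

# the appended CSV has no header row
df = pd.read_csv('data/all_files.csv', header=None)
print(df.shape)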