In this short guide, we'll explore how to read multiple JSON files from an archive and load them into a Pandas DataFrame.
A similar use case for CSV files is shown here: Parallel Processing Zip Archive CSV Files With Python and Pandas
Here is the full code, followed by the explanation:
from multiprocessing import Pool
from zipfile import ZipFile
import pandas as pd
import tarfile


def process_archive(json_file):
    try:
        # ZipFile members are opened with open(), TarFile members with extractfile()
        if isinstance(archive, ZipFile):
            fh = archive.open(json_file)
        else:
            fh = archive.extractfile(json_file)
        df_temp = pd.read_json(fh, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception as e:
        print(f'{json_file}: {e}')


archive_path = 'data/41.zip'

try:
    # try to open the archive as a zip file first
    archive = ZipFile(archive_path)
    json_files = {f.filename for f in archive.infolist() if f.filename.endswith('.json')}
except Exception:
    # otherwise fall back to a gzip-compressed tar archive
    archive = tarfile.open(archive_path, 'r:gz')
    json_files = {m.name for m in archive.getmembers() if m.name.endswith('.json')}

# the workers inherit `archive` via fork (the default start method on Linux)
with Pool(6) as p:
    p.map(process_archive, json_files)
In this example we create a pool of 6 parallel worker processes:
with Pool(6) as p:
    p.map(process_archive, json_files)
which run the following function:
def process_archive(json_file):
    try:
        if isinstance(archive, ZipFile):
            fh = archive.open(json_file)
        else:
            fh = archive.extractfile(json_file)
        df_temp = pd.read_json(fh, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception as e:
        print(f'{json_file}: {e}')
The function reads each JSON Lines member and appends the rows to the output file 'data/all_files.csv'. Since all workers append to the same file, higher degrees of parallelisation may interleave rows or raise errors.
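If the concurrent appends become a problem, one possible variation (a minimal sketch, not part of the original code, shown for the zip case only; it reuses the archive_path and json_files names from above) is to have each worker return its DataFrame and let the parent process perform a single write:

def read_one(json_file):
    # each worker opens the archive itself, so no file handle is shared between processes
    with ZipFile(archive_path) as zf:
        return pd.read_json(zf.open(json_file), lines=True)

with Pool(6) as p:
    frames = p.map(read_one, json_files)

# a single writer: rows from different files can no longer interleave
pd.concat(frames).to_csv('data/all_files.csv', header=False)

The combined file can then be read back with pd.read_csv('data/all_files.csv', header=None, index_col=0), since to_csv wrote the index but no header row.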
Reading the zip or tar.gz archive is done by:
try:
    archive = ZipFile(archive_path)
    json_files = {f.filename for f in archive.infolist() if f.filename.endswith('.json')}
except Exception:
    archive = tarfile.open(archive_path, 'r:gz')
    json_files = {m.name for m in archive.getmembers() if m.name.endswith('.json')}
Note that only .json files from the archive will be processed.
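One portability caveat: the script relies on the fork start method (the default on Linux), so the workers inherit the already opened archive. On platforms where spawn is the default (Windows, and macOS since Python 3.8), the pool should be created behind a main guard so that each worker can safely re-import the module and re-open the archive, roughly:

if __name__ == '__main__':
    with Pool(6) as p:
        p.map(process_archive, json_files)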