In this short guide, we'll explore how to read multiple JSON files from an archive and load them into a Pandas DataFrame.
A similar use case for CSV files is shown here: Parallel Processing Zip Archive CSV Files With Python and Pandas
The full code, with the explanation below:
from multiprocessing import Pool
from zipfile import ZipFile, BadZipFile
import pandas as pd
import tarfile

def process_archive(json_file):
    try:
        # zip members are opened with ZipFile.open(), tar members with TarFile.extractfile()
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_handle, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')

zip_file = 'data/41.zip'
try:
    zip_file = ZipFile(zip_file)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except BadZipFile:
    # not a zip file - fall back to tar.gz; tar members use .name, not .filename
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {text_file.name for text_file in zip_file.getmembers() if text_file.name.endswith('.json')}

p = Pool(6)  # 6 worker processes share the work
p.map(process_archive, zip_files)
In this example we define 6 parallel worker processes:
p = Pool(6)
p.map(process_archive, zip_files)
which will each run the method:
def process_archive(json_file):
    try:
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_handle, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except Exception:
        print(json_file + '\n')
The method reads the JSON lines from each archive member and appends them to the output file 'data/all_files.csv'. With a higher degree of parallelism, errors may be raised, because all worker processes append to the same CSV file and their writes can interleave.
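If those interleaved writes are a concern, a safer variant is to let each worker return its DataFrame and leave all writing to the parent process. This is a sketch, not the original method: read_member is a name introduced here, and collecting everything with pd.concat assumes the combined data fits in memory.

import os

def read_member(json_file):
    # workers only read from the archive; nothing writes concurrently
    try:
        if isinstance(zip_file, ZipFile):
            file_handle = zip_file.open(json_file)
        else:
            file_handle = zip_file.extractfile(json_file)
        return pd.read_json(file_handle, lines=True)
    except Exception:
        print(json_file + '\n')
        return pd.DataFrame()  # empty frame keeps pd.concat working

p = Pool(os.cpu_count())  # size the pool from the machine instead of hard-coding 6
frames = p.map(read_member, zip_files)
pd.concat(frames).to_csv('data/all_files.csv', header=False)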
Reading the zip or tar.gz files is done by:
try:
    zip_file = ZipFile(zip_file)
    zip_files = {text_file.filename for text_file in zip_file.infolist() if text_file.filename.endswith('.json')}
except BadZipFile:
    zip_file = tarfile.open(zip_file, "r:gz")
    zip_files = {text_file.name for text_file in zip_file.getmembers() if text_file.name.endswith('.json')}
If ZipFile raises BadZipFile, the file is not a zip archive and we fall back to tar.gz. Mind the different attribute names: zip members expose their path as ZipInfo.filename, tar members as TarInfo.name.
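A quick way to confirm which branch was taken and what will be processed (a hypothetical check, not part of the original script):

print(type(zip_file).__name__)  # 'ZipFile' for zip, 'TarFile' for tar.gz
print(sorted(zip_files)[:5])    # a sample of the .json member names found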
Note that only .json files will be processed from the archive.
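Since every worker appended rows with header=False, the combined CSV has no header row; tell Pandas that when reading it back. A minimal sketch, assuming the default index column that to_csv writes:

import pandas as pd

# header=None - the file was written without a header row
# index_col=0 - to_csv also wrote the DataFrame index as the first column
df = pd.read_csv('data/all_files.csv', header=None, index_col=0)
print(df.shape)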