In this short guide, we'll explore how to read multiple JSON files from an archive and load them into a Pandas DataFrame.

A similar use case for CSV files is covered here: Parallel Processing Zip Archive CSV Files With Python and Pandas

Here is the full code, followed by an explanation:

from multiprocessing import Pool
from zipfile import BadZipFile, ZipFile
import tarfile

import pandas as pd


def process_archive(json_file):
    try:
        # ZipFile and TarFile expose different member-access methods
        if isinstance(zip_file, ZipFile):
            file_obj = zip_file.open(json_file)
        else:
            file_obj = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_obj, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except ValueError:
        # report files that could not be parsed as JSON lines
        print(json_file + '\n')


archive_path = 'data/41.zip'

try:
    zip_file = ZipFile(archive_path)
    zip_files = {text_file.filename
                 for text_file in zip_file.infolist()
                 if text_file.filename.endswith('.json')}
except BadZipFile:
    # not a zip archive - fall back to tar.gz
    zip_file = tarfile.open(archive_path, 'r:gz')
    zip_files = {member.name
                 for member in zip_file.getmembers()
                 if member.name.endswith('.json')}

p = Pool(6)
p.map(process_archive, zip_files)

In this example we define a pool of 6 parallel worker processes:

p = Pool(6)
p.map(process_archive, zip_files)

which process the archive members with the following function:

def process_archive(json_file):
    try:
        if isinstance(zip_file, ZipFile):
            file_obj = zip_file.open(json_file)
        else:
            file_obj = zip_file.extractfile(json_file)
        df_temp = pd.read_json(file_obj, lines=True)
        df_temp.to_csv('data/all_files.csv', mode='a', header=False)
    except ValueError:
        print(json_file + '\n')

The function reads each file as JSON lines and appends the rows to the output file 'data/all_files.csv'. With a higher degree of parallelisation, concurrent appends to the same file may interleave and raise errors.
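One way to avoid concurrent writes entirely is to have each worker return a DataFrame and let the parent process perform a single concatenation and write. This is a sketch of that pattern, not the article's code; the `load_frame` helper is hypothetical and stands in for reading one archive member:

```python
from multiprocessing import Pool

import pandas as pd


def load_frame(record):
    # Hypothetical worker: builds a one-row DataFrame from one input item.
    # In the article's setting this would call pd.read_json on a single
    # archive member instead of appending to a shared CSV.
    return pd.DataFrame([record])


if __name__ == '__main__':
    records = [{'a': 1}, {'a': 2}, {'a': 3}]
    with Pool(2) as pool:
        frames = pool.map(load_frame, records)
    # Single writer: only the parent concatenates the partial results,
    # so workers never append to the same file concurrently.
    df = pd.concat(frames, ignore_index=True)
    print(len(df))
```

The trade-off is memory: all partial DataFrames must fit in the parent process before the final write.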

Detecting and opening the zip or tar.gz archive is done by:

try:
    zip_file = ZipFile(archive_path)
    zip_files = {text_file.filename
                 for text_file in zip_file.infolist()
                 if text_file.filename.endswith('.json')}
except BadZipFile:
    zip_file = tarfile.open(archive_path, 'r:gz')
    zip_files = {member.name
                 for member in zip_file.getmembers()
                 if member.name.endswith('.json')}

Note that only .json files will be processed from the archive.