Reading huge files with Python (personally, in 2019 I count files greater than 100 GB as huge) is a challenging task when you need to do it without enough resources. Python and pandas can read files fast and reliably if you have enough memory. Otherwise you can use a few tricks in order to read and analyze such files.
Python Huge .jl - line separated JSON files
Let's say that you have a huge JSON Lines file with extension .jl (line-separated JSON files have been supported by the pandas method read_json since version 0.19.0):
pd.read_json(huge_json_file, lines=True)
This command reads the .jl file line by line, optimized for resources and performance. If it still runs out of memory, pandas can also return the data in chunks (see the sketch below), or you can split the file as described in the next steps:
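A minimal sketch of the chunked variant, assuming the same huge_json_file.jl and an arbitrarily chosen chunk size: with chunksize, read_json returns an iterator of DataFrames instead of loading everything at once.
import pandas as pd

# chunksize makes read_json return an iterator of DataFrames,
# so only one chunk is held in memory at a time
reader = pd.read_json("huge_json_file.jl", lines=True, chunksize=100000)
for chunk in reader:
    # process each chunk here, e.g. keep only the columns you need
    print(chunk.shape)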
1. Split JSON file into smaller chunks
For really huge files, or when the previous command does not work well, files can be split into smaller ones. The first step is to find out how many lines your JSON Lines file contains, with this Linux command:
wc -l huge_json_file.jl
result:
1245587 huge_json_file.jl
The same can be done with pure Python code:
with open(huge_json_file) as f:
    count_lines = sum(1 for line in f)
count_lines
As you can see, for this file the number of lines is not that big. So most probably there is a field which contains a huge amount of data (the file has roughly 1.2 million lines but is 30 GB in size).
Now, knowing the number of lines, we can split the file into smaller chunks with:
split -l 350000 huge_json_file.jl
result:
xaa, xab, xac, xad
You can use a different syntax for the same command in order to get user-friendly names, or to split by size instead:
split --bytes 200G --numeric-suffixes --suffix-length=2 mydata mydata.
result:
mydata.00, mydata.01
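If the split command is not available (for example on Windows), a similar split can be done in pure Python. This is a minimal sketch, with hypothetical output names chunk_00.jl, chunk_01.jl, ... and the same 350000 lines per chunk:
# pure-Python alternative to the Linux split command
def split_jsonl(path, lines_per_chunk=350000):
    out = None
    chunk = 0
    with open(path) as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                # hypothetical output names: chunk_00.jl, chunk_01.jl, ...
                out = open("chunk_%02d.jl" % chunk, "w")
                chunk += 1
            out.write(line)
    if out:
        out.close()

split_jsonl("huge_json_file.jl")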
2. Load files into pandas and analyze them
The split command generates several smaller files which can be read efficiently and loaded into pandas with:
import pandas as pd
import json

# build a DataFrame from one chunk, parsing each JSON line
with open('/home/user/mydata/xaa') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Once you have read one of them, you can analyze which columns you have and which one is the biggest culprit for the data size.
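A quick way to spot the heavy columns is the DataFrame's memory_usage method; a minimal sketch (deep=True also counts the string/object data):
# per-column memory usage in MB, biggest first
print((df.memory_usage(deep=True) / 1024 ** 2).sort_values(ascending=False))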
You can do optimization for your DataFrame: How to Optimize and Speed Up Pandas
In my case I have a column which holds raw log information that is not needed for my analysis. So I'll drop this column and write the DataFrame as a pickle:
df = df.drop('raw_log', axis=1)
df.to_pickle("/home/user/mydata/xaa.pkl")
With this last operation the DataFrame and the file size dropped drastically, from 30 GB to 15 MB.
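The same drop-and-pickle step has to be repeated for the other chunks; a minimal loop sketch, assuming the split files sit in /home/user/mydata and the heavy column is called raw_log as above:
import json
import pandas as pd

# load each split chunk, drop the heavy column and save it as a pickle
for name in ["xaa", "xab", "xac", "xad"]:
    path = "/home/user/mydata/" + name
    with open(path) as f:
        df = pd.DataFrame(json.loads(line) for line in f)
    df.drop("raw_log", axis=1).to_pickle(path + ".pkl")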
3. Concatenate the files into a single DataFrame
Once the operation has been completed for the rest of the files, we can finally concatenate them into a single DataFrame with:
import pandas as pd
xaa = pd.read_pickle("/home/user/Downloads/xaa.pkl")
xab = pd.read_pickle("/home/user/Downloads/xab.pkl")
xac = pd.read_pickle("/home/user/Downloads/xac.pkl")
xad = pd.read_pickle("/home/user/Downloads/xad.pkl")
df_all = pd.concat([xaa, xab, xac, xad])
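If there are more chunk files, the same can be written as a short loop; a minimal sketch, assuming all pickles follow the xa*.pkl naming from the split above and live in /home/user/Downloads:
import glob
import pandas as pd

# collect every chunk pickle and concatenate into one DataFrame;
# ignore_index=True rebuilds a clean 0..n-1 index
files = sorted(glob.glob("/home/user/Downloads/xa*.pkl"))
df_all = pd.concat((pd.read_pickle(f) for f in files), ignore_index=True)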
Now we have the data loaded into df_all, containing all records (without the raw log information which is not needed). These steps make it possible to read a 30 GB file with Python, analyze it and clean out the data which is not necessary for our work.
P.S. Another really efficient way to work with huge JSON files on Linux is the command-line JSON processor jq. I'll prepare an article about it and add it here.