Reading and counting lines of huge files can be easy or a nightmare - it depends on your knowledge and tools. In this article you can learn how to ease your work with huge files by using the following tools:
- sed - a stream editor
- split - split a file into pieces
- grep - print lines that contain a match
- cat - concatenate files and print
- JQ - is like sed for JSON data
- python fileinput
Topics in the article:
- Count number of lines in one large file
- Count number of lines in multiple files
- Search huge files by keyword
- Split huge files
- Read big files
- Working with JSON files and JQ
- For JL JSON lines files
- Modify large files
- Using python for huge files
- References
Count number of lines in one large file
You can count the number of lines of huge files (> 30 GB) very fast with simple commands (on Linux-like systems), for example:
wc -l normal_json.json
result:
16 normal_json.json
where:
-l, --lines - print the newline counts
Count number of lines in multiple files
To count the number of lines in multiple files you can pipe several commands in Linux, for example cat and wc:
cat *.json | wc -l
result:
16
or for csv files:
cat *.csv | wc -l
result:
5
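If you prefer to stay in Python, a minimal sketch of the same idea - summing the line counts of all CSV files in the current folder - could look like this (the *.csv pattern is just the example from above):
import glob
total = 0
for path in glob.glob('*.csv'):
    # count the lines of every CSV file and add them to the total
    with open(path, encoding='utf-8') as infile:
        total += sum(1 for _ in infile)
print(total)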
Search huge files by keyword
Sometimes you need to find the files which have a given keyword in their content. In this example we list all files where a keyword is present:
grep -Rl "id" ./
result:
./json_lines3.jl
./temp_pdf.pdf
./json_lines2.jl
./normal_json.json
./new_temp_pdf.pdf
Counting the number of files where the keyword is found:
grep -Rl "id" ./ | wc -l
result:
5
Where:
-R, --dereference-recursive - For each directory operand, read and process all files in that directory, recursively, following all symbolic links.
-l, --files-with-matches - Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning of each file stops on the first match. (-l is specified by POSIX.)
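For comparison, a rough Python equivalent of grep -Rl - listing the files under the current folder that contain a keyword - could be sketched like this (the keyword id is just the example from above):
import os
keyword = 'id'
# walk the current folder recursively and print every file that contains the keyword
for root, dirs, files in os.walk('.'):
    for name in files:
        path = os.path.join(root, name)
        try:
            with open(path, encoding='utf-8', errors='ignore') as infile:
                if any(keyword in line for line in infile):
                    print(path)
        except OSError:
            # skip files that cannot be opened
            pass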
You can also search and return the lines where a given keyword is found. The example below scans all JSON lines files in the current folder and searches for the keyword id in each of them:
cat *.jl | grep id
result:
{"id": 1, "label": "A", "size": "S"}
{"id": 2, "label": "B", "size": "XL"}
{"id": 3, "label": "C", "size": "XXl"}
{"id":1,"label":"A","size":"S"}
Another solution for searching in multiple large files is sed:
sed -n -e '/id/p' *.jl
result:
{"id": 1, "label": "A", "size": "S"}
{"id": 2, "label": "B", "size": "XL"}
{"id": 3, "label": "C", "size": "XXl"}
{"id":1,"label":"A","size":"S"}
where:
-n, --quiet, --silent - suppress automatic printing of pattern space
-e script, --expression=script - add the script to the commands to be executed
Split huge files
If you need to split a huge file into smaller chunks you can use the command split. You can take the line count from the previous section and divide the file based on the number of lines. For example, dividing the csv file into chunks of 5,000,000 lines each:
split -l 5000000 my.csv
You can also split the file by size. For example, dividing the same file into 1 GB chunks:
split -b 1G -d my.csv my_split.csv
and the same using a size in MB:
split --bytes=500M my.csv
The second argument - my_split.csv - is optional and is used as the prefix (naming pattern) for the new files.
split also works for JSON lines files:
split -l 300000 my.jl
where:
-l, --lines=NUMBER - put NUMBER lines/records per output file
-b, --bytes=SIZE - put SIZE bytes per output file
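A simple Python sketch of the same idea - writing a fixed number of lines per output file - might look like this; my.csv and the chunk size of 5,000,000 lines are just the example values from above, and the my_split_ prefix is only an illustration:
lines_per_chunk = 5_000_000
chunk = 0
out = None
with open('my.csv', encoding='utf-8') as infile:
    for lineno, line in enumerate(infile):
        # open a new output file every lines_per_chunk lines
        if lineno % lines_per_chunk == 0:
            if out:
                out.close()
            out = open(f'my_split_{chunk:03d}.csv', 'w', encoding='utf-8')
            chunk += 1
        out.write(line)
if out:
    out.close()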
Read big files
You can read YY lines starting from line XX with a command like:
tail -n +XX normal_json.json | head -nYY
Example:
tail -n +5 normal_json.json | head -n5
result:
"id": 1,
"label": "A",
"size": "S"
Working with JSON files and JQ
JQ is advertised as:
jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
It can be used for efficient work with huge JSON files (files ending in .json):
- count the number of records (elements of the top-level JSON array):
jq '. | length' my.json
result:
2374896
JQ read elements XX to YY from the JSON array (a slice):
jq '.[2:4]' my.json
result:
{"id":2,"label":"B","size":"XL"}
{"id":3,"label":"C","size":"XXl"}
Search for items which have a given value in a JSON array file and return the whole record:
jq '.[] | select(.id=="2")' my.json
result:
{"id":2,"label":"B","size":"XL"}
Search by key and return only one value:
jq '.[] | select(.id=="2").label' my.json
result:
"B"
For JL JSON lines files
Reading the first lines of a large JSON lines file. These files usually end in .jl:
tail -n +1 my.jl | head -n1
result:
{"id":2,"label":"B","size":"XL"}
Searching by key in huge .jl files can be done by:
jq 'select(.id==3).size' my.jl
result:
"XXl"
Modify large files
If you need to correct large file you can use command like sed. For example simple replacement: id
-> pid
:
sed 's/id/pid/' json_lines3.jl > json_lines3_mod.jl
result (json_lines3_mod.jl):
{"pid":1,"label":"A","size":"S"}
{"pid":2,"label":"B","size":"XL"}
{"pid":3,"label":"C","size":"XXl"}
Using python for huge files
Counting the number of lines in a file with Python can be done in many ways. Here you can find some typical examples:
Count number of lines with python
num_lines = sum(1 for line in open('myfile.txt'))
or:
open('file').read().count('\n')
Note: this reads the whole file into memory and might return a value that is off by one (for example for a JSON file that does not end with a newline). A safer variant is:
len(open('file').read().splitlines())
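For really big files, reading fixed-size binary chunks and counting the newline characters keeps memory usage low; a small sketch (the helper name count_lines is just an illustration):
def count_lines(path, chunk_size=1 << 20):
    # read the file in 1 MB binary chunks and count the newline characters
    count = 0
    with open(path, 'rb') as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count
print(count_lines('myfile.txt'))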
Python read first N lines of huge file
f = 'my.jl'
with open(f) as infile:
    for lineno, line in enumerate(infile):
        # stop after the first two lines (lineno is 0-based)
        if lineno > 1:
            break
        print(line)
result:
{"pid":1,"label":"A","size":"S"}
{"pid":2,"label":"B","size":"XL"}
and this is the modification for reading lines from XX to YY:
with open(f) as infile:
    for lineno, line in enumerate(infile):
        # print only the lines with 0-based index 3 and 4
        if lineno > 2 and lineno < 5:
            print(line)
        # stop reading once past the range
        if lineno == 5:
            break
Search a big file with Python:
with open(f) as infile:
    for line in infile:
        # print every line that contains the keyword
        if 'my_keyword' in line:
            print(line)
Another good option for Python might be fileinput:
import fileinput
# iterate over the file line by line without loading it into memory
for line in fileinput.input(f):
    print(line)
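fileinput can also iterate over several files as if they were a single stream, which pairs nicely with glob; a small sketch (the *.jl pattern is just an example):
import fileinput
import glob
# iterate over all JSON lines files in the current folder as one stream
for line in fileinput.input(files=glob.glob('*.jl')):
    print(line, end='')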