Reading and counting lines of huge files can be easy or a nightmare, depending on your knowledge and tools. In this article you will learn how to ease your life with huge files by using the following tools:

  • sed, a stream editor
  • split - split a file into pieces
  • grep - prints lines that contain a match
  • cat - concatenate files and print
  • jq - like sed for JSON data
  • Python fileinput

Topics in the article:

  • Count number of lines in one large file
  • Count number of lines in multiple files
  • Search huge files by keyword
  • Split huge files
  • Read big files
  • Working with JSON files and JQ
  • Modify large files
  • Using python for huge files

Count number of lines in one large file

You can count the number of lines of huge files (> 30 GB) very quickly with simple commands (on Linux-like systems):

wc -l normal_json.json

result:

16 normal_json.json

where:

-l, --lines - print the newline counts
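
If wc is not available, a rough Python equivalent is the sketch below (the function name count_lines and the chunk size are just illustrative); it counts newlines in fixed-size binary chunks so the file never has to fit in memory:

def count_lines(path, chunk_size=1024 * 1024):
    # Count newline characters by reading the file in fixed-size binary chunks,
    # so even a 30+ GB file never has to fit in memory (same count as wc -l).
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

print(count_lines('normal_json.json'))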

Count number of lines in multiple files

To count the number of lines in multiple files you can pipe several commands together in Linux, for example cat and wc:

cat *.json | wc -l

result:

16

or for CSV files:

cat *.csv | wc -l

result:

5
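
A small illustrative Python sketch of the same idea, summing the line counts of all CSV files in the current folder without loading them into memory:

import glob

# Sum the line counts of every CSV file in the current folder,
# reading each file lazily instead of loading it into memory.
total = 0
for name in glob.glob('*.csv'):
    with open(name) as f:
        total += sum(1 for _ in f)
print(total)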

Search huge files by keyword

Sometimes you need to find the files which contain a given keyword. In this example we list all files where the keyword is present in the file content:

grep -Rl "id" ./

result:

./json_lines3.jl
./temp_pdf.pdf
./json_lines2.jl
./normal_json.json
./new_temp_pdf.pdf

Counting the number of files where the keyword is found:

grep -Rl "id" ./ | wc -l

result:

5

where:

-R, --dereference-recursive - for each directory operand, read and process all files in that directory, recursively, following all symbolic links
-l, --files-with-matches - suppress normal output; instead print the name of each input file from which output would normally have been printed; the scanning of each file stops on the first match (-l is specified by POSIX)
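
A rough Python counterpart of grep -Rl (only a sketch; it skips unreadable files and, unlike -R, does not follow symbolic links) could look like this:

import os

keyword = 'id'
# Walk the current directory tree and print every file whose content
# contains the keyword, stopping at the first match in each file.
for root, _, files in os.walk('.'):
    for name in files:
        path = os.path.join(root, name)
        try:
            with open(path, errors='ignore') as f:
                if any(keyword in line for line in f):
                    print(path)
        except OSError:
            pass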

You can also search and return the lines where a given keyword is found. Here we go through all JSON lines files in the current folder and search for the keyword id in each of them:

cat *.jl | grep id

result:

{"id": 1, "label": "A", "size": "S"}
{"id": 2, "label": "B", "size": "XL"}
{"id": 3, "label": "C", "size": "XXl"}
{"id":1,"label":"A","size":"S"}

Another solution for search in multiple large files is sed:

sed -n -e '/id/p' *.jl

result:

{"id": 1, "label": "A", "size": "S"}
{"id": 2, "label": "B", "size": "XL"}
{"id": 3, "label": "C", "size": "XXl"}
{"id":1,"label":"A","size":"S"}

where:

-n, --quiet, --silent - suppress automatic printing of pattern space
-e script, --expression=script - add the script to the commands to be executed

Split huge files

If you need to split a huge file into smaller chunks, you can use the command split. You can take the count from the previous section and divide the file based on the number of lines. For example, dividing the CSV file into chunks of 5,000,000 lines each:

split -l 5000000 my.csv

You can also split the file by size. For example, dividing the same file into 1 GB chunks:

split -b 1G -d my.csv my_split.csv

the equivalent for MB:

split --bytes=500M my.csv

Note the second argument, my_split.csv, which is optional and defines the naming prefix of the new files.

split also works for JSON lines files:

split -l 300000 my.jl

where:

-l, --lines=NUMBER - put NUMBER lines/records per output file
-b, --bytes=SIZE - put SIZE bytes per output file
-d - use numeric suffixes starting at 0, not alphabetic
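
If split is not available, a hypothetical helper like the one below (the name split_by_lines and the .partN suffix are made up for illustration) does the same line-based chunking in Python:

def split_by_lines(path, lines_per_chunk=5_000_000):
    # Write every lines_per_chunk lines of the input to a new numbered file
    # (my.csv.part0, my.csv.part1, ...), reading the input lazily.
    out, part = None, 0
    with open(path) as infile:
        for i, line in enumerate(infile):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                out = open(f'{path}.part{part}', 'w')
                part += 1
            out.write(line)
    if out:
        out.close()

split_by_lines('my.csv')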

Read big files

You can read YY lines starting from line XX with a command like:

tail -n +XX normal_json.json | head -nYY

Example:

tail -n +5 normal_json.json | head -n5

result:

"id": 1,
"label": "A",
"size": "S"

Working with JSON files and JQ

JQ is advertised as:

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text

It can be used for efficient work with huge JSON files (ending in .json):

  • count the number of records in a JSON array:
jq '. | length' my.json

result:

2374896

Read records XX to YY from a JSON file:

jq '.[2:4]' my.json

result:

{"id":2,"label":"B","size":"XL"}
{"id":3,"label":"C","size":"XXl"}

Search for items which have a given value in a JSON array file and return the whole record:

jq '.[] | select(.id==2)' my.json

result:

{"id":2,"label":"B","size":"XL"}

Search by key and return only one value:

jq '.[] | select(.id==2).label' my.json

result:

"B"

For JSON lines (.jl) files

Reading the first lines of a large JSON lines file. These files usually end in .jl:

tail -n +1 my.jl | head -n1

result:

{"id":2,"label":"B","size":"XL"}

Searching by key in huge .jl files can be done with:

jq 'select(.id==3).size' my.jl

result:

"XXl"

Modify large files

If you need to correct a large file you can use a command like sed. For example, a simple replacement id -> pid:

sed 's/id/pid/' json_lines3.jl > json_lines3_mod.jl

result:

json_lines3_mod.jl

{"pid":1,"label":"A","size":"S"}
{"pid":2,"label":"B","size":"XL"}
{"pid":3,"label":"C","size":"XXl"}

Using python for huge files

Counting and reading the lines of a file with Python can be done in many ways. Here you can find some typical examples:

Count number of lines with python

num_lines = sum(1 for line in open('myfile.txt'))

or:

open('file').read().count('\n')

Note: this reads the whole file into memory and counts newline characters, so it can return a wrong value for e.g. a minified JSON file with no trailing newline. To get the correct value in that case you can change it to:

len(open('file').read().splitlines())

Python read first N lines of huge file

f = 'my.jl'
with open(f) as infile:
    for lineno, line in enumerate(infile):
        if lineno > 1:  # stop after the first two lines (lineno starts at 0)
            break
        print(line)

result:

{"pid":1,"label":"A","size":"S"}
{"pid":2,"label":"B","size":"XL"}

This is the modification to read lines from XX to YY:

with open(f) as infile:
    for lineno, line in enumerate(infile):
        if lineno > 2 and lineno < 5:  # print only lines 4 and 5 (lineno starts at 0)
            print(line)
        if lineno == 5:  # stop once the range has been read
            break

Search a big file with Python:

with open(f) as infile:
    for line in infile:
        if 'my_keyword' in line:
            print(line)

Another good option in Python is the fileinput module, which reads lines lazily:

import fileinput
for line in fileinput.input(f):
    print(line)
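
fileinput also accepts several files at once, which makes it handy for scanning a whole set of files; a small sketch (the file names are just examples):

import fileinput

# Read several files as one stream and print matching lines with their file name.
for line in fileinput.input(['json_lines2.jl', 'json_lines3.jl']):
    if 'id' in line:
        print(fileinput.filename(), line, end='')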
