In this post you can find how to read the first lines, or any other lines, from a large gzip archive. I have a 1.1 GB .gz file which becomes a 65 GB text file after decompression. So let's find out how to do it in the Linux Mint terminal without extracting the file and without external libraries.
I would like to read the first 5 lines and lines between 1000 and 1100. Full decompression takes time and storage, which is not acceptable for my needs.
All solutions use the Linux Mint terminal.
Step 1: Read First Lines From Large Compressed File in Terminal
To start, let's read the first N lines from the large archive. For this purpose we are going to use the command zcat in combination with the command head. The file is available at this path: /data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | head -n 5
result:
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The first command reads the gz file and writes the decompressed text to standard output, so no extraction to disk is needed. Then we pipe it to the command head, which reads 5 lines or any other number.
The first N lines are available instantly.
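To try the zcat and head pipeline safely before pointing it at a multi-gigabyte file, a minimal sketch like the following may help. It builds a tiny throwaway archive first; the temporary directory and the file name sample.txt.gz are assumptions made for this example and are not part of the setup described above.

```shell
# Create a small sample archive in a temporary directory (made-up test data).
workdir=$(mktemp -d)
seq 1 20 | sed 's/^/line /' | gzip > "$workdir/sample.txt.gz"

# Print the first 5 lines; head exits once it has printed them.
zcat "$workdir/sample.txt.gz" | head -n 5

# gzip -dc decompresses to standard output in the same way as zcat.
first_five=$(gzip -dc "$workdir/sample.txt.gz" | head -n 5)
printf '%s\n' "$first_five"
```

Both pipelines print the same five lines, so gzip -dc can be used where zcat is not installed.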
Step 2: Read Specific Lines From Large Compressed Text File
In this step we are going to read specific lines from the same file. Let's say that we need to read lines 5, 7, 11 and 16. This time we are going to use the command zcat in combination with sed:
zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | sed -n '5p;7p;11p;16p;17q'
Here the line numbers to print are given as 5p, 7p and so on.
The parameter 17q at the end makes sed quit at line 17, right after the last requested line. Otherwise sed keeps reading until the end of the stream, which can take a very long time for extremely large files. You need to provide a stop after the last line.
Note: If you run the command without 17q, you can stop the execution with the key combination CTRL + C.
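As a variation on the stop condition, sed can also quit on the last requested line itself instead of on the line after it. The sketch below is only an illustration on a small throwaway archive; the file name sample.txt.gz and the variable names are assumptions made for this example.

```shell
# Create a small sample archive in a temporary directory (made-up test data).
workdir=$(mktemp -d)
seq 1 20 | sed 's/^/line /' | gzip > "$workdir/sample.txt.gz"

# Print lines 5, 7 and 11, then print line 16 and quit while still on it.
# The braces group the two commands p and q under the address 16.
picked=$(zcat "$workdir/sample.txt.gz" | sed -n '5p;7p;11p;16{p;q;}')
printf '%s\n' "$picked"
```

This prints the same four lines as the 17q form, and sed stops reading as soon as line 16 has been printed.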
Step 3: Read Range of Lines From Large Gzip File in Linux Mint
At the end, let's find out how to get a range of lines from the same large gzip file. There are two ways of getting this information:
zcat + sed
zcat + head/tail
So let's assume that we need to read lines from 1100 up to 1105.
Read range of lines with zcat + head/tail
Reading the lines starting at 1100 is possible with the following command:
zcat /mnt/x/Data/HumanGenome/Homo_sapiens.GRCh38.dna.toplevel.fa.gz | tail -n +1100 | head -n 5
result. This can be interpreted as: starting from line 1100, read 5 lines, which gives lines 1100 to 1104 (use head -n 6 to include line 1105 as well):
TTCCCCAGGTCCGGTGTTTTCTTACCCACCTCCTTCCCTCCTTTTTATAATACCAGTGAA
ACTTGGTTTGGAGCATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGG
ATTTTTACAGCTTTTGAAAGAATAAACTCATTTTAAAAACAGGAAAGCTAAGGCCCAGAG
ATTTTTAAATGATATTCCCATGATCACACTGTGAATTTGTGCCAGAACCCAAATGCCTAC
TCCCATCTCACTGAGACTTACTATAAGGACATAAGGCATTTATATATATATATATTATAT
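Because head takes a line count rather than an end line, the count for an inclusive range is END - START + 1. A small helper can do that arithmetic; gz_range_tail is a hypothetical name introduced only for this sketch, and the sample archive is made-up test data.

```shell
# gz_range_tail is a hypothetical helper name used only in this sketch.
# Usage: gz_range_tail FILE START END  (prints lines START..END inclusive)
gz_range_tail() {
    count=$(( $3 - $2 + 1 ))   # inclusive range, e.g. 1100..1105 is 6 lines
    zcat "$1" | tail -n "+$2" | head -n "$count"
}

# Demo on a small throwaway archive.
workdir=$(mktemp -d)
seq 1 20 | sed 's/^/line /' | gzip > "$workdir/sample.txt.gz"
gz_range_tail "$workdir/sample.txt.gz" 3 5
```

For the genome file, the equivalent call would pass 1100 and 1105 as START and END.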
Read range of lines with zcat + sed
Another way of doing the same thing is by piping the zcat and sed commands. This time the sed command has a different syntax:
zcat /mnt/x/Data/HumanGenome/Homo_sapiens.GRCh38.dna.toplevel.fa.gz | sed -n '1100,1105p;1106q'
result:
TTCCCCAGGTCCGGTGTTTTCTTACCCACCTCCTTCCCTCCTTTTTATAATACCAGTGAA
ACTTGGTTTGGAGCATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGG
ATTTTTACAGCTTTTGAAAGAATAAACTCATTTTAAAAACAGGAAAGCTAAGGCCCAGAG
ATTTTTAAATGATATTCCCATGATCACACTGTGAATTTGTGCCAGAACCCAAATGCCTAC
TCCCATCTCACTGAGACTTACTATAAGGACATAAGGCATTTATATATATATATATTATAT
ATACTATATATTTATATATATTACATATTATATATATAATATATATTATATAATATATAT
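The sed range can be wrapped in a reusable function so that the start and end lines are passed as arguments. This is only a sketch: gz_range_sed is a hypothetical name, the sample archive is made-up test data, and the function quits on the END line itself rather than on the line after it.

```shell
# gz_range_sed is a hypothetical helper name used only in this sketch.
# Usage: gz_range_sed FILE START END  (prints lines START..END inclusive)
gz_range_sed() {
    # Print the range START..END, then quit once line END has been printed.
    zcat "$1" | sed -n "${2},${3}p;${3}q"
}

# Demo on a small throwaway archive.
workdir=$(mktemp -d)
seq 1 20 | sed 's/^/line /' | gzip > "$workdir/sample.txt.gz"
gz_range_sed "$workdir/sample.txt.gz" 3 5
```

Unlike the head/tail form, the end line is given directly, so no line count has to be computed.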
Step 4: Read Range of Lines From Large Gzip and Preprocessing
Finally, let's find out how to do some preprocessing of the information if needed. In this case we are going to read only a part of the first line.
This will be done by using the command awk, which prints the second whitespace-separated field ($2) of the first line and then exits:
zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | awk '{ print $2; exit }'
result:
dna:chromosome
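The same idea can go one step further by splitting a field into smaller parts. The sketch below reuses the header line shown in Step 1 inside a tiny throwaway archive; the file name sample.fa.gz and the choice of which parts to print are assumptions made for this example.

```shell
# Create a tiny sample archive that starts with the header line from Step 1.
workdir=$(mktemp -d)
printf '%s\n' '>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF' 'NNNN' 'ACGT' \
    | gzip > "$workdir/sample.fa.gz"

# On the first line only: split the third field on ":" and print
# its second and fifth parts, then exit so the rest is never read.
header_info=$(zcat "$workdir/sample.fa.gz" \
    | awk 'NR == 1 { split($3, part, ":"); print part[2], part[5]; exit }')
printf '%s\n' "$header_info"
```

The exit statement matters for the same reason as q in sed: it stops awk after the first line instead of scanning the whole stream.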