In this post you will see how to read the first lines, or any other lines, from a large gzip archive. I have a 1.1 GB .gz file that expands to a 65 GB text file after decompression. Let's see how to do this in the Linux Mint terminal without extracting the file and without external libraries.

I would like to read the first 5 lines and lines between 1000 and 1100. Decompressing the whole file takes time and disk space, which is not acceptable for my needs.

All solutions are using the Linux Mint terminal.

Step 1: Read First Lines From Large Compressed File in Terminal

To start, let's read the first N lines from the large archive. For this we are going to use the zcat command in combination with head. The file is available at this path: /data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | head -n 5

result:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The first command, zcat, decompresses the gz file to standard output, so nothing is extracted to disk. The output is then piped to head, which prints the first 5 lines (or any other number passed with -n).

The first N lines are available instantly.
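The same pattern can be wrapped in a small helper. This is only a sketch: the function name gz_head and the generated sample file are hypothetical, and gzip -dc is used as an equivalent of zcat.

```shell
# Hypothetical sample archive (20 short lines) so the sketch runs anywhere.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt.gz"
seq 1 20 | sed 's/^/line /' | gzip -c > "$sample"

# gz_head FILE [N]: print the first N lines (default 5) of a gzip file.
# gzip -dc streams the decompressed data to stdout, like zcat;
# nothing is written to disk.
gz_head() {
    gzip -dc "$1" | head -n "${2:-5}"
}

gz_head "$sample" 3
```

With the real archive the call would look like gz_head Homo_sapiens.GRCh38.dna.toplevel.fa.gz 5.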

Step 2: Read Specific Lines From Large Compressed Text File

In this step we are going to read specific lines from the same file. Let's say that we need to read lines: 5, 7, 11, 16. This time we are going to use command zcat in combination with sed:

zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | sed -n '5p;7p;11p;16p;17q'

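The line numbers are given as 5p, 7p and so on, where p prints the line.

The 17q command at the end makes sed quit at line 17, right after the last requested line. Without it, sed keeps reading until the end of the stream, which means decompressing the whole 65 GB file and can take a very long time. Always add a stop right after the last line you need.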

Note: If you run the command without 17q, you can stop it with CTRL + C.
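To avoid forgetting the stop, the sed script can be built from the list of line numbers. The helper below is a sketch under assumptions: gz_lines and the sample file are hypothetical names, and it quits on the largest requested line itself instead of the line after it, which gives the same output because the p command runs before q.

```shell
# Hypothetical sample archive (20 short lines) so the sketch runs anywhere.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt.gz"
seq 1 20 | sed 's/^/line /' | gzip -c > "$sample"

# gz_lines FILE N1 [N2 ...]: print the given line numbers, then stop reading.
gz_lines() {
    file=$1
    shift
    script=""
    max=0
    for n in "$@"; do
        script="${script}${n}p;"      # print line n
        if [ "$n" -gt "$max" ]; then  # remember the largest line number
            max=$n
        fi
    done
    # For 5 7 11 16 this runs: sed -n '5p;7p;11p;16p;16q'
    gzip -dc "$file" | sed -n "${script}${max}q"
}

gz_lines "$sample" 5 7 11 16
```

The lines are always printed in file order, whatever order the numbers are passed in.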

Step 3: Read Range of Lines From Large Gzip File in Linux Mint

Next, let's see how to get a range of lines from the same large gzip file. There are two ways of getting this information:

  • zcat + sed
  • zcat + head/tail

So let's assume that we need to read lines from 1100 up to 1105.

Read range of lines with zcat + head/tail

Reading the lines between 1100 and 1105 is possible with the following command:

zcat /mnt/x/Data/HumanGenome/Homo_sapiens.GRCh38.dna.toplevel.fa.gz | tail -n +1100 | head -n 5

result. This can be interpreted as: starting from line 1100, read the next 5 lines. Note that this returns lines 1100 to 1104; use head -n 6 to include line 1105:

TTCCCCAGGTCCGGTGTTTTCTTACCCACCTCCTTCCCTCCTTTTTATAATACCAGTGAA
ACTTGGTTTGGAGCATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGG
ATTTTTACAGCTTTTGAAAGAATAAACTCATTTTAAAAACAGGAAAGCTAAGGCCCAGAG
ATTTTTAAATGATATTCCCATGATCACACTGTGAATTTGTGCCAGAACCCAAATGCCTAC
TCCCATCTCACTGAGACTTACTATAAGGACATAAGGCATTTATATATATATATATTATAT
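The number passed to head is a line count, not an end line, so an inclusive range needs END - START + 1 lines. A minimal sketch of that arithmetic, assuming a hypothetical helper name gz_range_tail and a small generated sample file:

```shell
# Hypothetical sample archive (20 short lines) so the sketch runs anywhere.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt.gz"
seq 1 20 | sed 's/^/line /' | gzip -c > "$sample"

# gz_range_tail FILE START END: print lines START..END (inclusive).
gz_range_tail() {
    count=$(( $3 - $2 + 1 ))   # inclusive range: END - START + 1 lines
    # tail -n +START skips everything before line START,
    # head -n COUNT stops after COUNT lines.
    gzip -dc "$1" | tail -n +"$2" | head -n "$count"
}

gz_range_tail "$sample" 11 16
```

For lines 1100 to 1105 the count works out to 6.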

Read range of lines with zcat + sed

Another way of doing the same thing is to pipe zcat into sed. This time sed takes a range, 1100,1105p, followed by 1106q to stop reading after the range:

zcat /mnt/x/Data/HumanGenome/Homo_sapiens.GRCh38.dna.toplevel.fa.gz | sed -n '1100,1105p;1106q'

result:

TTCCCCAGGTCCGGTGTTTTCTTACCCACCTCCTTCCCTCCTTTTTATAATACCAGTGAA
ACTTGGTTTGGAGCATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGG
ATTTTTACAGCTTTTGAAAGAATAAACTCATTTTAAAAACAGGAAAGCTAAGGCCCAGAG
ATTTTTAAATGATATTCCCATGATCACACTGTGAATTTGTGCCAGAACCCAAATGCCTAC
TCCCATCTCACTGAGACTTACTATAAGGACATAAGGCATTTATATATATATATATTATAT
ATACTATATATTTATATATATTACATATTATATATATAATATATATTATATAATATATAT
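The sed form takes the start and end lines directly, so it is easy to parameterize. The sketch below assumes a hypothetical helper name gz_range_sed and a small generated sample file; it quits on the end line itself, which prints the same lines as quitting on the line after it.

```shell
# Hypothetical sample archive (20 short lines) so the sketch runs anywhere.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt.gz"
seq 1 20 | sed 's/^/line /' | gzip -c > "$sample"

# gz_range_sed FILE START END: print lines START..END (inclusive).
gz_range_sed() {
    # Double quotes let the shell expand the variables inside the sed script.
    # For 11 16 this runs: sed -n '11,16p;16q'
    gzip -dc "$1" | sed -n "${2},${3}p;${3}q"
}

gz_range_sed "$sample" 11 16
```

With the real archive the call would look like gz_range_sed Homo_sapiens.GRCh38.dna.toplevel.fa.gz 1100 1105.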

Step 4: Read Range of Lines From Large Gzip and Preprocessing

Finally, let's see how to do some preprocessing of the information if needed. In this case we are going to read only a part of the first line.

This is done with awk, which prints the second whitespace-separated field ($2) and exits after the first line:

zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz |  awk '{ print $2; exit }'

result:

dna:chromosome
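awk can also combine the range selection with the processing step. As an illustration only, the sketch below prints the line number and length of each line in a range; the helper name gz_range_lengths and the tiny sample file are hypothetical.

```shell
# Hypothetical sample archive: one header line and three sequence lines.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.fa.gz"
printf '%s\n' \
    '>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF' \
    'NNNNNNNNNN' \
    'ACGTACGT' \
    'TTCCCCAGGTCC' | gzip -c > "$sample"

# gz_range_lengths FILE START END: for lines START..END (inclusive),
# print "line number: line length".
gz_range_lengths() {
    gzip -dc "$1" | awk -v start="$2" -v end="$3" '
        NR >= start { print NR ": " length($0) }  # process lines in range
        NR >= end   { exit }                      # stop reading after END
    '
}

gz_range_lengths "$sample" 2 3
```

The exit rule plays the same role as q in sed: awk stops reading as soon as the last needed line has been processed.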
