Hands On
In our previous tutorial we have generated a k=31 kmer distribution for a subset of reads, and we have analysed the same k=31 kmer distribution for the total reads sequenced for our chosen species. With the genomescope of our total results, we have estimated genome size, heterozygosity and repeat content. Now, we are ready to go and assemble those reads.
Well, it would be nice to check one more thing: the size distribution of our reads!
Imagine if we have a good genome coverage but our reads are all small? It would not be very useful to assemble across repeats
So let’s plot the size distribution of our
cp /home/ubuntu/Share/scripts/plot_fasta_length.py .
Now we can run the script:
python plot_fasta_length.py <species>.ccs.total.fasta.gz <species>.ccs.total.fasta.length.png
This might take a bit of time. So if you prefer, you can use tools such as nohup
or screen
to run the process in the background (e.g. nohup <command> &
).
Note: My script
plot_fasta_length.py
and also Shane’s scriptasmstats
are two of many options you can use to plot fasta lengths and generate statistics. Remember, internet is on your side in this journey. While you wait the script to run, have a look at other options: like this one or the stats from BBMap
Once you have plotted the reads length distribution, download it to your local computer and have a look.
a- How does the reads distribution look like?
b- Which length most of the reads have?
Now, gather all the results you have so far and let’s discuss them as a group!!