Part 5 - SALSA2 Scaffolding
SALSA2 Scaffolding
Right, so now you have learned about Chromosome conformation capture and scaffolding genomes with the Hi-C technology. Today you will start one of the softwares that is used to scaffold genomes with Hi-C that is called SALSA2. You can have a look here on what we will be running.
First you need to activate our SALSA environment:
conda activate salsa
Then run:
run_pipeline.py -h
See the help message? Nice!
Now let’s create a scaff
subdirectory in your species home directory so that we can work on scaffolding from there:
mkdir ~/<species_folder>/scaff/
cd ~/<species_folder>/scaff/
To run SALSA2 you need first to symlink 3 files to your working directory. All files will be inside the shared species directory, in a subdirectory called HiC
(/home/ubuntu/Share/<species_id>_data/HiC/
). The files are <species_id>_*purged.polish.fa
, <species_id>_*purged.polish.fa.fai
and merge.mkdup.bed
. Once you have symlinked all files, run:
run_pipeline.py -a <species_id>_*purged.polish.fa -l <species_id>_*purged.polish.fa.fai -b merge.mkdup.bed -e GATC,GANTC -i 5 -p yes -o out
Attention
Remember to replace<species_id>
with the ID of your species. For instance, for Vanessa atalanta that would beilVanAtal1
.
What exactly are you scaffolding? You will notice that the contigs you are scaffolding are called polish. In fact, if/when you run a summary statistics for them you will see that they have a different number of bases compared with the purged assembly you analyzed. Why is that? That is because at Darwin Tree of Life we used to polish the purged genomes before scaffolding (remember I showed you this in the lecture?). We have dropped polishing now, but you are working with data that was polished. :)
Right, SALSA2 will start running. However, since SALSA2 run should take a long time, we’ll stop SALSA2 run now with Ctr+C
. From this point on we’ll be using SALSA2 results generated in advance by your instructors. Those files can be found under /home/ubuntu/Share/<species_id>_data/HiC/out.break.salsa/
directory.
Now let’s generate assembly statistics for the genome prior Hi-C scaffolding, and after Hi-C scaffolding.
First let’s add the folder that contains asmstats
script to our PATH
:
export PATH=$PATH:/home/ubuntu/Share/scripts
And make sure our conda environment is activated:
conda activate eukaryotic_genome_assembly
Then symlink the file from the genome after Hi-C scaffolding to your working directory:
ln -s /home/ubuntu/Share/<species_id>_data/HiC/out.break.salsa/scaffolds_FINAL.fasta .
Now we can run asmstats
for both the genome prior Hi-C scaffolding (<species_id>_hicanu.purged.polish.fa
) and after Hi-C scaffolding (scaffolds_FINAL.fasta
):
asmstats <species_id>*.purged.polish.fa > <species_id>*.purged.polish.fa.stats
asmstats scaffolds_FINAL.fasta > scaffolds_FINAL.fasta.stats
Now I want you to download to your local machine the assembly Hi-C heatmaps for (i) prior and (ii) after scaffolding.
1-) Hi-C heatmap of your species contigs before they were scaffolded with SALSA2
The pre-scaffolding is a file that ends with *.pretext (not for drUrtUren1, see next sentence!) and it is located in your species shared directory (/home/ubuntu/Share/<species_id>_data/HiC/<species_id>.preScaf.pretext
).
The pre-scaffolding file for drUrtUren1 ends with *.hic and it is located in your species shared directory (/home/ubuntu/Share/<species_id>_data/HiC/drUrtUren1.preScaf.hic
). This is to be open at Juicer.
2-) Hi-C heatmap of your species scaffolds after the contigs were scaffolded with SALSA2
The post-scaffolding will be in the out.break.salsa
directory (/home/ubuntu/Share/<species_id>_data/HiC/out.break.salsa/
) and it is a file that ends in *.hic. This file is a heatmap image representation of the final scaffolded assembly (scaffolds_FINAL.fasta
).
Once you download the two files (pre and after scaffolding) to your local computer, you are going to use the PretextView program to open the pre-scaffolding heatmap, and the JuiceBox program to open the post-scaffolding heatmap. You should already have PretextView installed in your local computer. If not, follow this link to install it. You can run Juicebox using its website or (optionally) you can also run it locally. If you choose to run Juicebox locally, you will need to access this link to download the executable for Juicebox compatible with your operating system, and after you download it, just double click on the executable file to open Juicebox.
Using Juicebox
Using the webserver
Go do the Juicebox’s website. Click on Load Map > Local File
and locate your downloaded *.hic file.
Using locally installed Juicebox (optional)
Once you open Juicebox, click on File > Open
. A new window will open, then you should click on the Local...
button. Now you are ready to select the downloaded *.hic file. Once you do it, the heatmap should be plotted.
Using PretextView
After opening PretextView, click on the Load Map
button and then select the downloaded *.pretext file.
Now
Analyze all the results, discuss with your team and answer the questions in your presentation:
- What are the assembly statistics before scaffolding?
- What are the assembly statistics after scaffolding?
- Looking at the final agp file (
/home/ubuntu/Share/<species_id>_data/HiC/out.break.salsa/scaffolds_FINAL.agp
) (maybe you want to read a little about the AGP format here): what scaffolds were scaffolded with more than one component (a contig of a piece of a contig)? And what scaffolds are unchanged in relation to the contigs? - How do the Hi-C maps look prior and after scaffolding?
- Do you think you see a sex chromosome on the Hi-C heatmap? If so, point to it in your presentation.
After that, come back to the wider group.