Chapter 8 Hybrid Assembly Methods for Short and Long Reads
Objectives
- Understand the advantages of hybrid genome assemblies
- Perform hybrid genome assembly using Unicycler
- Evaluate and compare assembly quality metrics
With sequencing now cheaper and sequencers more accessible, we can also include long reads in our assembly. Long reads give the assembler longer regions in which the different fragments can anchor and connect, creating, in theory, longer scaffolds and a more complete assembly. However, long reads usually have lower per-base quality than short reads and are less abundant, so using them alone may lead to an assembly with a larger percentage of errors than, for example, an assembly built only from short reads.
The best of both worlds can be combined in what we call hybrid assemblies. Hybrid assemblies use both short and long reads to produce a high-quality assembly (compared to an assembly of long reads only) with longer and fewer scaffolds than assemblies built only from short reads.
To perform hybrid genome assemblies, we need two main files:
- Our Illumina reads (Choose between raw reads or cleaned reads)
- Our Nanopore reads (located in the `/course_data/BIOL209/nanopore_raw_data/` folder)
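Before assembling, it can help to confirm that both read sets are readable and roughly the size you expect. The following is a minimal sketch, assuming gzip-compressed FASTQ files with the usual four lines per record; the function name `count_fastq_reads` is made up for this example.

```shell
# count_fastq_reads FILE.fastq.gz
# Prints the number of reads, assuming exactly 4 lines per FASTQ record.
count_fastq_reads() {
  gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

For example, `count_fastq_reads illumina_reads_forward.fastq.gz` and the same call on the reverse file should print the same number for a properly paired dataset.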
8.1 Hybrid assembly using Unicycler
The program `Unicycler` is a combination of programs (also called a pipeline) that performs a hybrid assembly. Usually, a manual hybrid assembly (i.e. using each program separately) entails the following steps:
- `SPAdes` to make the short-read assemblies
- `miniasm` and `racon` to bridge between the scaffolds from the short-read assemblies
- `bwa` or `bowtie` to map the reads to the new elongated scaffolds for error control
- `pilon` to correct the errors detected by the previous read-mapping step
However, for prokaryotic organisms such as the ones we are trying to assemble, `Unicycler` will do everything mentioned above and more, so hopefully we will get the best assembly possible using all of the available data.
To run `Unicycler` with the long-read and short-read datasets, modify the following command with your FASTQ reads:
/Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py --spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py --racon_path=/Smaug_SSD/bin/racon/build/bin/racon -1 illumina_reads_forward.fastq.gz -2 illumina_reads_reverse.fastq.gz -l /course_data/BIOL209/nanopore_raw_data/longread.fastq.gz -o unicycler_hyb_S167 -t 8
Where:
- `--spades_path=` is where the SPAdes binary is located
- `--racon_path=` is the location of Racon, a program that ‘generates genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.’
- `-1` and `-2` are the forward and reverse Illumina reads
- `-l` is the file with the Nanopore/PacBio reads
- `-o` is the output directory
- `-t` is the number of computing processors (threads) to use
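The same command can be easier to edit when it is split over several lines and wrapped in a small function. This is only a sketch: the tool paths are copied from the command above, while the helper names `run` and `hybrid_assembly` and the `DRY_RUN` switch are hypothetical conveniences added for this example.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# hybrid_assembly FORWARD REVERSE LONG OUTDIR [THREADS]
# Calls Unicycler with the same options as the one-line command above.
hybrid_assembly() {
  run /Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py \
    --spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py \
    --racon_path=/Smaug_SSD/bin/racon/build/bin/racon \
    -1 "$1" -2 "$2" -l "$3" -o "$4" -t "${5:-8}"
}
```

Setting `DRY_RUN=1` before calling `hybrid_assembly` prints the full command so you can check the file names before committing to a long run.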
- Let Unicycler run. Answer the following questions:
- Fill the following table using the information from the hybrid assembly and your previous best short-read assembly (hint: use the output from `stats.sh`):
| Source | Number of scaffolds | %GC | N50 | L90 |
|---|---|---|---|---|
| Short read assembly | | | | |
| Hybrid assembly | | | | |
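If you want to cross-check the numbers you copy into the table, the metrics can also be computed directly from a FASTA file. The sketch below is not a replacement for `stats.sh`; it assumes the common definitions (N50 is the length of the scaffold at which the cumulative length, sorted from longest to shortest, reaches half of the assembly; L90 is the number of scaffolds needed to reach 90%), and the function name `fasta_stats` is made up for this example.

```shell
# fasta_stats FILE.fasta
# Prints scaffold count, %GC, N50 and L90 for a FASTA file.
fasta_stats() {
  awk '
    # On each header, emit length and GC count of the previous record.
    /^>/ { if (len > 0) print len, gc; len = 0; gc = 0; next }
    {
      seq = toupper($0)
      len += length(seq)
      gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases
    }
    END { if (len > 0) print len, gc }
  ' "$1" | sort -k1,1nr | awk '
    { lens[NR] = $1; total += $1; gc += $2 }
    END {
      if (NR == 0) { print "no sequences found"; exit 1 }
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (!n50 && run >= total * 0.5) n50 = lens[i]
        if (!l90 && run >= total * 0.9) l90 = i
      }
      printf "scaffolds=%d gc=%.2f n50=%d l90=%d\n", NR, 100 * gc / total, n50, l90
    }
  '
}
```

Running it on both assemblies gives one line per assembly that can be compared side by side. Small differences from `stats.sh` are possible, for instance in how each tool names or rounds its metrics.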
- Use a Sketch assay to determine the taxonomic makeup of your long read assembly and answer the following questions for the hybrid assembly:
  - Which genera have the highest WKID scores?
  - What is the WKID score of your assembly when compared to Synechococcus?
  - Summarize your results and what they mean.
- Use `blobtools` to identify the taxonomic makeup of your long read assembly and answer the following questions. (Only map your short reads to the new assembly. Remember to run `bwa index long_read_assembly.fasta` before you map the reads using `bwa mem`.)
  - Include the image here:
  - What is the taxonomic unit with the highest percentage of scaffolds in our assembly?
  - What is the percentage of scaffolds assigned to Synechococcus?
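The indexing and mapping steps mentioned in the blobtools exercise can be sketched as a small function. This assumes `bwa` is on your `PATH`; the names `run` and `map_short_reads`, the `DRY_RUN` switch, and the output prefix are hypothetical and only there to keep the example easy to check before running it.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# map_short_reads ASSEMBLY FORWARD REVERSE PREFIX [THREADS]
# Indexes the assembly, then maps the paired short reads to it,
# writing the alignments to PREFIX.sam.
map_short_reads() {
  run bwa index "$1"
  run bwa mem -t "${5:-8}" -o "$4.sam" "$1" "$2" "$3"
}
```

For example, `map_short_reads long_read_assembly.fasta illumina_reads_forward.fastq.gz illumina_reads_reverse.fastq.gz hybrid_map` would leave the alignments in `hybrid_map.sam`. The same function can be reused later with a different assembly file.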
Finally, extract all the contigs/scaffolds with a Cyanobacteria ID. Save these scaffolds in a file called `final_cyano_assembly.fasta`.
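One way to do the extraction is to filter the FASTA file by scaffold name. The sketch below assumes you have already written the names of the Cyanobacteria scaffolds (for example, taken from your blobtools results) to a plain-text file with one name per line; the file name `cyano_ids.txt` and the function name `extract_by_id` are hypothetical.

```shell
# extract_by_id IDS.txt ASSEMBLY.fasta
# Prints only the FASTA records whose name (first word after ">")
# appears in IDS.txt.
extract_by_id() {
  awk '
    # First file: remember every non-empty ID.
    FILENAME == ARGV[1] { if (NF) keep[$1] = 1; next }
    # Second file: on each header decide whether to print the record.
    /^>/ { id = substr($1, 2); printing = (id in keep) }
    printing
  ' "$1" "$2"
}
```

A call such as `extract_by_id cyano_ids.txt long_read_assembly.fasta > final_cyano_assembly.fasta` would then write the selected scaffolds to the requested file.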
Map the short reads to `final_cyano_assembly.fasta` using `BWA`. NOTE: Please remember to run the `bwa index final_cyano_assembly.fasta` command first.
Use the `pileup.sh` program to summarize the coverage results. Save the number of mapped reads and the percentage of reads mapped.
- Use the `pileup.sh` program to summarize the coverage results and fill the following table.
| Dataset | Number of reads | Percentage of mapped reads | Average coverage |
|---|---|---|---|
| Short read assembly | | | |
| Hybrid assembly | | | |
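If you saved a per-scaffold coverage table from `pileup.sh`, the average coverage for the whole assembly can be recomputed as a length-weighted mean. This is a sketch under assumptions: it expects a tab-separated table with the scaffold ID in column 1, the average fold coverage in column 2 and the scaffold length in column 3, with header lines starting with `#`. Check the header of your own file before relying on it. The function name `mean_coverage` is made up for this example.

```shell
# mean_coverage TABLE.txt
# Length-weighted mean of the per-scaffold average coverage.
# Assumed columns: 1 = scaffold ID, 2 = average fold coverage, 3 = length.
mean_coverage() {
  awk -F '\t' '
    /^#/ { next }                       # skip header lines
    { cov += $2 * $3; len += $3 }       # weight each scaffold by its length
    END {
      if (len > 0) printf "%.2f\n", cov / len
      else print "NA"
    }
  ' "$1"
}
```

Weighting by length matters because a few short, high-coverage scaffolds would otherwise pull a simple mean upwards.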
- Did your long reads help improve the assembly? Justify your answer.