Chapter 8 Hybrid Assembly Methods for Short and Long Reads
Objectives
- Understand the advantages of hybrid genome assemblies
- Perform hybrid genome assembly using Unicycler
- Evaluate and compare assembly quality metrics
Now that sequencing is cheaper and genome sequencers are easier to access, we can also include long reads in our assembly. Long reads give the assembler longer regions in which the different fragments can anchor and connect, creating, in theory, longer scaffolds and a more complete assembly. However, long reads usually have lower per-base quality than short reads and are less abundant, so using them alone may lead to an assembly with a larger percentage of errors than, for example, an assembly built from short reads only.
Hybrid assemblies combine the best of both worlds. They use both short and long reads to produce a high-quality assembly (compared to an assembly of long reads only) with longer and fewer scaffolds than an assembly built from short reads only.
To perform hybrid genome assemblies, we need two main files:
- Our Illumina reads (Choose between raw reads or cleaned reads)
- Our Nanopore reads (located in the `/course_data/BIOL209/nanopore_raw_data/` folder)
8.1 Hybrid assembly using Unicycler
Unicycler is a combination of programs (also called a pipeline) that performs a hybrid assembly. A manual hybrid assembly (i.e. running each program separately) usually entails the following steps:
- `SPAdes` to make the short read assemblies
- `miniasm` and `racon` to bridge between the scaffolds from the short read assemblies
- `bwa` or `bowtie` to map the reads to the new elongated scaffolds for error control
- `pilon` to correct the errors detected by the previous read mapping step
However, for prokaryotic organisms such as the ones we are trying to assemble, Unicycler does everything mentioned above and more, so we should get the best assembly possible from all the data available.
To run Unicycler with the long read and short read datasets, modify the following command with your FASTQ reads:
/Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py --spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py --racon_path=/Smaug_SSD/bin/racon/build/bin/racon -1 illumina_reads_forward.fastq.gz -2 illumina_reads_reverse.fastq.gz -l /course_data/BIOL209/nanopore_raw_data/longread.fastq.gz -o unicycler_hyb_S167 -t 8
Where:
- `--spades_path=` is where the SPAdes executable is located
- `--racon_path=` is the location of Racon, a program that ‘generates genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.’
- `-1` and `-2` are the Illumina reads (forward and reverse)
- `-l` are the Nanopore/PacBio reads
- `-o` is the output directory
- `-t` is the number of computing processors (threads) to use
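Because the command is long, it can help to wrap it so that the input files are checked before the run starts. The following is only a sketch: the function name `run_unicycler` and the `DRY_RUN` switch are our own inventions, not part of Unicycler, and the program paths are the ones from the command above.

```shell
# Hypothetical helper (not part of Unicycler): checks the inputs, then runs
# the same command shown above with the file names filled in.
# Arguments: forward reads, reverse reads, long reads, output folder, threads.
run_unicycler() {
    for f in "$1" "$2" "$3"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            return 1
        fi
    done
    # Rebuild the argument list as the full Unicycler command line.
    set -- /Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py \
        --spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py \
        --racon_path=/Smaug_SSD/bin/racon/build/bin/racon \
        -1 "$1" -2 "$2" -l "$3" -o "$4" -t "${5:-8}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # assumption: DRY_RUN=1 only prints the command
    else
        "$@"
    fi
}
```

Setting `DRY_RUN=1` before calling `run_unicycler` prints the command instead of running it, which is a cheap way to check the paths first.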
- Let Unicycler run. Answer the following questions:
  - Fill the following table using the information from the hybrid assembly and your previous best short read assembly (Hint: use the output from `stats.sh`):
| Source | Number of scaffolds | %GC | N50 | L90 |
|---|---|---|---|---|
| Short read assembly | | | | |
| Hybrid assembly | | | | |
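To see what the columns of this table mean, the metrics can be computed by hand. The sketch below is not a replacement for `stats.sh`; the function name `assembly_stats` is hypothetical. It assumes the common convention in which N50 is a length (the scaffold length at which the running total, longest first, reaches 50% of the assembly) and L90 is a count (the number of scaffolds needed to reach 90%). Some tools swap the N and L labels, so check which convention your `stats.sh` output follows before comparing numbers.

```shell
# Hypothetical helper: prints scaffold count, total size, %GC, N50 and L90
# for a FASTA file given as the first argument.
assembly_stats() {
    # Step 1: one line per sequence with its length and its G+C count.
    awk '/^>/ { if (len) print len, gc; len = 0; gc = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0); gc += gsub(/[GCgc]/, "") }
         END { if (len) print len, gc }' "$1" |
    # Step 2: longest sequences first.
    sort -k1,1nr |
    # Step 3: walk down the sorted lengths, accumulating the total.
    LC_ALL=C awk '{ seqlen[NR] = $1; total += $1; gcsum += $2 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += seqlen[i]
                 if (!n50 && cum >= total * 0.5) n50 = seqlen[i]
                 if (!l90 && cum >= total * 0.9) l90 = i
             }
             pct = total ? 100 * gcsum / total : 0
             printf "scaffolds=%d total_bp=%d GC=%.2f N50=%d L90=%d\n", NR, total, pct, n50, l90
         }'
}
```

Running `assembly_stats` on both the short read assembly and the hybrid assembly gives two lines that can be compared directly.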
- Use a Sketch assay to determine the taxonomic makeup of your long read assembly and answer the following questions for the hybrid assembly:
  - Which genera have the highest WKID scores?
  - What is the WKID score of your assembly when compared to Synechococcus?
  - Summarize your results and what they mean.
- Use `blobtools` to identify the taxonomic makeup of your long read assembly and answer the following questions (only map your short reads to the new assembly; remember to run `bwa index long_read_assembly.fasta` before you map the reads using `bwa mem`):
  - Include the image here:
  - What is the taxonomic unit with the highest percentage of scaffolds in our assembly?
  - What is the percentage of scaffolds assigned to Synechococcus?
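The index-then-map step mentioned above can be sketched as a small function. This is an assumption-laden outline, not the course's official command: the function name `map_short_reads` is hypothetical, and it assumes `samtools` is installed to sort and index the alignments, which the text above does not mention.

```shell
# Hypothetical helper: index an assembly and map paired short reads to it.
# Arguments: assembly FASTA, forward reads, reverse reads, output BAM, threads.
map_short_reads() {
    # bwa needs an index of the assembly before any mapping can happen.
    bwa index "$1" || return 1
    # Map the paired reads and sort the alignments by coordinate.
    # Assumption: samtools is available on the system.
    bwa mem -t "${5:-8}" "$1" "$2" "$3" | samtools sort -o "$4" - || return 1
    # Index the sorted BAM so downstream tools can read it efficiently.
    samtools index "$4"
}
```

For example, `map_short_reads long_read_assembly.fasta illumina_reads_forward.fastq.gz illumina_reads_reverse.fastq.gz short_vs_hybrid.bam 8`, where the BAM file name is a placeholder.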
- Finally, extract all the contigs/scaffolds with a Cyanobacteria ID. Save these scaffolds in a file called `final_cyano_assembly.fasta`.
- Map the short reads to `final_cyano_assembly.fasta` using BWA. NOTE: Please remember to run `bwa index final_cyano_assembly.fasta` first.
- Use the `pileup.sh` program to summarize the coverage results. Save the number of mapped reads and the percentage of reads mapped.
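One possible way to do the extraction step is with `awk`, once the IDs of the Cyanobacteria scaffolds have been written to a text file, one ID per line. The sketch below assumes such a file exists (how you build it from your taxonomy results is up to you); the function name `extract_contigs` and the file name `cyano_ids.txt` are hypothetical. An ID is taken to be the first word of the FASTA header, without the `>`.

```shell
# Hypothetical helper: print only the FASTA records whose ID is listed.
# Arguments: file with one scaffold ID per line, assembly FASTA.
extract_contigs() {
    awk 'NR == FNR { keep[$1] = 1; next }              # first file: remember the IDs
         /^>/ { id = substr($1, 2); printing = (id in keep) }  # header: wanted or not?
         printing' "$1" "$2"                            # print lines of wanted records
}
```

For example, `extract_contigs cyano_ids.txt assembly.fasta > final_cyano_assembly.fasta`, where `assembly.fasta` stands for your hybrid assembly file.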
- Use the `pileup.sh` program to summarize the coverage results and fill the following table.
| Dataset | Number of reads | Percentage of mapped reads | Average coverage |
|---|---|---|---|
| Short read assembly | | | |
| Hybrid assembly | | | |
- Did your long reads help improve the assembly? Justify your answer.