Chapter 8 Hybrid Assembly Methods for Short and Long Reads

Objectives

Understand the advantages of hybrid genome assemblies
Perform hybrid genome assembly using Unicycler
Evaluate and compare assembly quality metrics

Now that sequencing is cheaper and sequencers are easier to access, we can also include long reads in our assembly. Long reads give the assembler longer regions in which the different fragments can anchor and connect, which in theory produces longer scaffolds and a more complete assembly. On the other hand, long reads usually have lower quality than short reads and are less abundant, so an assembly built from long reads alone may contain a larger percentage of errors than, for example, an assembly built from short reads only.

Hybrid assemblies combine the best of both worlds. They use both short and long reads to produce a higher quality assembly than one built from long reads only, with longer and fewer scaffolds than an assembly built from short reads only.

To perform hybrid genome assemblies, we need two main files:

Our Illumina reads (choose between raw reads or cleaned reads)
Our Nanopore reads (located in the /course_data/BIOL209/nanopore_raw_data/ folder)

8.1 Hybrid assembly using `Unicycler`

Unicycler is a combination of programs (also called a pipeline) that performs a hybrid assembly. A manual hybrid assembly (i.e. running each program separately) usually entails the following steps:

SPAdes to assemble the short reads
miniasm and racon to bridge between the scaffolds from the short read assemblies
bwa or bowtie to map the reads to the new elongated scaffolds for error control
pilon to correct the errors detected by the previous read mapping step

However, for prokaryotic organisms such as the ones we are trying to assemble, Unicycler will do everything mentioned above and more, so we should get the best assembly possible while using as much of our data as possible.

To run Unicycler with the long read and short read datasets, modify the following command with your FASTQ reads:

/Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py --spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py --racon_path=/Smaug_SSD/bin/racon/build/bin/racon -1 illumina_reads_forward.fastq.gz -2 illumina_reads_reverse.fastq.gz -l /course_data/BIOL209/nanopore_raw_data/longread.fastq.gz -o unicycler_hyb_S167 -t 8

Where:

--spades_path= is the location of the SPAdes executable
--racon_path= is the location for RACON, a program that ‘generates genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.’
-1 and -2 are the forward and reverse Illumina reads
-l is the long-read file (Nanopore or PacBio)
-o is the output directory
-t is the number of processor threads to use

Let Unicycler run. Answer the following questions:

Fill in the following table using the information from the hybrid assembly and your previous best short read assembly (Hint: use the output from stats.sh):

Source	Number of scaffolds	%GC	N50	L90
Short read assembly
Hybrid assembly
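
To make the N50 and L90 columns concrete, here is a minimal shell sketch that computes them directly from a FASTA file. It is not a replacement for stats.sh, and the function name assembly_nl is made up for this example. It assumes the common convention in which N50 is a length (the scaffold length at which the largest scaffolds together reach 50% of the assembly) and L90 is a count (how many of the largest scaffolds are needed to reach 90%). Some tools label the two the other way around, so check which convention your output uses before filling in the table.

```shell
# Hypothetical helper: print scaffold count, total size, N50 (a length)
# and L90 (a count) for the FASTA file given as the first argument.
assembly_nl() {
    # Step 1: print one length per sequence (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from largest to smallest.
    sort -nr |
    # Step 3: walk down the sorted list, tracking the cumulative length.
    awk '{ l[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += l[i]
                 if (!n50 && cum * 2 >= total) n50 = l[i]       # reached 50%
                 if (!l90 && cum * 10 >= total * 9) l90 = i     # reached 90%
             }
             printf "scaffolds=%d total_bp=%d N50=%d L90=%d\n", NR, total, n50, l90
         }'
}
```

For example, `assembly_nl unicycler_hyb_S167/assembly.fasta` would summarize the hybrid assembly, assuming the run above finished and wrote its assembly.fasta into the output folder. A higher N50 and a lower L90 both point to a less fragmented assembly.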

Use a Sketch assay to determine the taxonomic makeup of your hybrid assembly and answer the following questions:

Which genera have the highest WKID scores?
What is the WKID score of your assembly when compared to Synechococcus?
Summarize your results and what they mean

Use blobtools to identify the taxonomic makeup of your hybrid assembly and answer the following questions. Map only your short reads to the new assembly, and remember to run bwa index long_read_assembly.fasta before you map the reads using bwa mem.

Include the image here:
What is the taxonomic unit with the highest percentage of scaffolds in our assembly?
What is the percentage of scaffolds with assignment to Synechococcus?
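
The index-then-map step that blobtools needs can be sketched as a small shell function. This is only an illustration: the function name map_short_reads and the output file name are made up, and only the basic bwa index and bwa mem calls (with -t for threads) are used. blobtools may need the result converted to a sorted BAM file, which is not shown here.

```shell
# Hypothetical helper: index an assembly and map paired-end short reads to it.
# Usage: map_short_reads assembly.fasta reads_R1.fastq.gz reads_R2.fastq.gz out.sam [threads]
map_short_reads() {
    # Build the BWA index first; bwa mem cannot run without it.
    bwa index "$1" || return 1
    # Map the forward and reverse reads and save the alignments as SAM.
    bwa mem -t "${5:-8}" "$1" "$2" "$3" > "$4"
}
```

A call for this exercise might look like `map_short_reads long_read_assembly.fasta illumina_reads_forward.fastq.gz illumina_reads_reverse.fastq.gz short_vs_hybrid.sam 8`, where short_vs_hybrid.sam is a placeholder name.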

Finally, extract all the contigs/scaffolds identified as Cyanobacteria. Save these scaffolds in a file called final_cyano_assembly.fasta
Map the short reads to final_cyano_assembly.fasta using BWA. NOTE: Remember to run bwa index final_cyano_assembly.fasta first
Use the pileup.sh program to summarize the coverage results. Save the number of mapped reads and the percentage of reads mapped.
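
One possible way to do the extraction step is sketched below. It assumes you have already written the IDs of the Cyanobacteria scaffolds (taken from your blobtools results) to a plain text file, one ID per line. The file name cyano_ids.txt and the function name extract_by_id are hypothetical. The ID is taken to be the first word of each FASTA header, without the leading ">".

```shell
# Hypothetical helper: keep only the FASTA records whose ID is listed in an ID file.
# Usage: extract_by_id ids.txt assembly.fasta > subset.fasta
extract_by_id() {
    awk -v ids="$1" '
        BEGIN {
            # Load the wanted IDs (first word of each non-empty line).
            while ((getline line < ids) > 0) {
                sub(/\r$/, "", line)
                if (split(line, f, " ") > 0) keep[f[1]] = 1
            }
            close(ids)
        }
        # On each header, decide whether this record should be printed.
        /^>/ { id = substr($1, 2); printing = (id in keep) }
        # Print header and sequence lines of wanted records only.
        printing
    ' "$2"
}
```

For this exercise the call could be `extract_by_id cyano_ids.txt long_read_assembly.fasta > final_cyano_assembly.fasta`. Counting the headers afterwards with `grep -c ">" final_cyano_assembly.fasta` is a quick check that the number of scaffolds matches the number of IDs.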

Use the pileup.sh program to summarize the coverage results and fill in the following table.

Dataset	Number of reads	Percentage of mapped reads	Average coverage
Short read assembly
Hybrid assembly
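
To see what the average coverage column means, the sketch below computes a length-weighted mean from a per-scaffold coverage table. The layout is an assumption: a tab-separated file whose header lines start with "#", with the scaffold ID in column 1, the average fold coverage in column 2, and the scaffold length in column 3. Check the header of your own file before relying on it. The function name mean_coverage is made up for this example.

```shell
# Hypothetical helper: length-weighted mean coverage across all scaffolds.
# ASSUMED columns: 1 = scaffold ID, 2 = average fold coverage, 3 = length (bp).
mean_coverage() {
    awk -F '\t' '!/^#/ && NF >= 3 { bases += $2 * $3; len += $3 }
                 END {
                     if (len > 0) printf "%.2f\n", bases / len
                     else print "NA"
                 }' "$1"
}
```

Weighting by length matters because a short scaffold with very high coverage (a plasmid, for example) would otherwise pull a simple average of the per-scaffold values upward.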

Did your long reads help improve the assembly? Justify your answer.
