Chapter 7 Assembly Methods for Short Read Sequencing
Objectives
- To understand the algorithms and programs used for De Novo assembly
- To test between different K-values and determine the best value for an assembly
- To compare between different assembly software for short read sequencing
- To identify metrics used to determine assembly quality
- To clean contaminants from genomic assemblies
Recommended: Create a folder inside of your data
folder called assembly
and run everything there.
7.1 Comparing between k-mers in genome assembly
First, lets test different k-mers and observe how it affects the assembly.
We will use SPADES
, one of the best assemblers available currently, to assemble our data.
- Using your cleaned and raw reads, choose one of the following K values (7, 21, 47) and run
SPADES
by modifying the following code
python3 /Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py -t 3 -k k_value -1 paired_ends_R1.fastq.gz -2 paired_ends_R2.fastq.gz -o spades_k_value
Using the SMAUG scheduler slurm
Since we are using so many jobs at the same time, we have to start using a scheduler. The scheduler works as a guardian of the computer that runs the jobs in the background without interfering with other processes in the computer.
So, to use the scheduler, we do the following:
sbatch -J name_of_the_job –wrap “command_to_be_run” -o name_of_log_file
Then, check if the job is being run by using the squeue command.
If the job is being run, you’ll see the job in your screen.
- Check the stats of your assembly by using the
stats.sh
program from BBMAP as such:
stats.sh in=spades_k_value/scaffolds.fasta
- Fill the following table.
Dataset | K-value | Number of scaffolds | %GC | N50 |
---|---|---|---|---|
Raw reads | 7 | |||
Raw reads | 21 | |||
Raw reads | 47 | |||
Cleaned reads | 7 | |||
Cleaned reads | 21 | |||
Cleaned reads | 47 |
- Answer the following questions:
- Define Kmer using your own words
- Define N50 and L90 using your own words
- Based on your results and the table you filled, what is the best K-mer value?
7.2 Using a multi-K-mer approach to avoid manual k-mer selection
As we can tell, K-mer size affects tje assemblies. That means we will have to test multiple K-mer values to find which combination of k-mers is the best one for our assembly.
Luckily, there are programs that will choose the number of optimal k-mers to be tested and will select the best assembly as a final product.
Unicycler
is an assembly program that doesn’t exactly that. Its a program that is specific for prokariotic organisms, so it fits our genomes perfectly.
The Unicycler
program is super cool as it uses a lot of different programs (such as SPADES
, Racon
, Pilon
and others) to optimize the assemblies for bacterial/archaeal organisms.
Installing Unicycler
Before you can run unicycler you have to install it:
conda create -n unicycler_env python=3.9 -y
conda install -c bioconda unicycler
conda init bash
bash
- Run
Unicycler
for your clean and raw reads by modifying the following command:
conda activate unicycler_env
/Smaug_SSD/bin/Unicycler-0.5.1/unicycler-runner.py spades_path=/Smaug_SSD/bin/SPAdes-4.0.0-Linux/bin/spades.py -1 paired_ends_R1.fastq.gz -2 paired_ends_R2.fastq.gz -o unycycler_raw/clean
After the assembly has finished, evaluate the scaffolds using
stats.sh
and create a table below with the results.Answer the following questions:
- What was the best K-mer value selected by Unicycler?
- What assembly was better? Any of the SPADES assemblies or the Unicycler?
- We can calculate the coverage for the best assembly by using
bwa
to map the reads to our best assembly as such:
bwa index best_assembly.fasta
bwa mem -t 4 best_assembly.fasta paired_ends_R1.fastq.gz paired_ends_R2.fastq.gz | samtools view -S -b | samtools sort -O bam -o mapped.bam
samtools index mapped.bam
Where best_assembly.fasta
is your selected assembly, and your output is mapped.bam
. We will learn more about bam
files later on the course.
And then, use the pileup.sh
program to summarize the coverage results as such:
pileup.sh in=mapped.bam out=stats.txt
- Create a table summarizing the best assembly with coverage information:
Dataset | Assembly software | Number of reads | Number of scaffolds | %GC | N50 | L90 | Average coverage |
---|---|---|---|---|---|---|---|
7.3 Cleaning the assembly and detecting contamination
So, you think we have the best assembly? NOPE. We still have some work to do.
We have a little problem with our data, and that is that these bacteria might have grown with some other species.
So, chances are the DNA extraction also includes other bacteria that are not Synechococcus. How can we check this?
A. We do a Sketch assay that will tell us if there is DNA from other species in our genomes B. We do a coverage and taxonomy detection assay to determine the number and type of contaminants
We will be doing both in this laboratory assignment.
7.3.1 Sketch assay
A Sketch assay is a method of rapidly comparing your assembly results k-mers against databases populated with different reference organisms (such as RefSeq from NCBI). The method it uses is to create a hash. A Hash is a function that converts a large object into a smaller one. Then, you compare this hashes against the hashes from the reference databases to see what your genome looks like. This means that you can see if your genome has contamination from other organisms very rapidly.
Lets do a sketch assay for the assemblies:
- Obtain the sketch for your assembled genome and compare it against the sketch of the RefSeq datasets using
sendsketch.sh
as such:
/Smaug_SSD/bin/bbmap/sendsketch.sh in=best_assembly.fasta refseq
Check the results of the program. The columns WKID KID ANI SSU Complt Matches Unique
represent the Weighted k-mer identity
, K-mer identity
, Average nucelotide identity
, 16S identity by alignment
, Genome completeness
, The number of shared kmers between query and ref
, The number of shared kmers between query and ref, and no other ref.
So, if your result is 100% in the WKID
against Synechococcus, that means that 100% of the k-mers from your assembly match the k-mers from Synechococcus.
- Answer the following questions:
- What are the GENUS with higher WKID scores?
- What is the WKID score of your assembly when compared to Synechococcus?
- Summarize you results and what they mean
7.4 Coverage and Taxonomy detection assay
Before we start do this:
conda deactivate
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
follow the instructions and then do the following to install blobtools
, the program we will use to assay for contamination.
conda create -n blobtools
conda activate blobtools
micromamba install bioconda::blobtools
Now we suspect that our sequences have a lot of environmental contaminants, but we don’t know which of the scaffolds are from our bacteria of interest or which ones are from contaminants. This is because the results from the sendsketch.sh
assay are across the entire assembly.To have a good assemblythat only respresnets our species of interest we should identify and remove all scaffolds that are not matching the taxa we are interested in.
To do this, we use blobtools
. Blobtools
determines the taxonomic assignment and the coverage for every single scaffold in your assembly!
Blobtools
needs three files:
- Your assembly in
FASTA
format - Your
BAM
file with mapped reads to your best assembly (mapped.bam
). This will assess the coverage for each scaffold, as contaminant scaffolds tend to have a different coverage than scaffolds from our species of interest. - A
NCBI BLAST
table where each scaffold is compared to the NCBI database. This will tell us the taxonomic assignment of each scaffold.
Then, blobtools
will put all the information together and provide us with two main results:
- A list of the taxonomic assignment and coverage of each scaffold
- A plot that shows the taxonomic assignment and coverage of each scaffold
Let’s, then, use blobplots and create the taxonomic and coverage assay for our assembled scaffolds.
As of now, we have two files:
- Your assembly in FASTA
format
- Your BAM
file with mapped reads to your best assembly (mapped.bam
) from step 8 in this lab
We are missing the BLAST taxonomic assignments
- Do a
BLAST
search of our scaffolds against NCBI’snt
database by first creating a new folder calledblobtools
and running everything there:
mkdir blobtools
cd blobtools
blastn -num_threads 4 -task megablast -query best_assembly.fasta -db /course_data/blastdb/nt -outfmt '6 qseqid staxids bitscore std' -max_target_seqs 1 -max_hsps 1 -evalue 1e-25 -out best_assembly.blast
Where best_assembly.fasta
is you assembly file, -db
is the nt
database from NCBI
- Answer the following questions using your knowledge from
MBB 101
-
What
blast
program are we using and what does it mean? What is thent
database? -
What do the
-max_target_seqs
and-max_hsps
flags mean? - What is the e-value used as threshold? What does e-value mean?
- Now, after
BLAST
is done we have all the elements to runblobltools
as such:
blobtools create -i best_assembly.fasta -b mapped.bam -t best_assembly.blast -o blobltools_out --db /Smaug_SSD/bin/blobtools/data/nodesDB.txt
Where -i
is your assembly, -b
is the BAM
file of reads mapped to the assembly, -t
is the BLAST
output, and -o
is the output name for the run.
Blobtools
needs a two additional steps in order to visualize and plot the results as this:
blobtools view -i blobltools_out.blobDB.json;
blobtools plot -i blobltools_out.blobDB.json
- Let’s download the plots in the
.png
files.
the blobtools_out.blobDB.json.bestsum.phylum.p8.span.100.blobplot.read_cov.bam0.png
shows the percentage of all taxonomic assignments in our scaffold.
- Answer the following questions:
- Include the image.
- What is the taxonomical unit with the highest percentage of scaffolds in our assembly?
- What is the percentage of scaffolds with assignment to Synechococcus?
The blobtools_out.blobDB.json.bestsum.phylum.p8.span.100.blobplot.bam0.png
file is a blobplot that shows three main things:
- In Top: A length (span) vs. GC content plot that shows the size of each scaffold versus its GC content. The colors represent the taxonomic assignment
- In the middle: A Bubbleplot of coverage vs GC content for each scaffold. Larger the circle, larger the scaffold. The colors represent the taxonomic assignment
- In the right: A coverage vs. length (span) plot that shows the coverage of each scaffold versus its length The colors represent the taxonomic assignment
What this plot shows us is the differences between all the scaffolds assembled. We can see, here, how the contaminants look versus our desired sequence.
- Answer the following questions:
- Include the image.
- What can you say about this plot?
- Back in the cluster, check the
blobtools_out.blobDB.json.bestsum.phylum.p8.span.100.blobplot.stats.txt
file. That file contains a summary of all the stats fromblobtools
. Copy and paste the results into this page: https://www.tablesgenerator.com/markdown_tables and copy and paste the result below:
- Add table here
Here’s what each column means:
name
: Best taxonomic hitcolour
: Color used for the plotscount_visible
: Number of scaffolds associated with the taxonomic hitcount_visible_perc
: Percentage of assembly scaffolds associated with the taxonomic hitspan_visible
: Number of bases associated with the taxonomic hitspan_visible_perc
: Percentage of total bases associated with the taxonomic hitn50
: N50 per taxonomic higc_mean
: Mean GC content per taxonomic hitgc_std
: Standard deviation of GC content per taxonomic hitbam0_mean
: Mean coverage per taxonomic hitbam0_std
: Standard deviation of coverage per taxonomic hitbam0_read_map
: Number of reads mapped to the taxonomic hitbam0_read_map_p
: Percentage of reads mapped to the taxonomic hit
- Answer the following questions:
- What is the most represented taxonomic hit in your dataset? How did you come to that conclusion?
- What about our Cyanobacteria? Can you summarize the results in your own words?
Finally, to clean the assembly we will extract the scaffolds that match Cyanobacteria. The information of the taxonomic assignment for each scaffold is found in the
blobtools_out.blobDB.table.txt
.Open the first 20 lines of the file using the program of your choice.
Here’s what each column mean:
name
: Name of the sequencelength
: Total length of the sequence, i.e. count(A, G, C, T, N)GC
: GC content percentage of the sequence, i.e. count(G, C)/count(A, G, C, T)N
: Number of N’s in the sequence, i.e. count(N)bam0
: Coverage from bam0 (see main header for filename)phylum.t.6%s
: The assigned taxonomy of the sequence at the taxonomic rank of “phylum” under the tax-rule “best-sum”phylum.s.7%s
: The sum of scores for the taxonomy of the sequence at the taxonomic rank of “phylum” under the tax-rule “best-sum”phylum.c.8
: The c-index for the taxonomy of the sequence at the taxonomic rank of “phylum” under the tax-rule “best-sum”
- Answer the following question:
- How can you use this file to extract the scaffolds of interest?
7.5 Extracting the scaffolds of interest
The final step of our assembly is to extract all the scaffolds that have cyanobacterial associations.
We can extract those scaffolds by using the blobtools seqfilter
program. This program can be ran as such:
blobtools seqfilter -i FASTA -l LIST -o PREFIX > Cynaobacteria_scaffolds.fasta
blobtools seqfilter -i assembly.fasta -l bplot/cyano_scaffs.txt -o Cyano_scaffold
Where -i
is your assembly, -l
is the list of sequence names you want to extract, and -o
is the new names for the new assembly.
So, we have the -i
and -o
elements.
- Answer the following question:
-
Using the command line tools we have learnt before, how do we
extract everything that is Cyanobacteria from the
blobtools_out.blobDB.table.txt
to create the required file by-l
? (Hint, use thegrep
andcut
commands in a pipe.)
This is an example of how the file should look like:
$ head list_of_cyano_scaffolds.txt
4
8
21
30
34
-
Run the
blobtools seqfilter
command and extract the cyanobacterial reads. Then use thestats.sh
command in the new extracted files (your final assembly). What are the stats? Create a table below. -
Finally, how can you check rapidly that there is no contamination in
the new assembly? (Hint: Use a program that compares
sketches
). Paste the result below. - Summarize your final assembly in your own words:
7.6 2025 Results:
S165 | S167 | 6167 | S166 | |||||
---|---|---|---|---|---|---|---|---|
Cyano scaff # | Cyano gen length | Cyano scaff # | Cyano gen length | Cyano scaff # | Cyano gen length | Cyano scaff # | Cyano gen length | |
Spades | 136 | 5.25 MB | 621 | 2.47 | 852 | 2.85 | 32 | 2.36 |
Unicycler | 67 | 6.16 MB | 29 | 2.37 | 62 | 2.52 | 31 | 2.37 |
Observations | Inconclusive data: Too many scaffolds and genome too long. | Unicycler is better: Fewer scaffolds, less contaminants, Shorter length? | Unicycler is better: Fewer scaffolds, less contaminants, | Pretty similar. Unicycler may be marginally better BUT if you only extract cyanobacteria, doesnt really matter. | ||||
Strategies | Alternative: Long read sequencing | Decent enough. Annotate! | Decent enough. Annotate! | Decent enough. Annotate! |