Chapter 5 Improving Illumina Read Quality
Objectives
- To understand quality control (QC) for sequencing experiments
- To identify common errors in illumina sequencing
- To propose solutions to QC issue
- To learn how to use Trimmomatic
5.1 Checking the S166 sample and trying to clean it as best as possible
We have looked at the raw reads from S166 and have discovered that it has some possible errors and elements that may results in a poor assembly, errors in some sequences, over represented regions of the genome or even the present of contaminants.
The main ones you mentioned in class were:
- A high percentage of adapters in the end of the R2 sequences
- Reduced quality in the last 5-7 bases of the files
- A set of reads that are smaller than the expected read length (150bp)
- Reads with lower average quality than the maximum quality
- A GC content distribution that is bimodal
So your homework is to come up with how would you clean your data based on what we learnt in class and the slides of the Short read sequencing class using Trimmomatic
.
To run trimmomatic
in the SMAUG, modify the following template:
java -jar /Smaug_SSD/bin/Trimmomatic-0.39/trimmomatic-0.39.jar PE -T 1 R1.fastq.gz R2.fastq.gz Processed_R1.fastq.gz Processed_R1.Unpaired.fastq.gz Processed_R2.fastq.gz Processed_R2.Unpaired.fastq.gz OPTIONS_FOR_TRIMMING
Where the R1.fastq.gz
and R2.fastq.gz
are your forward and reverse FASTQ files (NB0621_05_S166_R1_001.fastq.gz
and NB0621_05_S166_R2_001.fastq.gz
).
The OPTIONS_FOR_TRIMMING
are the commands that you will use for cleaning. You can use more than one at a time, and the order matters.
So, if you want to:
- Remove the adapters
- Remove all reads with less than 145 bp
The command will be:
java -jar /Smaug_SSD/bin/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 1 NB0621_05_S166_R1_001.fastq.gz NB0621_05_S166_R1_001.fastq.gz Clean_S166_R1.fastq.gz Clean_S166_R1.Unpaired.fastq.gz Clean_S166_R2.fastq.gz Clean_S166_R2.Unpaired.fastq.gz ILLUMINACLIP:/Smaug_SSD/bin/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True MINLEN:145
5.2 What to turn in?
How would you clean the reads. This means:
- For each problem, plan a strategy (number of bases, quality threshold, adapters to be removed, etc.)
- What commands would you use for your sequence processing pipeline
For extra points that will count as one extra lab (100 points):
- The updated HTML of FASTQC before and after processing the reads
- The location of your processed reads within Smaug