Chapter 3 FASTQ files and understanding read qualities

Objectives

To learn about the most commonly used types of files for genome sequencing: FASTQ files
To understand the information included in FASTQ files
To identify good quality and bad quality reads from genome sequencing datasets

Today we will explore the genomes of the four Synechococcus samples that the Ahlgren Lab has provided us to assemble and annotate. These genomes have already been sequenced with Illumina technology.

The genomes were sequenced on the Illumina HiSeq platform, using paired-end reads with a read length of 150 bp.

Based on that information, answer the following questions:

How many FASTQ files per sample do you expect (one FASTQ per sample or two FASTQ per sample)? Justify your answer.
Do you expect all the reads to be of the same size? Which size?

3.1 Sample distribution

As mentioned in class, we have three main samples sequenced: S165, S166 and S167. We will use S165 as an example this week, and then you will present the results for S166 and S167 as part of your lab report.

We will use Smaug for our data analyses. Smaug, or the Great Worm, is the supercomputer we use in the T lab for mycological evolution. All the data is available in Smaug, so please be careful with the data provided.

To access Smaug you need to connect via http://140.232.222.14:8787/. Your Smaug username is the same as your Clark ID, but without the @clarku.edu (e.g. my username is jtabima).

The files are stored in a folder at /course_data/BIOL209/raw_illumina_data/raw_reads/SAMPLE_NAME. EVERYTHING that you will do, run or program will be stored in those folders, so please be careful and DON’T ERASE THE GENOME FILES!

3.2 Basic information and creating backups

Inside of your sample folders, you should be able to find the genomes in FASTQ format.

What are the genome files? Add the names here

In your home folder (cd ~) create a folder called Genome_backup_SAMPLE_NAME and copy the genome FASTQ files there.
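The backup step can be sketched as a small shell function. This is only a sketch: the function name backup_fastq is made up for illustration, and it assumes your read files end in .fastq.gz, which you should confirm by listing your sample folder first.

```shell
# Copy every gzipped FASTQ file from a sample folder into a backup folder.
# Assumption: the read files end in .fastq.gz; adjust the pattern if yours differ.
backup_fastq() {
  src_dir=$1    # folder that holds the original FASTQ files
  dest_dir=$2   # backup folder to create
  mkdir -p "$dest_dir"
  cp "$src_dir"/*.fastq.gz "$dest_dir"/
  ls -lh "$dest_dir"   # confirm the copies exist and check their sizes
}

# Example call (replace SAMPLE_NAME with your sample):
# backup_fastq /course_data/BIOL209/raw_illumina_data/raw_reads/SAMPLE_NAME ~/Genome_backup_SAMPLE_NAME
```

Because cp copies rather than moves, the original files in the course folder are left untouched.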

We know that FASTQ files are organized as four lines per read:

Sequence information
DNA sequence
Spacer
Quality

The sequence information header (line 1) starts with an @, followed by a set of strings separated by colons. The first string, between the @ and the first colon (:), is the sequence identifier. Every read header in the file starts with that ID.
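To see the four-line structure and the header ID in practice, here is a minimal sketch on a toy file. Everything in it is made up for illustration (the identifier TOYID, the reads, and the file name toy_R1.fastq.gz); it does not touch the course data.

```shell
# Build a tiny gzipped FASTQ with two made-up reads (TOYID is a hypothetical ID).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@TOYID:1:FLOWCELL:1:1101:1000:2000 1:N:0:1' \
  'ACGTACGT' \
  '+' \
  '@IIIIIII' \
  '@TOYID:1:FLOWCELL:1:1101:1001:2001 1:N:0:1' \
  'TTGGCCAA' \
  '+' \
  'IIIIIIII' | gzip > "$tmpdir/toy_R1.fastq.gz"

# Show the first read (four lines) without uncompressing the file on disk.
zcat "$tmpdir/toy_R1.fastq.gz" | head -n 4

# Extract the ID: the text between the leading @ and the first colon.
seq_id=$(zcat "$tmpdir/toy_R1.fastq.gz" | head -n 1 | cut -d ':' -f 1 | cut -c 2-)
echo "Sequence ID: $seq_id"

# Count the header lines that start with @ID followed by a colon.
n_reads=$(zcat "$tmpdir/toy_R1.fastq.gz" | grep -c "^@${seq_id}:")
echo "Reads: $n_reads"
```

Note that the first quality line in the toy file also starts with @, because @ is a valid quality character. Searching for the full "@ID:" prefix rather than for @ alone avoids counting such quality lines as reads.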

Using that information, answer the following questions:

What is your sample’s sequence ID?
Can you use this information to count the number of reads per sample? Add the code below and the result. (Remember: to look at a GZ file without uncompressing it, use the zcat command.)
Is the number of reads in files R1 and R2 the same? Was this expected? Justify your answer.
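As an independent cross-check on the ID-based count, the four-lines-per-read rule gives a second way to count reads: divide the number of lines by four. The sketch below uses two hypothetical helper functions, count_reads and compare_pair, and assumes the file is a well-formed FASTQ with no wrapped sequence lines.

```shell
# Count reads in a gzipped FASTQ as (number of lines) / 4.
# Assumption: each read occupies exactly four lines.
count_reads() {
  n_lines=$(zcat "$1" | wc -l)
  echo $((n_lines / 4))
}

# Report the read counts of an R1/R2 pair; succeeds only if they match.
compare_pair() {
  r1=$(count_reads "$1")
  r2=$(count_reads "$2")
  echo "R1: $r1 reads, R2: $r2 reads"
  [ "$r1" -eq "$r2" ]
}

# Example call (file names are placeholders):
# compare_pair SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz && echo "counts match"
```

If this count and the ID-based count disagree for the same file, that is worth investigating before going further.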

Finally, summarize your results in this table:

Sample Name	Sequence ID	Number of Reads in R1	Number of Reads in R2
030121_38	NB551394	2649963	2649963
S165
S166
S167

3.3 Quality Control

High-throughput sequencing technologies are not perfect. They can introduce various types of errors and contamination. Blindly using raw sequence data for downstream analysis is risky and can lead to poor or inaccurate results.

Here is a very helpful paper on how to “diagnose” issues with sequence data and how to address them. Reading it will help you answer the questions and gain an understanding of common problems with high-throughput sequencing:

Zhou, X. and Rokas, A., 2014. Prevention, diagnosis and treatment of high‐throughput sequencing data pathologies. Molecular ecology, 23(7), pp.1679-1700.

The paper is also available on the Canvas page of BIOL209.

3.3.1 Evaluating QC of FASTQ using FastQC

To evaluate the quality of our data and to “diagnose” any problems, we will use software called FastQC.

This webpage can be used to learn more about each metric presented in the FastQC output: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

3.3.1.1 Running FastQC

Create a directory called QC within your sample folder.
Running FastQC is straightforward: use the following command for each of your FASTQ files.

/Smaug_SSD/bin/fastqc -o /your/output/directory/ sequence.file.fastq.gz
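To run the command above on every FASTQ file in a folder, a loop can help. The following is a sketch only: the function name run_fastqc is made up, it assumes the read files end in .fastq.gz, and it creates the QC directory first because FastQC expects the output directory to exist already.

```shell
# Path to FastQC on Smaug, taken from the command above.
FASTQC="${FASTQC:-/Smaug_SSD/bin/fastqc}"

# Run FastQC on every .fastq.gz file in a sample folder,
# writing the reports to a QC subfolder.
run_fastqc() {
  sample_dir=$1
  mkdir -p "$sample_dir/QC"          # the output directory must exist
  for fq in "$sample_dir"/*.fastq.gz; do
    "$FASTQC" -o "$sample_dir/QC" "$fq"
  done
  ls -lh "$sample_dir/QC"            # inspect what FastQC produced
}

# Example call (replace the path with your own sample folder):
# run_fastqc /course_data/BIOL209/raw_illumina_data/raw_reads/SAMPLE_NAME
```

Listing the QC folder at the end is a convenient starting point for the question that follows.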

Answer the following question:

What are the outputs of FastQC?

Download the .html files to your local machine (i.e. your computer). The files should be readable by any web browser.

3.3.1.2 Analyzing and describing results

Once you’ve transferred the .html files to your computer, open each one in a web browser. You should see a nicely arranged page with 12 analyses (see the site for an example: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Each of these analyses can tell us something about the quality of our data.

Answer the following questions:

How many sequence reads make up our dataset? Does this agree with your previous calculation?
What is the read length of our data? Is this result expected, or do we see anything unusual?
What is the GC content of our data? Does this match expectations?
Describe the sequence quality of the files (per base sequence quality, per tile sequence quality, per sequence quality score).
In general, are the forward and reverse reads of similar quality? If they differ, how do they differ?
Are there any issues with the various other metrics analyzed by FastQC (i.e. the red circles with the ‘X’ on the left under the ‘Summary’ heading)? Do you notice any patterns that are puzzling or troubling with any of these analyses?
Improving the Illumina Data: Outline a rough plan to improve the Illumina sequence data. You don’t need to provide specific program commands and parameters, but do take a look at some of the programs recommended by Zhou and Rokas (2014) to start thinking about how you would improve or filter the read data. By all means consider other software if you find something interesting or useful.
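When describing the sequence quality plots, it may help to recall how the characters on the quality line (line 4 of each read) map to numeric scores. The sketch below assumes the Phred+33 encoding, which is the usual convention for recent Illumina data but is an assumption here; the function name phred_score is made up. Under that convention, a score Q corresponds to an error probability of 10^(-Q/10).

```shell
# Convert one quality character to its Phred score, assuming Phred+33 encoding.
phred_score() {
  ascii=$(printf '%d' "'$1")   # numeric ASCII code of the character
  echo $((ascii - 33))
}

phred_score 'I'   # 'I' is ASCII 73, so Q = 40 (error probability 1 in 10,000)
phred_score '#'   # '#' is ASCII 35, so Q = 2 (very low confidence)
```

FastQC reports the encoding it detected in its Basic Statistics module, so you can check this assumption against your own report.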