Chapter 2 UNIX Challenge
Objectives
- To practice and remember basic coding skills that we will use during the semester.
- To refresh the basics of data analysis on the command line.
- To refamiliarize ourselves with the basic file formats used in computational biology.
Download the Synechococcus sp. WH 8101 files Syn_WH8101.asmb.fasta, Syn_WH8101.cds.fasta, and Syn_WH8101.aa.fasta from the Canvas page and answer the following questions in an R Markdown file.
PLEASE: Include the code used for each question as code blocks, or points will be deducted!
- How many lines are in each FASTA file?
- How many sequences are in each FASTA file?
- What kind of molecule is found in each FASTA file?
- Why are the Syn_WH8101.asmb.fasta and Syn_WH8101.cds.fasta files different from each other?
- Provide at least five different gene/protein names from the FASTA files in question.
- Count the occurrences of each nucleotide (A, T, C, and G) in the whole-genome sequence (WGS) file of WH8101 and compare them with the nucleotide content of the coding sequence file of WH8101.
- In which of the three files can you find the genes WP_174719562.1, WP_130130567.1, and WP_130130558.1?
- Find the functions of the genes WP_130130185.1 and WP_130130145.1 in the Syn_WH8101.aa.fasta file and add them to an R Markdown table.
- How many genes with the hypothetical protein function can you find in the Syn_WH8101.aa.fasta file?
- Provide two more protein functions (annotations) and the number of genes associated with each of them.
- How would you extract all the information related to sequence names for each file?
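
As a warm-up, here is a minimal sketch of the kinds of commands these questions call for, run on a tiny made-up FASTA file rather than on the course files. The file name toy.fasta, its two records, and all variable names are assumptions for illustration only. The sketch also assumes that every sequence record starts with a header line beginning with ">", which is the usual FASTA convention. It is one possible approach, not the expected answer.

```shell
#!/bin/sh
# Build a tiny, made-up FASTA file in a temporary directory.
# toy.fasta and its contents are hypothetical, not course data.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.fasta"
printf '>seq1 hypothetical protein\nATGC\nGGCA\n>seq2 toy kinase\nTTAA\n' > "$toy"

# Count the lines in the file (tr strips padding that some wc versions add).
n_lines=$(wc -l < "$toy" | tr -d ' ')

# Count the sequences: one header line starting with ">" per record.
n_seqs=$(grep -c '^>' "$toy")

# Extract only the header lines (sequence names and annotations).
headers=$(grep '^>' "$toy")

# Count the headers that carry a given annotation.
n_hypo=$(grep -c 'hypothetical protein' "$toy")

# Count each nucleotide: drop header lines, keep only one base, count characters.
base_counts=""
for base in A C G T; do
  n=$(grep -v '^>' "$toy" | tr -cd "$base" | wc -c | tr -d ' ')
  base_counts="$base_counts$base:$n "
done
base_counts=${base_counts% }   # trim the trailing space

echo "lines: $n_lines"
echo "sequences: $n_seqs"
echo "hypothetical proteins: $n_hypo"
echo "base counts: $base_counts"
printf '%s\n' "$headers"

# Clean up the temporary directory.
rm -rf "$tmpdir"
```

To adapt the sketch, replace "$toy" with the path to one of the downloaded files. Note that counting A, C, G, and T only makes sense for the nucleotide files, since in a protein file those letters would be amino acid codes.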