In the links
below, [pgm]
indicates a link with most of the information filled in; e.g. the
program name, query, and
library. [seq] links go to the
NCBI, for more information about the sequence. In general, you should
click [pgm] links, but not [seq] links.
Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations
Use the FASTA search page [pgm] to compare Honey bee
glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties? (what is the penalty for a one-residue gap? two residues? -- this is a "trick" question)
What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
Homologs, non-homologs, and the statistical control.
What is the highest (worst) E()-value shown? What should the
highest (worst) E()-value calculated in the search be (approximately)?
Which alignment has the worst statistically significant (E()<0.001) score? Do you think this sequence is likely to be homologous?
What is the highest scoring (most significant) non-homolog? (The non-homolog with the
highest alignment score, or the lowest E()-value.) Why do you think it is not homologous?
Look for positive evidence (e.g. a non-homologous domain) for non-homology.
You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.
If the
statistical estimates are accurate, what should the E()-value for the
highest non-homolog (the highest score by chance) be? (This is a
control for statistical accuracy.)
What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs in the database?
How would you confirm that your candidate non-homolog was truly
unrelated? (Hint - compare your candidate non-homolog
with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)
Domains and alignment regions
There are three parts to the domain display, the domain structure of
the query (top) sequence (if available), the domain structure of the library (bottom)
sequence, and the domain alignment boundaries in the middle (inside the
alignment box). The boundaries and color of the alignment domain
coloring match the Region: sub-alignment scores.
Note that the alignment of Honey bee GSTD1
and SSPA_ECO57 includes portions of both the N-terminal and
C-terminal domains, but neither domain is completely aligned. Why might
the alignments not include the complete domains?
Is your explanation for the partial domain alignment consistent the
the belief that domains have a characteristic length because they represent compact folding units? How might you
test whether a complete domain is present?
Note that this Honey bee glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_ECO57, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - compare your candidate non-homolog
with SwissProt for a more comprehensive test.)
In a new window, repeat the
GSTD1 search [pgm] using the BLASTP62/-11/-1
scoring matrix that
BLAST uses.
Re-examine the honey bee/SSPA_ECO57 alignment and compare the alignment that you got with BLOSUM50 (the first search) with the alignment you get with BLASTP62.
(In the afternoon lecture and workshop, we will discuss how
different scoring matrices have different preferred evolutionary
distances. BLOSUM50 is more sensitive for alignmets that are about
25% identical, while BLOSUM62 works best for 30% identical
alignments. Keep that in mind while answering the following
questions.)
Do both Glutathione transferase domains align in the honey bee
protein?? Does that imply that one of the domains is not present?
Look at the alignments to the homologs above and
below SSPA_ECO57. Based on those. aligments, do you think
the Glutathione-S-Trfase C-like domain is really missing from the
honey bee protein?
Why did the alignment become shorter?
Why might a domain appear to be present in the first (BLOSUM50) search, but not in the second (BLOSUM62)?
Do the same Honey bee GSTD1 search (295842263) using the Course
BLAST [pgm] WWW page.
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog?
How do the BLASTP E()-values compare with the
FASTA (BLASTP62) E()-values for the distantly related
mammalian and plant sequences?
One of the slides this afternoon will show an alignment between
human and bovin PCBP2 (Poly(rC) binding protein 2). One of the
missing regions of the human protein is from residues 267-279 in
PCB2_HUMAN (Q15366) when aligning A0A3Q1LX34_BOVIN.
Use the align two sequences [pgm] link from the FASTA web site to align PCB2_HUMAN (enter into top box) with A0A3Q1LX34_BOVIN (lower box).
Note the locations (coordinates) of the two gaps in PCB2_HUMAN.
Go to the UCSC Genome web site to look at cow PCB2 gene. These coordinates display the cow genomic sequence from Chr5:26,543,511-26,562,251 (from the end of exon 10 to the beginning of exon 11)
Use the View menu option to get the DNA sequence of the cow PCB2 gene. First select View/DNA Sequence, then click the Get DNA option at the bottom of the page.
Copy the DNA sequence, and use the Use the align two sequences [pgm] link PCB2_HUMAN with cow PCB2 DNA sequence, that you copied/pasted from the UCSC Get DNA link.
Use TFASTX to compare residues 267-289 of PCB2_HUMAN to the DNA
sequence from the cow PCBP2 gene by pasting in the cow genomic sequence from Chr5:26543511:26562251
(from the end of exon 10 to the beginning of exon 11):
Compare the subset range of PCBP2_HUMAN 267-279 to the cow genomic
region using "Compare sequences" with TFASTX [pgm] using the "VT20" and
"BLASTP62" matrices. How long are the alignments? The percent
identity? What the bit scores? What are the E()-values of the two
alignments?
Try the same comparison with residues 169-198 of PCB2_HUMAN
using VT20 to align exon 7 of PCB2_HUMAN to the cow
genome region. Can you find the missing 4 amino acids at 169-172?