fasta.bioch.virginia.edu/mol_evol     tinyurl.com/fasta-demo

Programming for Biology -- 2025 - Similarity Searching Exercises


These exercises use programs on the FASTA WWW Search page and the BLAST WWW Search page.

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.


Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .

  1. Take a look at the output.

    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues? -- this is a "trick" question)
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
    7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?

  2. Homologs, non-homologs, and the statistical control.

    1. What is the highest (worst) E()-value shown? What should the highest (worst) E()-value calculated in the search be (approximately)?
    2. Which alignment has the worst statistically significant (E()<0.001) score? Do you think this sequence is likely to be homologous?
    3. What is the highest scoring (most significant) non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) Why do you think it is not homologous? Look for positive evidence (e.g. a non-homologous domain) for non-homology.

      You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.

    4. If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? (This is a control for statistical accuracy.)
    5. What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs in the database?
    6. How would you confirm that your candidate non-homolog was truly unrelated? (Hint - compare your candidate non-homolog with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)

  3. Domains and alignment regions

    1. There are three parts to the domain display, the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and color of the alignment domain coloring match the Region: sub-alignment scores.
    2. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why might the alignments not include the complete domains?
    3. Is your explanation for the partial domain alignment consistent the the belief that domains have a characteristic length because they represent compact folding units? How might you test whether a complete domain is present?

  4. Note that this Honey bee glutathione transferase shares significant similarity with both sequences from bacteria (SSPA_ECO57, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint - compare your candidate non-homolog with SwissProt for a more comprehensive test.)

  5. In a new window, repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses.
    Re-examine the honey bee/SSPA_ECO57 alignment and compare the alignment that you got with BLOSUM50 (the first search) with the alignment you get with BLASTP62. (In the afternoon lecture and workshop, we will discuss how different scoring matrices have different preferred evolutionary distances. BLOSUM50 is more sensitive for alignmets that are about 25% identical, while BLOSUM62 works best for 30% identical alignments. Keep that in mind while answering the following questions.)

    1. Do both Glutathione transferase domains align in the honey bee protein?? Does that imply that one of the domains is not present? Look at the alignments to the homologs above and below SSPA_ECO57. Based on those. aligments, do you think the Glutathione-S-Trfase C-like domain is really missing from the honey bee protein?
    2. Why did the alignment become shorter?
    3. Why might a domain appear to be present in the first (BLOSUM50) search, but not in the second (BLOSUM62)?
  6. Do the same Honey bee GSTD1 search (295842263) using the Course BLAST [pgm] WWW page.
    1. Take a look at the output.
      1. How long is the query sequence?
      2. How many sequences are in the PIR1 database?
      3. What scoring matrix was used?
      4. What were the gap penalties?
      5. What are the numbers after the description of the library sequence? Which one is best for inferring homology?
      6. Looking at an alignment, where are the boundaries of the alignment (the best local region)?

    2. What is the highest scoring non-homolog?

    3. How do the BLASTP E()-values compare with the FASTA (BLASTP62) E()-values for the distantly related mammalian and plant sequences?

  7. One of the slides this afternoon will show an alignment between human and bovin PCBP2 (Poly(rC) binding protein 2). One of the missing regions of the human protein is from residues 267-279 in PCB2_HUMAN (Q15366) when aligning A0A3Q1LX34_BOVIN.
    1. Use the align two sequences [pgm] link from the FASTA web site to align PCB2_HUMAN (enter into top box) with A0A3Q1LX34_BOVIN (lower box). Note the locations (coordinates) of the two gaps in PCB2_HUMAN.
    2. Go to the UCSC Genome web site to look at cow PCB2 gene. These coordinates display the cow genomic sequence from Chr5:26,543,511-26,562,251 (from the end of exon 10 to the beginning of exon 11) Use the View menu option to get the DNA sequence of the cow PCB2 gene. First select View/DNA Sequence, then click the Get DNA option at the bottom of the page. Copy the DNA sequence, and use the Use the align two sequences [pgm] link PCB2_HUMAN with cow PCB2 DNA sequence, that you copied/pasted from the UCSC Get DNA link.
    3. Use TFASTX to compare residues 267-289 of PCB2_HUMAN to the DNA sequence from the cow PCBP2 gene by pasting in the cow genomic sequence from Chr5:26543511:26562251 (from the end of exon 10 to the beginning of exon 11): Compare the subset range of PCBP2_HUMAN 267-279 to the cow genomic region using "Compare sequences" with TFASTX [pgm] using the "VT20" and "BLASTP62" matrices. How long are the alignments? The percent identity? What the bit scores? What are the E()-values of the two alignments? Try the same comparison with residues 169-198 of PCB2_HUMAN using VT20 to align exon 7 of PCB2_HUMAN to the cow genome region. Can you find the missing 4 amino acids at 169-172?

    Where to get the FASTA package: github.com/wrpearson/fasta36

    The "normal" FASTA WWW site:

    Contact Bill Pearson: wrp@virginia.edu