A few notes on protein sequence analysis

One branch of protein bioinformatics studies the relationship between protein sequences and the structures they fold into. This is most prominent in homology modelling, where we align a query sequence against a database of sequences with known structures, pick the hit with the highest similarity, and use its structure as a template for the query. Each of these steps is a field in its own right: sequence alignment, similarity measurement and structural remodelling. Here I mention a number of tools used in my current research group:

Sequence alignment

Dynamic programming approaches such as the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms score aligned residues with substitution matrices (such as BLOSUM and PAM) and penalise insertions and deletions with gap penalties. We, however, favour hidden Markov models. HMMER is a very fast tool for building profile HMMs from multiple sequence alignments and aligning sequences to them. For our use case, as in ANARCI (Antigen receptor Numbering And Receptor ClassificatIon), we align a query antibody or T cell receptor sequence to a database of sequences of the respective type, select the top hit, and use the numbering scheme associated with the hit to annotate the query.
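To make the dynamic programming idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment in plain Python. It assumes a linear gap penalty and a toy match/mismatch score (toy_score is a made-up stand-in for a real substitution matrix such as BLOSUM62); it is an illustration only, not what HMMER or ANARCI do internally.

```python
def needleman_wunsch(a, b, score, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    `score(x, y)` gives the substitution score for residues x and y,
    and `gap` is a linear (per-position) gap penalty.
    """
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                F[i - 1][j] + gap,                            # gap in b
                F[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def toy_score(x, y):
    # Hypothetical scoring: +1 for identity, -1 otherwise.
    return 1 if x == y else -1


print(needleman_wunsch("ACD", "AD", toy_score, gap=-2))
```

Swapping toy_score for a lookup into a real substitution matrix, or changing gap, changes which alignment wins, which is exactly the parameter sensitivity discussed in the next section.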

Similarity measure

How we parameterise gap penalties and substitution matrices will bias the similarity score returned for each alignment hit. BLOSUM62 was derived from the amino acid substitution frequencies observed in local alignments of conserved regions of protein families in the BLOCKS database, with sequences clustered at 62% identity; PAM matrices are built from the point accepted mutations observed between closely related proteins and extrapolated to a given evolutionary interval. Environment-specific substitution tables (ESSTs) incorporate knowledge of the structural constraints on different amino acids to compute substitution likelihoods. ESSTs have been developed for membrane proteins and antibodies to improve alignments, and thus the selection of a good structural template.
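As a rough illustration of how observed substitution frequencies become matrix entries, the sketch below computes BLOSUM-style log-odds scores from aligned-pair counts. The counts are invented for the example, and the identity-based clustering step used for the real BLOSUM62 is left out; the function name log_odds_scores and the half-bit scaling (scale=2.0) are assumptions of this sketch.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts, scale=2.0):
    """Turn unordered aligned-pair counts into rounded log-odds scores.

    `pair_counts` maps (a, b) with a <= b to the number of times residues
    a and b were seen aligned. Pairs with zero counts must be omitted.
    """
    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: count / total for pair, count in pair_counts.items()}
    # Background frequency of each residue, implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Frequency expected by chance if residues paired independently.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Invented toy counts for two residues, alanine (A) and serine (S).
toy_counts = {("A", "A"): 40, ("A", "S"): 20, ("S", "S"): 40}
print(log_odds_scores(toy_counts))
```

With these toy counts, A-S pairs occur less often than chance would predict, so the A-S score is negative and the identities score positive. An ESST follows the same log-odds logic but keeps a separate table for each structural environment.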