Feature extraction and clustering

In the structures deposited in the Protein Data Bank (PDB), every CDR loop except CDRH3 adopts only a small number of structural conformations. This may reflect the preferences of the germline from which each antibody derives. Note also that the PDB is heavily skewed towards humanised or engineered antibodies of pharmaceutical interest, in which CDR loops that formed after antigen exposure and maturation in another species are grafted onto a human antibody framework. Some frameworks may be predisposed to particular conformations in the CDR loops they can accommodate. These limited conformations are called the “canonical forms” of the CDRs.

In structural space, we can use either the dihedral angles or the all-atom root-mean-square deviation (RMSD) between two CDR loops as the distance measure for clustering them into canonical forms. A simple hierarchical clustering can then be applied. For instance, the unweighted pair group method with arithmetic mean (UPGMA) repeatedly merges the two clusters with the smallest average pairwise distance between their members; every member counts equally in that average, whatever the size of the cluster it belongs to. It is an agglomerative (bottom-up) method, so the number of clusters does not have to be fixed in advance, although a distance cutoff is still needed to decide where to stop merging.
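The structural clustering step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the angular distance 2(1 - cos Δ) is one possible choice of dihedral metric, and the function names, toy angle values, and cutoff of 0.1 are all hypothetical.

```python
import math


def dihedral_distance(a, b):
    """Mean of 2 * (1 - cos(delta)) over matched dihedral angles (degrees).

    The cosine makes the measure periodic, so -179 and +179 count as close.
    This particular metric is an assumption made for the sketch.
    """
    if len(a) != len(b):
        raise ValueError("loops must have the same number of dihedral angles")
    return sum(
        2.0 * (1.0 - math.cos(math.radians(x - y))) for x, y in zip(a, b)
    ) / len(a)


def upgma(items, distance, cutoff):
    """Average-linkage agglomeration, returning clusters of item indices.

    Merging stops once the closest pair of clusters is farther apart
    than `cutoff`.
    """
    n = len(items)
    d = [[distance(items[i], items[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per loop

    def average(c1, c2):
        # Arithmetic mean over all member pairs: every member counts equally.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (p, q) = min(
            (average(clusters[p], clusters[q]), (p, q))
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best > cutoff:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Toy dihedral angles for five equal-length loops (made-up values).
toy_loops = [
    (-60.0, -45.0, -65.0, -40.0),
    (-62.0, -43.0, -63.0, -42.0),
    (-120.0, 130.0, -115.0, 135.0),
    (-118.0, 128.0, -117.0, 133.0),
    (60.0, 45.0, 55.0, 50.0),
]
clusters = upgma(toy_loops, dihedral_distance, cutoff=0.1)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

The cutoff plays the role of the level at which the tree is cut: a larger value merges more loops into each canonical form, a smaller one splits them further. For real data sets a library routine for average linkage would be the more practical choice; the explicit loop here only makes the merging rule visible.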

What makes protein folding such a pressing problem is that the techniques used for determining protein structures have their limitations: a) cost; b) time; c) whether the protein can be expressed and crystallised; and d) whether the resulting resolution is adequate. Ideally we would have a structure for every protein in the world. Taking a small step back, we instead wish to use what we can readily obtain, i.e. the sequence of the protein, to predict the structural conformation it might adopt.

Given the structural clusters, we would like to turn to the sequences and ask: can we predict the canonical form just by looking at the sequences of the CDR loops in a cluster? In some cases, more than one sequence adopts the same structure (and is hence assigned to the same canonical cluster). We use sequence similarity as the distance measure, but this time, since the distance is position-dependent, we consider the minimal distance between any member of a cluster and the point to be assigned (single linkage). Alternatively, we can use affinity propagation, which passes messages between pairs of points to identify exemplar points and assigns every other point to one of them. From these sequence clusters, we look at the member sequences and attempt to recover an “average” or generic sequence, together with the degree of conservation of each residue at each position of the loop. This way, we might be able to avoid the usual Hidden Markov Model for sequence-based canonical cluster assignment and use a more transparent method instead.
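As a rough sketch of the sequence step, the snippet below groups loop sequences by single linkage and then derives a generic sequence with a per-position conservation score. Everything in it is an assumption made for illustration: the distance is the fraction of mismatched positions, loops of different length are treated as maximally distant, the cutoff of 0.2 is arbitrary, and the sequences are toy strings rather than curated PDB data.

```python
from collections import Counter


def mismatch_fraction(s, t):
    """Position-by-position fraction of differing residues.

    Loops of different length are treated as maximally distant (1.0);
    this is a simplifying assumption of the sketch.
    """
    if len(s) != len(t):
        return 1.0
    return sum(a != b for a, b in zip(s, t)) / len(s)


def closest_pair_within(clusters, cutoff):
    """Return indices (p, q) of two clusters whose nearest members lie
    within `cutoff` of each other, or None if no such pair exists."""
    for p in range(len(clusters)):
        for q in range(p + 1, len(clusters)):
            nearest = min(
                mismatch_fraction(a, b)
                for a in clusters[p]
                for b in clusters[q]
            )
            if nearest <= cutoff:
                return p, q
    return None


def single_linkage(sequences, cutoff):
    """Merge clusters for as long as any two of them have a pair of
    members within `cutoff` (single linkage cut at that threshold)."""
    clusters = [[s] for s in sequences]
    pair = closest_pair_within(clusters, cutoff)
    while pair is not None:
        p, q = pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
        pair = closest_pair_within(clusters, cutoff)
    return clusters


def consensus_profile(cluster):
    """For each loop position, the most common residue and the fraction
    of cluster members that carry it."""
    profile = []
    for column in zip(*cluster):
        residue, count = Counter(column).most_common(1)[0]
        profile.append((residue, count / len(cluster)))
    return profile


# Toy loop sequences, for illustration only.
toy_sequences = [
    "RASQSISSYLN",
    "RASQGISSYLN",
    "RASQSISNYLN",
    "KSSQSLLHSNG",
    "KSSQSLLYSNG",
]
sequence_clusters = single_linkage(toy_sequences, cutoff=0.2)
profile = consensus_profile(sequence_clusters[0])
generic = "".join(residue for residue, _ in profile)
print(len(sequence_clusters), generic)  # 2 RASQSISSYLN
```

The resulting profile is directly readable: each position carries one residue and a conservation value between 0 and 1 (here positions 5 and 8 of the first cluster are conserved in two of three members), which is what makes this kind of assignment more transparent than a trained model.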