Abstract
Tandemly repeated structural motifs within a protein form highly stable supersecondary structural folds and provide multiple binding sites, giving rise to diverse protein-protein, protein-DNA and protein-RNA interactions. These proteins thus support a repertoire of cellular functions such as cell-cycle regulation, transcriptional regulation, cell differentiation, apoptosis, plant defense and bacterial invasion. The evolutionary conservation and disease association of repeat proteins underline their structural and functional importance and motivate their identification. Multiple copies of a repeat originate through intragenic duplication and recombination events, which initially yield repeating units that are identical at both the sequence and structure level. During the course of evolution, individual copies accumulate mutations, so that sequence similarity between repeating units can become undetectable even though the structural fold remains conserved. Traditional approaches to repeat identification and classification are mainly based on sequence patterns and have limited coverage, especially for repeats with low sequence conservation. Identification and classification of repeats at the structural level is therefore desirable.
In this thesis, we propose computationally efficient graph-based approaches for the identification of repeats in protein structures, and use them to build a database of structural repeat proteins. The three-dimensional topology of a protein structure is well captured by a graph, making it possible to analyze a complex protein structure as a mathematical entity. We observe that the principal eigenvector of the adjacency matrix (eigenvector centrality) captures the conserved residue interaction patterns within and between the repeating units, which are responsible for the regular tertiary fold of the repeat domain. The efficacy of the measure is tested on one of the most abundant repeat families, Ankyrin, by developing a rule-based approach derived from the pattern observed in the eigenvector centrality profiles of all known members of the family. The approach is evaluated on a benchmark dataset, and its predictions are compared with UniProt annotation and with other state-of-the-art sequence- and structure-based approaches.
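As an illustration of the measure described above, the following is a minimal sketch of computing an eigenvector centrality profile from a residue contact graph. It is not the thesis implementation: the function names, the use of C-alpha coordinates and the 7.0 angstrom contact cutoff are all assumptions made for the example.

```python
import math


def contact_graph(coords, cutoff=7.0):
    """Build an unweighted adjacency matrix from residue coordinates.

    Two residues are linked when their points lie within `cutoff` angstroms.
    The cutoff value is an assumption for this sketch.
    """
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Approximate the principal eigenvector of `adj` by power iteration.

    The result is normalized to unit Euclidean length.
    """
    n = len(adj)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Iterate with (A + I) instead of A: the eigenvectors are the same,
        # but the shift prevents oscillation on bipartite graphs.
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        nxt = [x / norm for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec


# Toy input: five points on a line, 3.8 angstroms apart, so only
# sequence neighbours are in contact (a simple path graph).
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
profile = eigenvector_centrality(contact_graph(toy_coords))
```

For a real structure, the per-residue values in `profile` would be plotted against residue number; the periodic pattern in such a profile is what the rule-based approach looks for.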
We extend the approach by developing a generalized algorithm, PRIGSA, for the identification of structural repeats using the eigenvector centrality profile (Alevc) and secondary structure architecture (SSA). The algorithm identifies members of known families by comparing the Alevc profile and SSA of a query structure with pre-computed profiles of known families, and also captures novel patterns de novo to identify previously uncharacterized repeat proteins. The approach is tested on other benchmark datasets, and its performance is compared with two repeat identification methods. The algorithm is then improved into PRIGSA2 by incorporating structure-based validation of predicted repeats through structural alignment and by checking the relative orientation of secondary structures within each repeat copy. The predictions of PRIGSA2 for 13 well-represented repeat families, comprising both solenoid and non-solenoid repeats, are compared with the annotation in UniProt.
The PRIGSA2 algorithm is executed on all protein chains reported in the Protein Data Bank to build a database of structural repeats in proteins, DbStRiPs. The de novo predicted repeats are clustered and manually curated. We report one novel repeat family, Left-handed Beta Helix (LbH), and 31 De novo Protein Repeat Clusters (DPRCs). We anticipate that the DPRCs, which are clusters of structurally similar repeat proteins, may give rise to novel families in subsequent updates of DbStRiPs as structural data increase. DbStRiPs reports 11,901 repeats in 10,816 PDB chains, categorized as known protein repeat families (KPRF) (74%), DPRC (6%) or Unclassified (20%). The repeat families and clusters are grouped in the database according to Kajava’s structural classification, and the unclassified repeats are grouped according to the secondary structure composition of the repeating unit.
We also developed a web server, NAPS, for network analysis of protein structures, providing a comprehensive platform for the construction and analysis of protein structure graphs. Five types of networks can be constructed, and various analyses can be carried out, including node and edge centrality, shortest paths, k-clique and graph spectral analyses. The web server provides a browser-independent interactive platform for visual analysis of protein structure and network, along with options to download the results in suitable formats.