Bioinformatics

Protein structure predictions

Modern biological experimentation requires computational techniques of different kinds to enable large-scale and high-throughput studies. For example, structural genomics efforts aim to understand the function of proteins from their 3-D structures, which are either determined by experimental methods (e.g., X-ray crystallography and NMR spectroscopy) or predicted by computational methods (e.g., comparative modelling, fold recognition, and ab initio prediction). As another example, proteomics efforts (in the most inclusive sense of the term) aim to understand the functional consequences of the collection of proteins that is present in a cell or tissue at a given time, particularly where differences are observed between healthy and disease states.

In both examples the data, and the analytical methodology applied to them, are central to accomplishing the aims of these scientific domains. In addition, however, a framework is required that allows researchers to access the data, interpret the data, and exchange knowledge with one another. Most of the infrastructures currently in use enable straightforward access to centrally stored experimental data via large databases, and to tools and services via either web servers or repositories. Their existence and availability have undoubtedly been seminal in establishing the important position held by applied bioinformatics research in many biological experimental laboratories today, and they will continue to play a role in the future. However, their versatility (e.g., with respect to heterogeneous data types) and expandability (e.g., with respect to the ease with which data is shared amongst different research groups) are limited. In our view there is still ample room for improvement, even re-invention, of the framework underpinning biological and bioinformatics research interactions.

The value of integrating resources and researchers more effectively has been recognised by many others in the field, and recently a number of new infrastructures have emerged that address some of the shortcomings and foster interactions. Many of them focus exclusively on facilitating bioinformatics research within specific domains, and range from workbench-style environments to peer-to-peer systems for small specialised groups. In our own research we were interested in realising a more general framework with OpenKnowledge.

In this context, any experimental protocol that is followed when one or several researchers undertake a bioinformatics experiment can be viewed as a series of interactions between the researcher(s), the databases from which the data are obtained, and the tools that are applied to derive secondary information from these data. Many bioinformatics protocols can be represented as consecutive interactions, or steps in a workflow. Moreover, an improved or novel framework should build on existing network connections (such as the internet, or the Grid). Accordingly the developments by the myGrid project group, such as the Computer-Aided Software Engineering (CASE) tool Taverna, currently play the most prominent role in the sector of automated experimentation enactment in bioinformatics. However, one weakness of this design is that, while it facilitates reproducible research, it cannot be extended to share resources (tools, data, knowledge) as effectively as is conceivable across a peer-to-peer network.

Below we describe how the OpenKnowledge P2P infrastructure was used to enact bioinformatics analyses: consistency checking amongst comparable data from different bioinformatics programs (ranked lists of short amino acid sequences that could have yielded a given tandem mass spectrum) and from different databases (atomic coordinates of modelled 3-D structures of yeast proteins), and peer ranking of different tools for protein identification by MS/MS.
- 1. Protein Identification by MS/MS in Proteomics: MS/MS involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyze complex mixtures of proteins. The two most frequently used computational approaches to recognizing sequences from mass spectra are:
  (a) the peptide fragment fingerprinting (PFF) approach, in which spectrum analysis is performed specifically for candidate proteins extracted from a database, by building theoretical model spectra from the candidate sequences and comparing the experimental spectra with these theoretical model spectra. This approach is not suitable for proteins carrying post-translational modifications (PTMs) that are missing from the database, or for proteins from unsequenced genomes.
  (b) the de novo sequencing approach, in which the inference of peptide sequences or partial sequences is independent of information extracted from pre-existing protein or DNA databases. Sequence similarity search algorithms have been specially developed to compare the inferred complete or partial sequences with theoretical sequences. Once a protein has been sequenced by de novo methods, one can look for related proteins in a GTDB using a matching algorithm such as MS-BLAST.
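The comparison of experimental and theoretical spectra in the peptide fragment fingerprinting approach in (a) can be illustrated with a minimal sketch. It assumes singly charged b- and y-ions only, monoisotopic residue masses, and a fixed m/z tolerance; the function names and the simple match-count score are illustrative assumptions, not the scoring used by any particular search engine.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O, added to y-ions
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_mz(peptide):
    """Return (b_ions, y_ions): singly charged fragment m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matches(observed_mz, peptide, tolerance=0.5):
    """Count theoretical fragments that have an observed peak within tolerance."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mz)
        for theo in b_ions + y_ions
    )


def best_candidate(observed_mz, candidates, tolerance=0.5):
    """Hypothetical scoring: pick the candidate with the most matched fragments."""
    return max(candidates, key=lambda p: count_matches(observed_mz, p, tolerance))
```

A candidate whose sequence differs from the true peptide, or which carries a modification absent from the database, produces shifted fragment masses and therefore fewer matches, which is why this approach depends on the database containing the right sequence.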
  
  Consistency-checking of de novo sequencing tools: In this scenario, we investigate the possibility of improving the accuracy of peptide sequence identification through consistency-checking of the results from different de novo sequencing methods for MS/MS interpretation. The automated harvesting of results from different de novo sequencing tools for a target mass spectrum, and their collation to allow easy comparison of answers and assessment of the consistency between them, was realized through the OpenKnowledge framework. Three example de novo sequencing tools, a web server (PEAKS) and two local programs (PepNovo and Lutefisk), each wrapped in OpenKnowledge Components (OKCs), act as data sources providing peptide identifications. A data-comparer peer then compares these answers to obtain a re-ranked list of candidate peptide sequences at three different confidence levels. Recursion is employed to allow submission and analysis of multiple target mass spectra.
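The re-ranking step can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify how the three confidence levels are defined, so here a peptide proposed by all tools is "high", by more than one tool "medium", and by a single tool "low". Treating isoleucine and leucine as identical (they have the same mass) is a further assumption, and all function names are hypothetical.

```python
from collections import defaultdict


def normalise(peptide):
    """Assumption: I and L are isobaric, so treat them as the same residue."""
    return peptide.upper().replace("I", "L")


def confidence_level(n_supporting, n_tools):
    """Hypothetical mapping from tool agreement to a confidence level."""
    if n_supporting == n_tools:
        return "high"
    if n_supporting > 1:
        return "medium"
    return "low"


def consensus_rerank(results):
    """Re-rank candidates by agreement between tools.

    results maps a tool name to its ranked candidate list (best first).
    Returns tuples (peptide, level, n_supporting_tools, mean_rank), ordered by
    decreasing support and then by increasing mean rank. Peptides are reported
    in normalised form (I written as L).
    """
    ranks = defaultdict(dict)  # peptide -> {tool: best rank from that tool}
    for tool, ranked in results.items():
        for rank, peptide in enumerate(ranked, start=1):
            ranks[normalise(peptide)].setdefault(tool, rank)

    n_tools = len(results)
    rows = []
    for peptide, by_tool in ranks.items():
        mean_rank = sum(by_tool.values()) / len(by_tool)
        level = confidence_level(len(by_tool), n_tools)
        rows.append((peptide, level, len(by_tool), mean_rank))
    return sorted(rows, key=lambda r: (-r[2], r[3], r[0]))


example = {
    "PEAKS": ["ELVISK", "PEPTIDE"],
    "PepNovo": ["ELVLSK", "SAMPLER"],
    "Lutefisk": ["ELVISK", "PEPTIDE"],
}
for row in consensus_rerank(example):
    print(row)
```

Processing several target spectra then amounts to applying `consensus_rerank` once per spectrum to the lists returned by the tools.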
  
  Peer Ranking of protein identification by MS/MS tools: There is a level of confusion surrounding the selection of a specific protein identification approach using validated MS/MS data, and of a specific protein identification tool, for a given task. The peer ranking algorithm integrated into the OpenKnowledge system helps to evaluate the relative popularity or importance of different protein sequence identification tools employing different approaches: the PFF approach (the subscribed example peers MASCOT and OMSSA are available to play this role), and a combination of de novo sequencing (performed by the example peers PepNovo or Lutefisk) followed by database similarity searching (performed by the example peer MS-BLAST).
- 2. Protein Structure Prediction
  
  Protein structure prediction is one of the best-known goals pursued by bioinformaticians. A protein’s three-dimensional (3-D) structure contributes crucially to understanding its function and to targeting it in applications such as drug discovery and enzyme design. However, there is a continually widening gap between the number of protein amino acid sequences that are deduced rapidly through the ongoing genomics efforts, and the number of proteins for which atomic coordinates of their 3-D structures are deposited in the Protein Data Bank (PDB), i.e. those that are determined by structural biological techniques. To bridge this gap many computational biology research groups have been focussing on developing ever-improving methodology to produce structural models for proteins based on their amino acid sequences. Still, the resulting methods are far from perfect, and no single method always produces an accurate model. However, particularly in comparative modelling cases (where a protein with known structure can be used as a template for a protein of interest, based on similarity between their sequences), high-quality modelled structures can be useful resources for biological research. Consistency checking and consensus building are commonly used strategies in the field to select high-quality models from the pool of available models produced by different methods.
  
  Consistency-checking of 3-D Models for Yeast Protein Structure Prediction: As in the consistency-checking experiment for de novo sequencing by MS/MS, in this experiment we aimed to check consistency among pre-computed comparative models from three public repositories, for the proteins encoded by the genome of the budding yeast Saccharomyces cerevisiae. The three example public repositories, SWISS-MODEL, ModBase, and SAM-T02, provide protein 3-D models generated by different structure prediction approaches and pipelines. Systematic retrieval and comparison of the three data sources were carried out over the yeast proteins. The results were made available as a new resource called the Comparison of Yeast 3-D Structure Prediction (CYSP).
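One way to quantify how consistent two models of the same protein are is sketched below. The source does not state which measure was used for CYSP; this example assumes a distance RMSD (dRMSD) over C-alpha atoms, which compares the internal distances within each model and therefore needs no prior superposition. It also assumes that the coordinate lists have already been restricted to the same residues in the same order. All names are illustrative.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Distance RMSD between two models given as lists of (x, y, z) tuples.

    Both lists must describe the same residues in the same order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("models must cover the same residues")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    squared_sum = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(coords_a)), 2):
        # Difference between the same intra-model distance in the two models.
        diff = math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])
        squared_sum += diff * diff
        n_pairs += 1
    return math.sqrt(squared_sum / n_pairs)


def pairwise_consistency(models):
    """Compare every pair of repositories.

    models maps a repository name to its C-alpha coordinate list.
    Returns {(name_a, name_b): dRMSD} for each unordered pair of names.
    """
    return {
        (a, b): distance_rmsd(models[a], models[b])
        for a, b in combinations(sorted(models), 2)
    }
```

Low values for all pairs indicate that the repositories agree on the fold of the modelled region, while a single outlier pair points to the model that deviates from the other two.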
- Experiments in the field of proteomics
  
  OK-omics
  OK-omics is a new form of knowledge sharing for expression proteomics with the aim of (1) significantly increasing the percentage of peptides and proteins that are sequenced and identified by means of mass-spectrometry-based analysis, and (2) significantly reducing the time needed for sequencing and identification. For this we combine current bioinformatics techniques for proteomics with novel multiagent system architectures and distributed knowledge coordination mechanisms in peer-to-peer networks, which have been developed in the context of the OpenKnowledge project.
  
  Peer-to-Peer Proteomics
  It is an important problem in proteomics to identify known and new protein sequences using high-throughput methods. Protein sequences are usually stored in public databases. However, these protein sequences are mostly inferred by the direct translation of gene sequences, not directly determined by physical experiments. This means that neither proteins with post-translational modifications (PTMs) nor proteins from organisms whose genomes have not been sequenced would find exact matches in such databases. An efficient experimental technique for the identification of proteins is mass spectrometry (MS). However, among other factors the following issues complicate this task:
  - the number of admitted PTMs can multiply the volume of results to be analysed;
  - bad quality and noise in mass spectra increase uncertainty of interpretation; and
  - database errors in sequence annotations can lead to misinterpretation.
  These obstacles indirectly produce a huge amount of uninterpreted data (for instance, non-matching mass spectra or low-scoring de novo interpreted sequences), which are likely to be discarded. The unmatched data could be due to peptides derived from novel proteins, allelic or species-derived variants of known proteins, or PTMs. At present this uninterpreted data is seldom accessible to other groups involved in the identification of the same or homologous proteins. If data coming from different laboratories could be compared, new matches and useful data could eventually be discovered. We envision many advantages with this new approach: other laboratories (peers) could provide the missing information for an incomplete spectrum or sequence and so complete the process of identification; moreover, matches could help to recognise new proteins and identify PTMs.
  
  We have drawn a new scenario where the information to be searched is no longer centralised in a few repositories, but where information gathered from experiments in peer proteomics laboratories can be searched by fellow researchers. Rather than centralising all the data in a single repository, with all the problems that entails, we believe it is better to maintain the information locally in each of the laboratories. This decentralised data storage needs a decentralised searching mechanism, and the agent-regulated P2P technologies developed in OpenKnowledge aim to address this need. A P2P network provides methods for accessing distributed resources with minimal maintenance cost. It also provides scalable techniques to search through large amounts of resources scattered through the network. Furthermore, joining or leaving the network becomes a simple task. These properties make P2P technology an ideal candidate for implementing our search through proteomics laboratories. Other distributed storage systems, such as distributed databases or federated storage services, have been developed with efficiency in mind, but their maintenance and joining costs are very high. A proteomics laboratory acting as a peer in a P2P network will share its complete or partial data repository so that both other peers and the laboratory itself can benefit from it.
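The idea of keeping data local while searching it across laboratories can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the OpenKnowledge mechanism: peers are plain objects, queries are propagated by simple hop-limited flooding, and a "match" is an exact substring match on stored peptide sequences. The class and method names are hypothetical.

```python
class LabPeer:
    """A laboratory that keeps its own sequence repository and knows some neighbours."""

    def __init__(self, name, local_sequences):
        self.name = name
        self.local_sequences = list(local_sequences)  # data never leaves the lab
        self.neighbours = []

    def connect(self, other):
        """Joining the network only requires linking to an existing peer."""
        if other is not self and other not in self.neighbours:
            self.neighbours.append(other)
            other.neighbours.append(self)

    def local_matches(self, query):
        """Assumed matching rule: the query occurs as a substring."""
        return [seq for seq in self.local_sequences if query in seq]

    def search(self, query, max_hops=3):
        """Flood the query outwards hop by hop and collect matches per laboratory."""
        visited = {self.name}
        frontier = [self]
        hits = {}
        for _ in range(max_hops + 1):
            next_frontier = []
            for peer in frontier:
                matches = peer.local_matches(query)
                if matches:
                    hits[peer.name] = matches
                for neighbour in peer.neighbours:
                    if neighbour.name not in visited:
                        visited.add(neighbour.name)
                        next_frontier.append(neighbour)
            frontier = next_frontier
        return hits


lab_a = LabPeer("lab_a", ["MKTAYIAK"])
lab_b = LabPeer("lab_b", ["GGSSGG"])
lab_c = LabPeer("lab_c", ["GGMKTAYLL"])
lab_a.connect(lab_b)
lab_b.connect(lab_c)
print(lab_a.search("MKTAY"))
```

Each laboratory answers only from its own repository, so no central copy of the data is needed, and a new laboratory becomes searchable as soon as it connects to one existing peer.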
  The Experiment
  We have carried out an experiment in which we set up a P2P network of nine proteomics laboratories from ProteoRed, Spain’s National Proteomics Institute, each handling its own database of sequences and spectra. When queried, a laboratory looked for matches between the input sequence or spectrum and the information collected in its database. For our test data we used pre-existing MS/MS data from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample. It consists of a mixture of 48 purified and recombinant proteins (plus an unknown number of protein contaminants) extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey. Seventy-eight laboratories participated in the analysis of these mixtures. Among these, only 35% could correctly identify more than 40 protein components. Thus the sample, while relatively convenient for testing the OK system, is of a complexity not far from that found in real proteomics work.
References:
- D. Gerloff, X. Quan, C. Walton, D. Robertson, M. Schorlemmer, J. Abian, C. Sierra, and L. Bernacchioni. Bioinformatics scenarios. Deliverable D6.1, OpenKnowledge, 2006.
- J. Abian, M. Atencia, P. Besana, L. Bernacchioni, D. Gerloff, S. Leung, J. Magasin, A. Perreau de Pinninck, X. Quan, D. Robertson, M. Schorlemmer, J. Sharman, and C. Walton. Bioinformatics interaction models. Deliverable D6.3, OpenKnowledge, 2008.
- A. Perreau de Pinninck, C. Sierra, C. Walton, D. de la Cruz, D. Robertson, D. Gerloff, E. Jaen, Q. Li, J. Sharman, J. Abian, M. Schorlemmer, P. Besana, S. Leung, and X. Quan. Summative Report on Bioinformatics Case Studies. Deliverable D6.4, OpenKnowledge, 2008.