Thursday, 9 August 2012

Signals, C-term TMs and other feature mark-ups

I've been a long-time fan of InterPro and even managed to pen a review of its early incarnation. I have also been impressed by InterProScan but, like most users I suspect, I simply assumed the results for signal peptides and C-terminal transmembrane domains were generated on the fly by SignalP and TMHMM as part of the scan run. However, from using it over some time on a large set of sequences, some of which I had assembled or extended/corrected myself, I began to notice discordant results. In particular, certain proteins would have no signal or TM in the InterProScan output but were clear positives not only from stand-alone runs of both algorithms at CBS but also in the Ensembl mark-up of the same ORF. As an example, you can see the InterProScan 4.8 output for pasted-in sequences of the paralogues BACE1 (upper) and BACE2 (lower). The upper output has "lost" the signal and TM that are both clearly displayed in the lower.




My guess was that this could have been a scoring threshold issue, so I eventually sent a query to EBI feedback. The explanation from the InterPro team was as follows: if the scan detects a 100% UniProt match, the signal and TM detection is turned off, but for any other sequence the results are shown in the GUI (a local vs institutional licensing issue). This still did not explain the discordance I recorded for the two runs in the diagram above, until I did a BLAST check against UniProt and, sure enough, my pasted FASTA string for BACE2 differed from the UniProt entry by one residue!
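
For anyone wanting to run the same sanity check without a full BLAST search, here is a minimal Python sketch that lists the positions at which a pasted sequence differs from a reference. The function name and the toy sequences are made up for illustration (they are not BACE2), and it assumes the two sequences are the same length with substitutions only; anything involving insertions or deletions needs a proper alignment.

```python
def residue_differences(query: str, reference: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, query residue, reference residue) for each mismatch."""
    query = query.strip().upper()
    reference = reference.strip().upper()
    if len(query) != len(reference):
        # Assumption of this sketch: no indels, so lengths must agree
        raise ValueError("sequences differ in length; align them first")
    return [
        (pos, q, r)
        for pos, (q, r) in enumerate(zip(query, reference), start=1)
        if q != r
    ]


# Toy sequences for illustration only
pasted = "MKTAYIAKQRLLSVGC"
reference = "MKTAYIAKQRLLSVGA"

diffs = residue_differences(pasted, reference)
print(diffs)
print("100% match" if not diffs else f"{len(diffs)} residue difference(s)")
```

An empty list would mean the query is an exact match to the reference, which, going by the explanation above, is the case where the signal and TM mark-up is switched off.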

So where does this leave us? Firstly, I was informed that the imminent InterProScan 5.0 interface should not show this confusing behaviour. Secondly, as an interim workaround, you just need to change one residue to get the complete mark-up!
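
The one-residue workaround is easy enough to do by hand, but for a batch of sequences something like the following Python sketch could help. The function names, the toy sequence and the choice of position are all hypothetical. Picking a conservative substitution well away from the N-terminus, the TM and the active sites is my own assumption, made so the change is less likely to shift the predictions you actually care about.

```python
def substitute_one_residue(sequence: str, position: int, new_residue: str) -> str:
    """Return a copy of sequence with the residue at a 1-based position replaced."""
    if len(new_residue) != 1:
        raise ValueError("new_residue must be a single character")
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    if sequence[position - 1] == new_residue:
        raise ValueError("substitution would leave the sequence unchanged")
    return sequence[: position - 1] + new_residue + sequence[position:]


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record wrapped at the given line width."""
    lines = [sequence[i : i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy sequence for illustration only (not a real protein)
original = "MKTAYIAKQRLLSVGC"
mutated = substitute_one_residue(original, 8, "R")  # K8R, a conservative swap
print(to_fasta("toy_K8R", mutated))
```

Remember that the pasted sequence is then no longer the database sequence, so it is worth keeping a note of which residue was changed.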

For reasons that I hope will be expanded on in print in the not too distant future (but which have already been alluded to in this Sea-squirt-ur-bace post), one of the hallmarks of "BACE-likeness" is an aspartyl protease domain flanked by a signal peptide at the N-terminal end and a transmembrane domain at the C-terminal end (hence my vexation over the inconsistency of the scan results). As a corroborative cross-check on our zoo of homologues I looked around at other sources of GUI mark-up for sequence features. A panel of results is shown below for the human BACE1 sequence.
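
The "signal, protease domain, TM" ordering can be expressed as a simple check once the feature coordinates have been collected from whichever tool you trust. The Python sketch below is only an illustration of that ordering test: the `Feature` class, the feature labels and the coordinates are invented for the example and do not come from any of the servers discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str   # hypothetical labels: "signal", "asp_protease", "tm"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def looks_bace_like(features: list[Feature]) -> bool:
    """True if some signal precedes an aspartyl protease domain that precedes a TM."""
    signals = [f for f in features if f.kind == "signal"]
    domains = [f for f in features if f.kind == "asp_protease"]
    tms = [f for f in features if f.kind == "tm"]
    return any(
        s.end < d.start and d.end < t.start
        for s in signals
        for d in domains
        for t in tms
    )


# Invented coordinates for illustration only
full_markup = [
    Feature("signal", 1, 20),
    Feature("asp_protease", 70, 410),
    Feature("tm", 455, 477),
]
lost_markup = [Feature("asp_protease", 70, 410)]  # signal and TM missing

print(looks_bace_like(full_markup))
print(looks_bace_like(lost_markup))
```

A scan output that has "lost" the signal and TM would fail this check even though the protein itself has not changed, which is why the source of each feature matters.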




Including the BACE1 InterProScan above in fig. 1, each of the renderings brings out different features. Starting at the top, SMART does a good job of numbering intron/exon boundaries, so we see the C-TM is encompassed by exon 9 (but Ensembl also has this information). However, it produces a false negative for the signal. The Ensembl mark-up (2nd) gets all the property sections (sig, TM and low-complexity) but only runs PRINTS and Pfam rather than the other InterPro options. Pfam (3rd) is concordant on the property sections, but for signals and TMs they use Phobius, the stand-alone output of which is the 4th graphic. Simply as section delineations the Phobius mark-up is redundant with some of the others, but note that it can be important to see the complete scoring pattern for equivocal results (e.g. signals vs N-term TMs). Pfam picks up both active-site residues and, uniquely, shows 2D disulphide connectivity (presumably parsed from UniProt, and what we used to be able to see in the old Expasy records). Note that the NCBI Conserved Domain server (5th trace) marks up features extracted from PDB structures, namely by extrapolating human inhibitor ligand-binding interaction positions and the substrate flap. Interestingly, it picks up simple composition runs but strips them out for searching. Comparing these five with the InterProScan output in figure 1, the distal active site does not match the PROSITE motif (but both are picked up in BACE2). There is also no low-complexity mark-up in the GUI.

We can generate a 6th, aggregated set of feature mark-ups extracted directly from a UniProtKB/Swiss-Prot entry (although it appears to be offered only as an option with BLAST output), which you can see below.


It should be noted that these are derived from expert curation imposed on top of the algorithmic results (it would be interesting to know in how many cases curators "override" these as false negatives or false positives, particularly regarding their exact boundaries). We get some unique information here, such as the protease propeptide (I have seen this in some InterProScan outputs, but this also seems inconsistent) and the (experimentally verified) glycosylation positions. You also get an additional "region" feature that, in this case, is the (blue) cytoplasmic tail, distal to the (yellow) TM, which is a reported interaction site.

The 7th set is from the Distributed Annotation System (DAS) DASTY server (below).



This is largely parsed from the UniProt entry and you can take a look at the details. However, there are a couple of unique bits again. One is the output (flagged with a warning) from what may be an older version of the sequence, but two residues shorter than the current signal (!). The second is a set of PRIDE peptide "hooks" supported by MS data.

In conclusion, we see here the usual bioinformatics multi-stop-shop dilemma, in this case finding no fewer than seven "shops" for additive protein feature comparison (and the list is by no means complete either). Users also need to be circumspect about the difference between de novo (never eyeballed) algorithm outputs, those where curators have passed (blessed?) that output as annotation (e.g. for a Swiss-Prot entry), and those features that may have experimental support. For example, even though SignalP has a very high specificity, there is still only a small fraction of Swiss-Prot entries where the peptide removal has been experimentally verified by N-terminal sequencing or mass spectrometry approaches.
