Decoy peptides were removed before FDR calculation. When a target decoy method was used to estimate the FDR, the target and decoy databases were searched together.
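As a rough illustration of the estimate, the sketch below counts target and decoy matches above a score threshold in a combined search and reports their ratio. It assumes the common convention FDR ≈ (decoy hits) / (target hits); the exact formula used in this work is not restated here. The function name `estimate_fdr` and the `(score, is_decoy)` tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def estimate_fdr(psms: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR at a score threshold from a combined target/decoy search.

    Each PSM is a (score, is_decoy) pair. Assumption: the number of decoy hits
    above the threshold approximates the number of false target hits, so
    FDR ~= decoys / targets.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        # No target hits kept: report 0.0 rather than dividing by zero.
        return 0.0
    return decoys / targets


# Toy data (made up for illustration only).
example_psms = [
    (80.0, False), (75.0, False), (60.0, True),
    (55.0, False), (40.0, True), (35.0, False),
]
fdr_at_50 = estimate_fdr(example_psms, 50.0)
fdr_at_70 = estimate_fdr(example_psms, 70.0)
```

Decoy matches are only counted for the estimate; they are not reported among the identified peptides.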
This section demonstrates the relative performance of the de novo sequencing and database search approaches when analyzing the same data set. For each peptide reported by Mascot 2. It can be seen that the best separation of the target and decoy matches is achieved by a combination of both the database search score and the CAA score, clearly indicating the effectiveness of using de novo sequencing results in the peptide scoring.
Each data point represents a peptide found by Mascot database search. The x axis is the Mascot score, and the y axis is the number of amino acids matching the de novo sequencing result (CAA score). For a better view of the data density, a small random number between 0 and 0. The best separation of target and decoy matches is achieved by combining the CAA and Mascot scores (dashed line).
For Mascot to confidently identify a peptide, the required spectrum quality is different when databases of different sizes are used. As a result, the relative performance of de novo sequencing and database search varies.
When the P. The basic assumption of the target decoy and the decoy fusion methods is that the score distributions of the false target hits and the decoy hits are similar. Therefore, the number of decoy hits can be used to estimate the number of false target hits. Unfortunately, there is no effective way to verify this assumption directly on real data, because it is difficult to assess whether a target hit is true or false. Thus, the following simulated experiment was conducted to verify the assumption.
The CID data set was searched against the P. The peptides identified by all three engines were considered correct. A simulated database was created by keeping these peptides unchanged in the P. When a search engine is used to search this simulated database, the peptides that do not have a significant overlap (five or more amino acids) with the unchanged peptides can be safely regarded as false hits.
Thus, by using the simulated database as the target, the score distribution of the false target hits and the decoy hits can be compared.
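The labeling rule used in the simulated experiment can be sketched as follows. This is a minimal sketch under one assumption: "overlap" is interpreted as the longest contiguous stretch of amino acids shared by two sequences, which may differ from the exact criterion used in the study. The function names are hypothetical.

```python
from typing import Iterable


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of residues shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous residue of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_false_hit(peptide: str, unchanged_peptides: Iterable[str], min_overlap: int = 5) -> bool:
    """True when the peptide shares fewer than min_overlap contiguous residues
    with every peptide that was kept unchanged in the simulated database."""
    return all(
        longest_common_substring(peptide, kept) < min_overlap
        for kept in unchanged_peptides
    )


# Toy sequences (made up for illustration only).
unchanged = ["ELVISLIVESK", "PEPTIDEK"]
hit_with_overlap = is_false_hit("AAELVISGG", unchanged)    # shares "ELVIS" (5 residues)
hit_without_overlap = is_false_hit("GGGPEPTAAA", unchanged)  # shares at most "PEPT" (4 residues)
```

With the false target hits labeled this way, their score histogram can be compared directly against the histogram of decoy hits.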
Both decoy fusion and target decoy methods were examined, and the results are shown in Fig. The score distribution of the false target hits and the decoy hits when the simulated protein database was used. The height of each bar represents the number of PSMs around the corresponding score. The target decoy method produced fewer decoy hits than false target hits, which might cause FDR underestimation. The decoy fusion method has no such problem.
The result of Fig. The search with each of the three engines used the same set of parameters: the parent ion mass error tolerance was 15 ppm, and the fragment ion mass error tolerance was 0. Up to three missed cleavages were allowed in one peptide, and at most one end of each peptide could violate the enzyme cleavage rule. Recently, the Percolator program has been developed to improve Mascot database search results by rescoring with a rigorous machine learning method. It is not a self-contained database search engine.
Nevertheless, a comparison with the combination of Mascot and Percolator was also conducted. The x axis represents the number of peptide spectrum matches kept from the target sequences, and the y axis represents the corresponding FDR. The details of this comparison are included in the supplemental materials. The first conclusion from Figs. In terms of the total number of peptides identified, many search engines outperformed Mascot on the ETD data set in the iPRG study mentioned above. However, it is possible that the FDR estimation method used by the iPRG study and the relative experience of users in operating different software tools affected this ranking.
More details are provided in the full report of the iPRG study. This inaccuracy comes from two sources, both due to the fact that the decoy sequences are introduced as separate entries of the database.
First, the protein shortlisting step may select more target proteins than decoy proteins. This causes the false identifications in later steps to fall in the target proteins with a higher probability. The decoy fusion method avoids this problem by combining the target and decoy sequences in the same protein entry. This increases the scores of the random peptide matches in the highly confident target proteins. Consequently, more false hits will be reported from the target proteins than from the decoy proteins.
By fusing the target and decoy sequences together, the score increment is applied equally to the target and decoy peptide hits. Thus, the score distributions of the false target hits and decoy hits remain the same. There have been different opinions in the literature regarding the use of protein information in the peptide scoring function. On one hand, the protein information may compromise the reliability of the target decoy validation method and thus was not used in PeptideProphet (17) and is no longer used in the Mascot Percolator. On the other hand, Bern et al.
We argue that the use of the protein information is appropriate. By limiting the search to a protein database, a database search engine makes the implicit assumption that each peptide sequence appears in the sample with equal probability, prior to the search. This prior probability should be updated when another peptide from the same protein is identified with high confidence.
This will surely contribute toward the peptide identification sensitivity, but the use of the protein information does require a more robust result validation method than the standard target decoy approach.
The decoy fusion method proposed in this paper provides a very simple alternative to solve this problem. This is different from the approach used in Percolator, where the scoring function is retrained for each experiment after the search is completed, and the target and decoy peptides found by the search become known. Although the retraining may further improve the sensitivity, it exposes the decoy information to the scoring function.
This creates a risk of impairing the FDR estimation method. De novo sequencing was historically thought to be slow and to require spectra with higher mass accuracy. Therefore, it has been mostly used when the protein database was unavailable. Thanks to recent developments in computer algorithms and the continuous improvement of computers, speed is no longer an issue for de novo sequencing. High mass accuracy has also become available because of the development of new mass spectrometers such as the Orbitrap.
This makes de novo sequencing a viable choice for every mass spectrometry analysis in proteomics. De novo sequencing and database search should no longer be regarded as two separate approaches that are used in different circumstances. Instead, they should work together to provide better sensitivity and accuracy in proteomics analysis, as illustrated in this paper. Additionally, the spectra that produce highly confident de novo sequencing tags but no database hits are likely from novel or modified peptides.
The net outcome is an increase in both sensitivity and accuracy and an overall superior performance to other commonly used search engines.
We are grateful to Dr. Christine Vogel and Dr. Taejoon Kwon for providing the CID data set. The costs of publication of this article were defrayed in part by the payment of page charges. This article contains supplemental material.
Mol Cell Proteomics. Published online Dec. Gilles A. Received Apr 25; Revised Dec 4. Abstract: Many software tools have been developed for the automated identification of peptides from tandem mass spectra.
The details of these steps are discussed in the following sections.
Protein Shortlisting. In this step, the algorithm uses the de novo sequence tags to select a short list of proteins from the protein database.
Peptide Shortlisting. All of the peptide sequences digested in silico from the protein shortlist are compared against the input spectra to find peptide spectrum matches (PSMs).
Peptide Scoring. A more sophisticated scoring function is used to rerank the sequence candidates for each spectrum.
Result Validation. A modified target decoy approach, called decoy fusion, is used to estimate the FDR at any given score threshold.
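The idea of placing target and decoy sequences in the same protein entry can be sketched as below. This is only an illustration under stated assumptions: the decoy is produced here by reversing the target sequence, which is a stand-in and not necessarily how decoys are generated in this work, and the names `FusedEntry`, `fuse`, and `classify_hit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FusedEntry:
    name: str
    sequence: str       # target sequence immediately followed by its decoy
    target_length: int  # residues before this index belong to the target


def fuse(name: str, target: str) -> FusedEntry:
    """Build one database entry containing both the target and its decoy.

    Assumption: the decoy is the reversed target sequence.
    """
    decoy = target[::-1]
    return FusedEntry(name=name, sequence=target + decoy, target_length=len(target))


def classify_hit(entry: FusedEntry, start: int, length: int) -> str:
    """Label a peptide match by where it falls inside the fused entry."""
    end = start + length
    if end <= entry.target_length:
        return "target"
    if start >= entry.target_length:
        return "decoy"
    # The match spans the junction between target and decoy halves.
    return "boundary"


# Toy sequence (made up for illustration only).
entry = fuse("P1", "MKTAYIAK")
labels = [classify_hit(entry, 0, 5), classify_hit(entry, 8, 5), classify_hit(entry, 6, 4)]
```

Because both halves live in one entry, any step that selects or rewards a protein treats its target and decoy peptides alike.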
Rapid Commun. Mass Spectrom. Frank A. Fischer B. Taylor J. Perkins D. Electrophoresis 20 , — [ PubMed ] [ Google Scholar ]. Eng J. Craig R. Bioinformatics 20 , — [ PubMed ] [ Google Scholar ].
Geer L. Proteome Res. Chalkley R. New developments in protein prospector allow for reliable and comprehensive automatic analysis of large datasets. Proteomics 4 , — [ PubMed ] [ Google Scholar ]. Cox J. Kim S. Bell A. Kapp E. Proteomics 5 , — [ PubMed ] [ Google Scholar ]. Askenazi M. J Biomol Tech. Brosch M. Keller A. Elias J.