P- vs Q- vs PEP-values in Mass Spec Database Search
We often see both P- and Q-values in the literature and in the search results of mass spec software, but how many of us actually know how these numbers are generated? OK, you may say you know that a P-value is a probability. But what does this really mean? Can you explain it?
For peptide-spectrum matching (PSM), let’s say you have an experimental MS/MS spectrum X with 60 peaks and a theoretical MS/MS spectrum Y (derived from protein database sequences) with 32 peaks. If 23 peaks match between these two spectra, how likely is this to occur by chance?
Actually, this is not enough to calculate the probability of such an event, because you also need to know how many possible peak positions there are in the spectra. To keep it simple, let’s think about this scenario. Spectrum X’s peaks range from m/z 350 to 700, whereas spectrum Y’s range from m/z 400 to 850, so the two overlap between m/z 400 and 700. With a mass tolerance of 0.5 m/z, you divide this range by 0.5: (700-400)/0.5 = 600. This is the number of bins, or positions, in the overlapping range.
Calculate Possible Occasions for Experimental MS/MS
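The likelihood of 23 peaks matching between spectrum X (60 peaks) and spectrum Y (32 peaks) within 600 possible peak positions is calculated by

$$P = \frac{\binom{m}{n}\binom{N-m}{l-n}}{\binom{N}{l}} = \frac{\binom{32}{23}\binom{568}{37}}{\binom{600}{60}}$$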
In this formula, the denominator counts all possible ways to pick l (60) peaks out of N (600) positions. The numerator is the product of two numbers: the number of ways to pick n (23) positions out of m (32), and the number of ways to pick 37 out of 568. The first is the number of ways the 23 matched peaks can fall among the 32 peaks of the database spectrum. The second is the number of ways the 37 unmatched peaks (60 - 23) can fall in the remaining 568 bins (600 - 32) of the range. The P-value calculated from this formula is 7.72E-35. This is a very, very small number, and it means that the scenario described here is highly unlikely to arise from random matching. But is this true?
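The counting argument above can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumption of equally likely bins; the function names `overlap_bins` and `match_probability` are made up for this post and do not come from any search engine. The printed values depend on the inputs and on whether you take the single term for exactly 23 matches or sum the tail for 23 or more, so they may not reproduce the figure quoted above.

```python
from math import comb

def overlap_bins(range_x, range_y, tolerance):
    """Number of m/z bins in the region shared by two spectra."""
    low = max(range_x[0], range_y[0])
    high = min(range_x[1], range_y[1])
    if high <= low:
        return 0
    return round((high - low) / tolerance)

def match_probability(n_bins, n_exp, n_theo, n_match):
    """Probability of exactly n_match shared bins under random placement.

    n_bins  = N, positions in the overlapping range
    n_exp   = l, peaks in the experimental spectrum
    n_theo  = m, peaks in the theoretical spectrum
    n_match = n, peaks that match
    """
    return (comb(n_theo, n_match) * comb(n_bins - n_theo, n_exp - n_match)
            / comb(n_bins, n_exp))

N = overlap_bins((350, 700), (400, 850), 0.5)
p_exact = match_probability(N, 60, 32, 23)
# Upper tail: 23 or more matches (at most 32 peaks can match).
p_tail = sum(match_probability(N, 60, 32, k) for k in range(23, 33))
print(N, p_exact, p_tail)
```

Summing the tail rather than taking the single term is the more conventional way to state a P-value, since it answers "how likely is a match at least this good by chance?".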
Peak matching is not a completely random event
In fact, spectrum matching does not behave like random matching. Let’s look at one example, in which the theoretical spectrum of the peptide KGHHEAELKPLAQSHATK (+2) had 18 matches with an experimental spectrum. This experimental spectrum was searched against a database, and the number of peaks matching each spectrum in the database was recorded.
You can see that the distribution of matched peaks (the total area under the line should be 1) is shifted to the right compared with the theoretical distribution for random peak matching (computed with the same number of peaks and the same range as the experimental spectrum).
This means that a real spectrum is more likely to match the theoretical spectra derived from the database than pure random matching would predict.
The reason is that peptides are composed of amino acids with distinct molecular weights, so the bins in an MS/MS spectrum are not equally likely to match peaks in an experimental spectrum. In fact, the smaller the m/z, the more discrete the possible m/z values are, because fewer amino acid combinations can produce them. To correct for this in the P-value calculation, you can shift the curve by 2 matches in this case.
P-value distribution of random matching follows a Poisson distribution
The Poisson distribution is the probability distribution of the number of events occurring in some fixed amount of time. The smaller the average number of events, the sharper the peak of the distribution. The Poisson distribution is also defined only for non-negative counts, as there is no such thing as a negative number of events. The number of peaks (events) matching database spectra appears to follow this type of distribution for many peptides (but not all peptides, of course!). OMSSA uses the Poisson distribution for P-value modeling.
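To make the Poisson idea concrete, here is a minimal sketch of turning a number of matched peaks into a P-value. It is not OMSSA's actual model. The mean of 3.2 used below is an assumption for illustration only: it is the number of matches expected by chance in the earlier toy example (60 × 32 / 600), and the function names are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

def poisson_p_value(observed, mean, extra_terms=200):
    """P(X >= observed): sum the upper tail until the terms are negligible."""
    return sum(poisson_pmf(k, mean)
               for k in range(observed, observed + extra_terms))

mean_matches = 60 * 32 / 600   # assumed chance expectation, toy example only
for matches in (5, 10, 18):
    print(matches, poisson_p_value(matches, mean_matches))
```

Summing the tail directly, rather than computing 1 minus the lower part, keeps very small P-values from being lost to floating-point rounding.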
However, calculating a P-value strictly from the number of matched peaks is obviously not the best method. An MS/MS spectrum has many other properties that provide useful information for identification: peak intensity, the intensity distribution, and the presence of both ions of a b/y pair are examples. Good matches often come with matches to intense peaks. Mass spec search programs (e.g. X! Tandem) take these properties into account to generate scores, and the probability distribution of those scores is then used to calculate P-values. There is an excellent PowerPoint presentation on how the hyperscore is calculated in X! Tandem.
I want to point out that each search program calculates its P-value distribution using its own formula. What is critical is how each formula handles the right tail of the distribution, because that is where the truly matching spectra fall.
Q- & PEP-value
How about the Q-value? The Q-value is defined as the minimum false discovery rate (FDR) at which a given PSM is accepted. If you search MS/MS spectra against a decoy database, you know that all PSMs are incorrect. If you search against the true (target) database, you get a mixture of correct and wrong hits.
As you can see in the graph above, you can estimate the number of falsely identified hits in the blue line from the number of hits in the decoy database search, shown in red. For example, if you draw a line at some score threshold and see only one hit from the decoy search and 99 hits from the true search above it, you get about 1% FDR. A Q-value is associated with each PSM, meaning every PSM has its own Q-value. A Q-value of 0.05 means that about 1 in 20 of the PSMs ranked at or above this one are likely wrong.
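The threshold-and-count procedure above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes that higher scores are better, that the target and decoy databases are the same size and searched separately, and that the FDR at a threshold is simply decoy hits divided by target hits. The function name `q_values` and the example scores are hypothetical.

```python
def q_values(target_scores, decoy_scores):
    """Return (score, q_value) for each target PSM, best score first."""
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)

    # FDR at each target score: decoys at or above it / targets at or above it.
    fdrs = []
    d = 0
    for rank, score in enumerate(targets, start=1):
        while d < len(decoys) and decoys[d] >= score:
            d += 1
        fdrs.append(min(1.0, d / rank))

    # Q-value: the smallest FDR at this threshold or any more permissive one.
    q = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return list(zip(targets, q))

# Made-up scores, for illustration only.
example = q_values([9.1, 8.4, 7.7, 6.0, 5.2, 4.1], [6.5, 4.5, 3.0])
for score, q in example:
    print(score, round(q, 3))
```

In this toy run the PSM scoring 6.0 has an FDR of 1/4 at its own threshold, but its Q-value is 0.2, because lowering the threshold to 5.2 admits one more target hit without admitting another decoy. That is what "minimum FDR" means in practice.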
Posterior error probability (PEP) is defined as the local probability of error for a PSM with a given score, calculated from the ratio of hits from the decoy and true database searches at that score. In the figure above, it is the ratio of the height of the red line to the combined height of the red and blue lines at a given score. For example, at the score that gives an equal number of matches in both the decoy and true databases, the PEP will be 0.5, meaning that at this score the chance of being incorrect is 50%. This concept is especially important when you find a PSM that was accepted at 5% FDR but whose PEP is 20%.
There is an excellent review of these concepts by Käll, Noble, et al., published in 2007, that is easy to read and understand.