Confidence Estimation for Rankprop

July 2019 - Present

Rankprop is a ranking algorithm that exploits global network structure of similarity relationships among proteins in a database by performing network propagation on a protein similarity network with weighted edges. It is similar to the PageRank algorithm, which was the first algorithm used in Google search engine.

More details about Rankprop can be found here

Rankprop was developed for the task of detecting homology proteins from a protein database given a query protein. This task usually invovles two steps:

  1. Generating a high-quality protein similarity ranking, where the top of the ranking contains proteins that are similar to the query protein.
  2. Finds a cutoff point between positive and negative homology predictions.

Rankprop effectively solved the 1st step of the problem by generating superior protein similarity rankings compared to other methods such as BLAST. However, for the 2nd step, there wasn’t a clear solution that Rankprop or network propagation can offer.

In this research project, we applied the knockoff filter to compute a confidence estimation, q-value, for all the proteins within the Rankprop ranking. Using q-values, we can select homology prediction cutoffs that meets desired false discovery rate (FDR), and therefore address Rankprop’s cutoff selection problem in an error-controlled fasion.

Knockoff filter is statistical tool proposed by Rina Foygel Barber and Emmanuel Candès in 2015, which controls the False Discovery Rate (FDR) in a variable selection task. The knockoff FDR control invovles two steps:

  1. Generating a set of “knockoff variables” based on the original variables.
  2. Use the knockoff variables as a negative control to identify those trully important original variables, by computing q-values.
The idea of original variables vs. knockoff variables

More details about the knockoff filter can be found here

In order to apply the knockoff filter on Rankprop, we proposed a novel method that generates knockoff for network data. In this project, we applied this method to generate “knockoff” protein similarity network given the original one.

Conceptual representation of generating network knockoff

Using either the original network or knockoff network, we can generate a Rankprop ranking. We call the ranking generated with original network the original ranking and the ranking generated from knokocff network the knockoff ranking.
Each protein within the original ranking would then “compete” with their knockoff counterpart, in terms of the Rankprop score. In principle, those real homologies would stand out from the “competition”, and therefore have a low q-value. Lastly, using the computed q-value, we can find FDR-controlled cutoff for protein homology prediction.

Conceptual representation of generating network knockoff

Related Links: