In this notebook, we compare the performance of various group-based mutual exclusivity tests on simulated data.
import sys
sys.path.append("../lib")
import numpy
import pandas
import nbsupport.simulations
import nbsupport.plots
import nbsupport.roc
import nbsupport.tcga
import nbsupport.util
We compared three variants of the group-based DISCOVER test to a large number of recently published methods: MEMo, muex, MEGSA, mutex, TiMEx, and CoMEt. Some of these methods (MEMo, MEGSA, mutex, CoMEt) combine a statistical test for mutual exclusivity with an algorithm to select groups of genes to test. In this comparison, we already have preselected groups of genes, and so only consider the mutual exclusivity tests.
MEMo is described in the following paper.
Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res 22, 398–406 (2012), doi:10.1101/gr.125567.111.
The MEMo permutation test uses information about which genes have been found recurrently altered. Since our simulated data are based on the TCGA breast cancer copy number gains, we use the recurrent gains of TCGA's GISTIC analysis.
amp_genes = numpy.concatenate(nbsupport.tcga.read_gistic_output("../data/tcga/amp_genes.conf_99.BRCA.txt").values())
from nbsupport import memo
The muex method is described in the following paper.
Szczurek, E. & Beerenwinkel, N. Modeling mutual exclusivity of cancer mutations. PLoS Comput Biol 10, e1003503 (2014), doi:10.1371/journal.pcbi.1003503.
For the comparison described in the DISCOVER paper, we used the R package available from https://www1.ethz.ch/bsse/cbg/software/muex. At the time of submission, this web page was no longer accessible. The following import will use the muex R package (via rpy2) if it has already been installed, and resort to a Python-based reimplementation otherwise.
from nbsupport import muex
muex.__version__
The MEGSA method is described in the following paper.
Hua, X. et al. MEGSA: A Powerful and Flexible Framework for Analyzing Mutual Exclusivity of Tumor Mutations. Am J Hum Genet 98, 442-455 (2016), doi:10.1016/j.ajhg.2015.12.021.
from nbsupport import megsa
We use the R implementation made available by the authors, which can be downloaded from http://dceg.cancer.gov/tools/analysis/megsa/. The following command downloads the MEGSA source code, and imports it via rpy2.
megsa.install("../data/downloads")
The mutex method is described in the following paper.
Babur, Ö. et al. Systematic identification of cancer driving signaling pathways based on mutual exclusivity of genomic alterations. Genome Biol 16, (2015), doi:10.1186/s13059-015-0612-6.
A Java implementation of the mutex method is available from https://github.com/pathwayanddataanalysis/mutex. The following module provides a Python implementation of the statistical test, which we have verified to give equivalent results to the Java version.
from nbsupport import mutex
The TiMEx method is described in the following paper.
Constantinescu, S. et al. TiMEx: a waiting time model for mutually exclusive cancer alterations. Bioinformatics 32, 968-975 (2016), doi:10.1093/bioinformatics/btv400.
An R implementation of TiMEx is available from https://github.com/cbg-ethz/TiMEx. The following module imports this R package via rpy2. For it to work, the TiMEx R package needs to be installed first.
from nbsupport import timex
timex.__version__
The CoMEt method is described in the following paper.
Leiserson, M.D. et al. CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genom Biol 16, (2015), doi:10.1186/s13059-015-0700-7.
A software implementation of CoMEt is available from https://github.com/raphael-group/comet. Unfortunately, the computational complexity of the CoMEt test is so high that for some of the gene sets in this comparison the test will never finish. To deal with this, we use a patched version of the original implementation that interrupts the CoMEt exact test after 1 minute and returns the $P$ value obtained with the binomial approximation. The original CoMEt implementation uses a set of heuristics to decide between the exact test and the approximation, but we found those to be inadequate for this comparison.
from nbsupport import comet
Another measure we needed to take to deal with CoMEt's computational complexity is that we used IPython's parallel computing support. For this to work, an IPython cluster needs to be started. The following code connects to such a cluster, but will only succeed if a cluster is found running on the local machine.
import ipyparallel
import os
client = ipyparallel.Client()
lview = client.load_balanced_view()
add_to_search_path_cmd = "__import__('sys').path.append('%s')" % os.path.abspath(os.path.join(nbsupport.__path__[0], ".."))
client[:].execute(add_to_search_path_cmd, block=True).wait()
The group-based DISCOVER test can be used with three different statistics: coverage, exclusivity, and impurity. A comparison of these statistics is presented below.
import discover
In this analysis, we generate simulated alteration matrices based on the TCGA breast cancer data. To these matrices we add groups of mutually exclusive genes, and also identify groups of independently altered genes of similar size and alteration frequency to serve as negative cases. We generate 10 such simulated data sets and plot averaged ROC, calibration, and sensitivity curves for each of the methods.
dataFile = "../data/tcga/tcga-pancan12.h5"
cn = pandas.read_hdf(dataFile, "/data/BRCA/cn")
gains = cn == 2
nbsupport.util.set_random_seed()
pvalues = []
for i in xrange(10):
print i
events = nbsupport.simulations.generate_independent_alterations(gains.sum(1), gains.sum(0))
events, pos_groups, neg_groups = nbsupport.simulations.add_mutex_groups(events)
groups = numpy.concatenate([neg_groups, pos_groups])
# DISCOVER tests
print " DISCOVER"
event_matrix = discover.DiscoverMatrix(pandas.DataFrame(events))
pvalues_discover_coverage = [discover.groupwise_discover_test(event_matrix[group], "coverage") for group in groups]
pvalues_discover_impurity = [discover.groupwise_discover_test(event_matrix[group], "impurity") for group in groups]
pvalues_discover_exclusivity = [discover.groupwise_discover_test(event_matrix[group], "exclusivity") for group in groups]
# MEMo
print " MEMo"
selected_genes = numpy.unique(numpy.concatenate([
gains.index.get_indexer(amp_genes),
numpy.concatenate(pos_groups),
numpy.concatenate(neg_groups)]))
pvalues_memo = memo.memo_test(events, selected_genes, groups)
# muex
print " muex"
pvalues_muex = [muex.muex(events[group]).pvalue for group in groups]
# MEGSA
print " MEGSA"
pvalues_megsa = [megsa.megsa(events[group]) for group in groups]
# mutex
print " mutex"
pvalues_mutex = [mutex.mutex(events[group]) for group in groups]
# TiMEx
print " TiMEx"
pvalues_timex = timex.timex_test(events, groups)
# comet
print " comet"
pvalues_comet = lview.map(comet.comet_test, [events[group] for group in groups], block=True)
pvalues.append({
"cov": pvalues_discover_coverage,
"imp": pvalues_discover_impurity,
"exc": pvalues_discover_exclusivity,
"memo": pvalues_memo,
"muex": pvalues_muex,
"megsa": pvalues_megsa,
"mutex": pvalues_mutex,
"timex": pvalues_timex,
"comet": pvalues_comet
})
%matplotlib inline
labels = numpy.repeat([0, 1], [100, 100])
method_params_discover = [
("cov", "#d21716", "Coverage"),
("exc", "#3979ff", "Exclusivity"),
("imp", "#2a3441", "Impurity")
]
method_params = [
("imp", "#2a3441", "DISCOVER"),
("muex", "#edb132", "muex"),
("memo", "#33a02c", "MEMo"),
("megsa", "#e31a1c", "MEGSA"),
("mutex", "#377eb8", "mutex"),
("timex", "#b15928", "TiMEx"),
("comet", "#6a3d9a", "CoMEt")
]
First, we compare the three variants of the DISCOVER test.
nbsupport.plots.plot_average_roc_curves(pvalues, labels, method_params_discover)
Then we compare the impurity-based DISCOVER test with the other methods.
nbsupport.plots.plot_average_roc_curves(pvalues, labels, method_params)
A calibration curve plots the significance level $\alpha$ against the false positive rate. For a statistical test, the significance level should approximate the false positive rate. In such a case, the calibration curve would be a diagonal line.
nbsupport.plots.plot_average_calibration_curves(pvalues, labels, method_params_discover)
nbsupport.plots.plot_average_calibration_curves(pvalues, labels, method_params)
A sensitivity curve plots the significance level $\alpha$ against the true positive rate. Since we are most interested in the sensitivity at low values of $\alpha$, the x-axis is plotted on a log-scale.
nbsupport.plots.plot_average_sensitivity_curves(pvalues, labels, method_params_discover)
nbsupport.plots.plot_average_sensitivity_curves(pvalues, labels, method_params)