In this notebook, we identify de novo gene sets with mutually exclusive alterations. To do so, we cluster genes based on the results of the pairwise mutual exclusivity analysis. The resulting clusters are tested with the groupwise mutual exclusivity test.
import sys
sys.path.append("../lib")
%matplotlib inline
import matplotlib.pyplot as plt
import networkx
import numpy
import scipy.sparse
import scipy.stats
import discover
import corclust
import nbsupport.io
import nbsupport.plots
import nbsupport.tcga
from collections import Counter
from itertools import combinations, imap, product
from nbsupport.stats import fdr
from pandas import match
We use the results of the pairwise mutual exclusivity analysis obtained in the notebook Pairwise analysis.
dataFile = "../data/tcga/tcga-pancan12.h5"
result_mutex = nbsupport.io.load_pairwise_result(dataFile, "/results/mutex")
events = nbsupport.io.load_discover_matrix(dataFile, "/models/combined")
To identify candidate gene sets to test for groupwise mutual exclusivity, we use the overlapping correlation clustering algorithm described by Bonchi et al.
Bonchi, F., Gionis, A. & Ukkonen, A. Overlapping correlation clustering. Knowledge and information systems 35, 1–32 (2013), doi:10.1007/s10115-012-0522-9.
This algorithms clusters the nodes (i.e. genes in our case) of a graph based on the presence/absence and weights of the edges between them. We will describe the way in which we compute the edge weights below.
We compute our edge weights based on two features.
The edge weights are based on the model of Chang et al.
Chang, Y.-T., Leahy, R. M. & Pantazis, D. Modularity-based graph partitioning using conditional expected models. Physical Review E 85, 016109 (2012), doi:10.1103/PhysRevE.85.016109.
Two genes are connected by an edge in the mutual exclusivity graph if they were found mutually exclusive at a maximum FDR of 10%.
with numpy.errstate(invalid="ignore"):
observed_graph = networkx.Graph((numpy.asarray(result_mutex.qvalues) < 0.1).astype(int))
In the mutual exclusivity graph, an edge between two genes can be absent for two reasons. Either the two genes are located on the same chromosome (and thus were not tested), or the gene pair was tested but not found to be mutually exclusive. By specifying a permissible graph, we inform the model about whether edges are missing by design (located on the same chromosome), or missing because of no mutual exclusivity.
permissible_graph = networkx.Graph(numpy.isfinite(numpy.asarray(result_mutex.qvalues)).astype(int))
isolated_nodes = networkx.isolates(observed_graph)
permissible_graph.remove_nodes_from(isolated_nodes)
observed_graph.remove_nodes_from(isolated_nodes)
observed_edges = numpy.array([permissible_graph.edges().index(e) for e in observed_graph.edges_iter()])
We first compute $\Sigma_\mathrm{x}$, the edge weight covariance matrix. Our edge weights will be based on the pairwise mutual exclusivity $P$-values. Adjacent edges—which correspond to pairwise tests where one gene is tested in both—will have a non-zero covariance. Taking into account the mathematics behind the Poisson-Binomial test, this covariance for the edges g1-g2 and g2-g3 can be shown to be:
$\sum_{j=1}^n p_{g1,j} p_{g2,j} p_{g3,j} (1 - p_{g2,j})$
where $p_{i,j}$ is the alteration probability for gene $i$ in tumour $j$.
bg = events.bg
sigma_x = scipy.sparse.lil_matrix((permissible_graph.number_of_edges(),) * 2)
for i, genes in enumerate(permissible_graph.edges_iter()):
sigma_x[i, i] = (bg[list(genes)].prod(0) * (1 - bg[list(genes)].prod(0))).sum()
for i, genes1 in enumerate(imap(set, permissible_graph.edges_iter())):
for j, genes2 in enumerate(imap(set, permissible_graph.edges_iter())):
if i > j and len(genes1 & genes2) > 0:
sigma_x[i, j] = (bg[list(genes1 | genes2)].prod(0) * (1 - bg[list(genes1 & genes2)].prod(0))).sum() / (numpy.sqrt(sigma_x[i, i]) * numpy.sqrt(sigma_x[j, j]))
sigma_x[j, i] = sigma_x[i, j]
for i, genes in enumerate(permissible_graph.edges_iter()):
sigma_x[i, i] = 1
sigma_x = sigma_x.tocsr()
Since the model of Chang et al. assumes normally distributed edge weights, we transform the $P$-values using the Normal quantile function.
i, j = zip(*observed_graph.edges())
edge_weights = scipy.stats.norm.ppf(1 - numpy.asarray(result_mutex.pvalues)[i, j])
edge_weights[numpy.isnan(edge_weights)] = scipy.stats.norm.ppf(1 - numpy.asarray(result_mutex.pvalues)[j, i])[numpy.isnan(edge_weights)]
The cells below estimate the parameters described in the Chang et al. paper.
k = networkx.incidence_matrix(observed_graph).dot(edge_weights)
H = networkx.incidence_matrix(permissible_graph)
$\Sigma_{\mathrm{xk}} = \Sigma_\mathrm{x} \mathrm{H}^T$
sigma_xk = sigma_x.dot(H.T)
$\Sigma_\mathrm{k} = \mathrm{H} \Sigma_\mathrm{x} \mathrm{H}^T$
sigma_k = H.dot(sigma_x).dot(H.T)
$E(\mathrm{x} \mid \mathrm{Hx} = \mathrm{k}) = \mu_{\mathrm{x} \mid \mathrm{k}} = \mu_\mathrm{x} + \Sigma_\mathrm{xk} \Sigma_\mathrm{k}^{-1} (\mathrm{k} - \mu_\mathrm{k})$
where
$\Sigma_\mathrm{xk} = \Sigma_\mathrm{x} \mathrm{H}^T$
Moreover, under the null hypothesis, $\mu_\mathrm{x} = 0$. Therefore, the above expression reduces to:
$E(\mathrm{x} \mid \mathrm{Hx} = \mathrm{k}) = \Sigma_\mathrm{xk} \Sigma_\mathrm{k}^{-1} \mathrm{k}$
exp = sigma_xk.dot(scipy.sparse.linalg.inv(sigma_k)).dot(k)
To get the variance of the edge weight distributions, we estimate (part of) the conditional covariance matrix:
$\Sigma_{\mathrm{x} \mid \mathrm{k}} = \Sigma_\mathrm{x} - \Sigma_\mathrm{xk} \Sigma_\mathrm{k}^{-1} \Sigma_\mathrm{kx}$
sigma_k_inv = scipy.sparse.linalg.inv(sigma_k)
edge_variances = numpy.array([
sigma_x[i, i] - sigma_xk[i].dot(sigma_k_inv).dot(sigma_xk[i].T)[0, 0]
for i in xrange(permissible_graph.number_of_edges())])
From the computations in the previous section, we obtained—for each edge in the mutual exclusivity graph—an expected weight and a corresponding variance. We also have observed weights for all edges: the quantile-transformed $P$-values. The weights that we will use for the correlation clustering will now be the probability that the observed weight is greater than the expected weight, based on the Normal distribution.
weight_matrix = numpy.zeros(result_mutex.qvalues.shape)
for (i, j), mean, var, obs in zip(observed_graph.edges_iter(),
exp[observed_edges],
edge_variances[observed_edges],
edge_weights):
diff_dist = scipy.stats.norm(obs - mean, numpy.sqrt(1 + var))
weight = diff_dist.sf(0)
weight_matrix[i, j] = weight
weight_matrix[j, i] = weight
for i, j in product(observed_graph.nodes_iter(), repeat=2):
if i != j and result_mutex.qvalues.index[i].split("_")[0] == result_mutex.qvalues.columns[j].split("_")[0]:
weight_matrix[i, j] = 1
Since genes located close to each other tend to have similar copy number profiles, they will also be mutually exclusive with similar sets of genes. This may lead to many clusters that are essentially equivalent, just with a different gene from the same genomic segment. To avoid this, we group genes if they are located in the same recurrently altered copy number segment.
segments = {}
for eventType in ["amp", "del"]:
peaks = nbsupport.tcga.read_gistic_output("../data/tcga/%s_genes.conf_95.pancan12.txt" % eventType)
segments.update({ gene.strip("[]"): "/".join([seg, eventType]) for seg in peaks for gene in peaks[seg] })
groups = networkx.Graph()
groups.add_nodes_from(events.rownames[observed_graph.nodes()])
for gene in events.rownames:
if not gene.endswith("_mut"):
groups.add_edge(gene, segments[gene.split("_")[0]])
for gene1, gene2 in combinations(events.rownames, 2):
if gene1.split("_")[0] == gene2.split("_")[0]:
groups.add_edge(gene1, gene2)
group_items = [
x
for x
in [numpy.intersect1d(match(list(c.intersection(events.rownames)), events.rownames), observed_graph.nodes())
for c in networkx.connected_components(groups)]
if len(x) > 0]
grouped_weight_matrix = numpy.zeros((len(group_items),) * 2)
for i, group1 in enumerate(group_items):
for j, group2 in enumerate(group_items):
grouped_weight_matrix[i, j] = weight_matrix[group1[:, numpy.newaxis], group2].mean()
The clustering algorithm of Bonchi et al. is an approximate algorithm that is sensitive to the initial solution with which the algorithm is started. For this reason, we run the algorithm 100 times with random initial solutions and consider all gene clusters that are found more than two times.
numpy.random.seed(1234)
num_clusters = 50
clusters = Counter()
for i in xrange(100):
result, cost = corclust.multiCorrClust(grouped_weight_matrix, num_clusters, 5, silent=True)
clusters.update([tuple(sorted(i
for i, labels
in enumerate(map(lambda x: set(x.nonzero()[0]), result))
if j in labels))
for j in xrange(num_clusters)])
gene_sets = [[events.rownames[group_items[i]].astype(str) for i in group]
for group, count
in clusters.iteritems()
if count > 2]
We test the gene sets obtained in the previous section, and show alteration plots for those groups that are found mutually exclusive at a maximum FDR of 1%.
def filter_gene_set(genes):
mut = genes[numpy.char.endswith(genes, "_mut")]
cn = genes[~numpy.char.endswith(genes, "_mut")]
if len(cn) > 0:
cn = cn[:1]
return numpy.concatenate([mut, cn])
pValues = numpy.array(
[discover.groupwise_discover_test(
events[numpy.concatenate(map(filter_gene_set, genes))], "impurity")
for genes in gene_sets])
Note that in the mutual exclusivity plots below, some genes have co-occurring rather than mutually exclusive alterations. These genes are located in the same copy numer segment and therefore have correlated alteration profiles.
order = pValues.argsort()
sig_gene_sets = order[pValues[order] < 0.01]
for i in sig_gene_sets:
nbsupport.plots.event_plot(events[numpy.concatenate(gene_sets[i])])
plt.title("P = %.2g" % pValues[i])
plt.show()