In this notebook, we perform pairwise co-occurrence and mutual exclusivity analyses of PanCan12 data using the DISCOVER test.
import sys
sys.path.append("../lib")
import nbsupport.io
import nbsupport.tcga
import nbsupport.util
import numpy
import pandas
The code below assumes that genomic data for the PANCAN12 studies have been downloaded from Firehose and imported into an HDF5 file. These steps are performed in the notebook Download PanCan12 data.
dataFile = "../data/tcga/tcga-pancan12.h5"
Below we collect the copy number and mutation matrices for the 12 PANCAN studies, and combine them into one combined matrix. Tumours are only included if both copy number and mutation data are available.
mut = [pandas.read_hdf(dataFile, "/data/{}/mut".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]
cancerTypesMut = pandas.Series(
numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in mut]),
numpy.concatenate([matrix.columns for matrix in mut]))
mut = pandas.concat(mut, 1)
mut.fillna(0, inplace=True)
mut.columns = mut.columns.str.slice(0, 15)
cancerTypesMut.index = cancerTypesMut.index.str.slice(0, 15)
cn = [pandas.read_hdf(dataFile, "/data/{}/cn".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]
cancerTypesCN = pandas.Series(
numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in cn]),
numpy.concatenate([matrix.columns for matrix in cn]))
cn = pandas.concat(cn, 1)
cn.columns = cn.columns.str.slice(0, 15)
cancerTypesCN.index = cancerTypesCN.index.str.slice(0, 15)
cn, cancerTypesCN, mut, cancerTypesMut = nbsupport.util.align_columns(
cn, pandas.DataFrame(cancerTypesCN).T,
mut, pandas.DataFrame(cancerTypesMut).T)
cancerTypesCN = cancerTypesCN.iloc[0]
cancerTypesMut = cancerTypesMut.iloc[0]
mut.shape
cn.shape
Background alteration probability matrices are estimated per alteration type (i.e. copy number gain and loss, and mutation) separately, and based on the alteration matrices with all genes. In estimating these background matrices, we stratify for cancer type.
from discover.util import disableStackLimit
disableStackLimit()
import discover
mutations = discover.DiscoverMatrix(mut, strata=cancerTypesMut)
gains = discover.DiscoverMatrix(cn == 2, strata=cancerTypesCN)
losses = discover.DiscoverMatrix(cn == -2, strata=cancerTypesCN)
nbsupport.io.save_discover_matrix(dataFile, "/models/mut", mutations)
nbsupport.io.save_discover_matrix(dataFile, "/models/gains", gains)
nbsupport.io.save_discover_matrix(dataFile, "/models/losses", losses)
genes = pandas.read_table("../data/tcga/selected-genes.txt")
def extract_submatrix(x, rownames, suffix):
result = x[numpy.in1d(x.rownames, rownames)]
result.rownames = numpy.char.add(result.rownames.astype(str), "_" + suffix)
return result
We make an alteration probability matrix by combining the relevant rows of the alteration-type specific matrices.
events = discover.row_stack([
extract_submatrix(gains, genes.gene[genes.type == "gain"], "gain"),
extract_submatrix(losses, genes.gene[genes.type == "loss"], "loss"),
extract_submatrix(mutations, genes.gene[genes.type == "mut"], "mut")])
The chromosomes on which the genes are located are needed so that we can test pairs of genes only if they are located on different chromosomes.
chrom = pandas.Series(
genes.chrom[pandas.match(numpy.char.partition(events.rownames.astype(str), "_")[:, 0], genes.gene)].values,
events.rownames)
order_by_chrom = chrom.values.argsort()
events = events[order_by_chrom]
chrom = chrom[order_by_chrom]
nbsupport.io.save_discover_matrix(dataFile, "/models/combined", events)
We first test for mutual exclusivity.
result_mutex = discover.pairwise_discover_test(events, chrom)
result_mutex
nbsupport.io.save_pairwise_result(dataFile, "/results/mutex", result_mutex)
And then for co-occurrence.
result_cooc = discover.pairwise_discover_test(events, chrom, "greater")
result_cooc
nbsupport.io.save_pairwise_result(dataFile, "/results/cooc", result_cooc)
No pairs are significant at a maximum FDR of 1%, but as can be seen relaxing the significance threshold to 3% allows us to recover a co-occurrence of TP53 mutation and MYC gain.
result_cooc.significant_pairs(1).sort_values("pvalue").head(10)
Here, we repeat the co-occurrence and mutual exclusivity analyses using the Binomial test.
def estimate_binomial_background(events, strata=None):
if strata is None:
return numpy.repeat(events.mean(1)[:, numpy.newaxis], events.shape[1], 1)
else:
strata = numpy.asarray(strata)
result = numpy.empty(events.shape)
for x in numpy.unique(strata):
result[:, strata == x] = estimate_binomial_background(events[:, strata == x])
return result
events_binom = discover.DiscoverMatrix(
pandas.DataFrame(events.events),
pandas.DataFrame(estimate_binomial_background(events.events, strata=cancerTypesMut)))
result_mutex_binom = discover.pairwise_discover_test(events_binom, chrom)
result_mutex_binom
result_cooc_binom = discover.pairwise_discover_test(events_binom, chrom, "greater")
result_cooc_binom