Pairwise analysis

In this notebook, we perform pairwise co-occurrence and mutual exclusivity analyses of PanCan12 data using the DISCOVER test.

In [1]:
import sys
sys.path.append("../lib")
In [2]:
import nbsupport.io
import nbsupport.tcga
import nbsupport.util
In [3]:
import numpy
import pandas

Data

The code below assumes that genomic data for the PANCAN12 studies have been downloaded from Firehose and imported into an HDF5 file. These steps are performed in the notebook Download PanCan12 data.

In [4]:
dataFile = "../data/tcga/tcga-pancan12.h5"

Below we collect the copy number and mutation matrices for the 12 PANCAN studies, and combine them into one combined matrix. Tumours are only included if both copy number and mutation data are available.

Mutations

In [5]:
mut = [pandas.read_hdf(dataFile, "/data/{}/mut".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]
In [6]:
cancerTypesMut = pandas.Series(
    numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in mut]),
    numpy.concatenate([matrix.columns for matrix in mut]))
In [7]:
mut = pandas.concat(mut, 1)
In [8]:
mut.fillna(0, inplace=True)
In [9]:
mut.columns = mut.columns.str.slice(0, 15)
cancerTypesMut.index = cancerTypesMut.index.str.slice(0, 15)

Copy number

In [10]:
cn = [pandas.read_hdf(dataFile, "/data/{}/cn".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]
In [11]:
cancerTypesCN = pandas.Series(
    numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in cn]),
    numpy.concatenate([matrix.columns for matrix in cn]))
In [12]:
cn = pandas.concat(cn, 1)
In [13]:
cn.columns = cn.columns.str.slice(0, 15)
cancerTypesCN.index = cancerTypesCN.index.str.slice(0, 15)
In [14]:
cn, cancerTypesCN, mut, cancerTypesMut = nbsupport.util.align_columns(
    cn, pandas.DataFrame(cancerTypesCN).T,
    mut, pandas.DataFrame(cancerTypesMut).T)
In [15]:
cancerTypesCN = cancerTypesCN.iloc[0]
cancerTypesMut = cancerTypesMut.iloc[0]
In [16]:
mut.shape
Out[16]:
(18271, 3386)
In [17]:
cn.shape
Out[17]:
(24174, 3386)

Background estimation

Background alteration probability matrices are estimated per alteration type (i.e. copy number gain and loss, and mutation) separately, and based on the alteration matrices with all genes. In estimating these background matrices, we stratify for cancer type.

In [18]:
from discover.util import disableStackLimit
disableStackLimit()
In [19]:
import discover
In [20]:
mutations = discover.DiscoverMatrix(mut, strata=cancerTypesMut)
In [21]:
gains = discover.DiscoverMatrix(cn == 2, strata=cancerTypesCN)
In [22]:
losses = discover.DiscoverMatrix(cn == -2, strata=cancerTypesCN)
In [23]:
nbsupport.io.save_discover_matrix(dataFile, "/models/mut", mutations)
nbsupport.io.save_discover_matrix(dataFile, "/models/gains", gains)
nbsupport.io.save_discover_matrix(dataFile, "/models/losses", losses)

Analysis

In [24]:
genes = pandas.read_table("../data/tcga/selected-genes.txt")
In [25]:
def extract_submatrix(x, rownames, suffix):
    result = x[numpy.in1d(x.rownames, rownames)]
    result.rownames = numpy.char.add(result.rownames.astype(str), "_" + suffix)
    return result

We make an alteration probability matrix by combining the relevant rows of the alteration-type specific matrices.

In [26]:
events = discover.row_stack([
        extract_submatrix(gains, genes.gene[genes.type == "gain"], "gain"),
        extract_submatrix(losses, genes.gene[genes.type == "loss"], "loss"),
        extract_submatrix(mutations, genes.gene[genes.type == "mut"], "mut")])

The chromosomes on which the genes are located are needed so that we can test pairs of genes only if they are located on different chromosomes.

In [27]:
chrom = pandas.Series(
    genes.chrom[pandas.match(numpy.char.partition(events.rownames.astype(str), "_")[:, 0], genes.gene)].values,
    events.rownames)
In [28]:
order_by_chrom = chrom.values.argsort()
In [29]:
events = events[order_by_chrom]
In [30]:
chrom = chrom[order_by_chrom]
In [31]:
nbsupport.io.save_discover_matrix(dataFile, "/models/combined", events)

We first test for mutual exclusivity.

In [32]:
result_mutex = discover.pairwise_discover_test(events, chrom)
In [33]:
result_mutex
Out[33]:
Pairwise DISCOVER mutual exclusivity test
alternative hypothesis: observed overlap is less than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 0.919293815154
number of significant pairs at a maximum FDR of 0.01: 183
In [34]:
nbsupport.io.save_pairwise_result(dataFile, "/results/mutex", result_mutex)

And then for co-occurrence.

In [35]:
result_cooc = discover.pairwise_discover_test(events, chrom, "greater")
In [36]:
result_cooc
Out[36]:
Pairwise DISCOVER co-occurrence test
alternative hypothesis: observed overlap is greater than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 1.0
number of significant pairs at a maximum FDR of 0.01: 0
In [37]:
nbsupport.io.save_pairwise_result(dataFile, "/results/cooc", result_cooc)

No pairs are significant at a maximum FDR of 1%, but as can be seen relaxing the significance threshold to 3% allows us to recover a co-occurrence of TP53 mutation and MYC gain.

In [38]:
result_cooc.significant_pairs(1).sort_values("pvalue").head(10)
Out[38]:
gene1 gene2 pvalue qvalue
53164 TP53_mut MYC_gain 8.454899e-07 0.029552
48749 WWOX_loss CASP8_mut 5.059779e-06 0.091703
53009 TP53_mut CCNE1_gain 4.793359e-05 0.601844
48760 WWOX_loss DNMT3A_mut 1.440971e-04 1.000000
27630 LRRK2_mut NFATC1_loss 1.687369e-04 1.000000
34655 IRS2_gain LPP_mut 3.442378e-04 1.000000
72997 ZNF300_mut AKD1_mut 3.902306e-04 1.000000
53079 TP53_mut MECOM_gain 4.585426e-04 1.000000
7525 RERE_loss FBXW7_mut 5.299846e-04 1.000000
48819 WWOX_loss FBXW7_mut 5.898698e-04 1.000000

Binomial test

Here, we repeat the co-occurrence and mutual exclusivity analyses using the Binomial test.

In [39]:
def estimate_binomial_background(events, strata=None):
    if strata is None:
        return numpy.repeat(events.mean(1)[:, numpy.newaxis], events.shape[1], 1)
    else:
        strata = numpy.asarray(strata)
        result = numpy.empty(events.shape)
        for x in numpy.unique(strata):
            result[:, strata == x] = estimate_binomial_background(events[:, strata == x])
        return result
In [40]:
events_binom = discover.DiscoverMatrix(
    pandas.DataFrame(events.events),
    pandas.DataFrame(estimate_binomial_background(events.events, strata=cancerTypesMut)))
In [41]:
result_mutex_binom = discover.pairwise_discover_test(events_binom, chrom)
In [42]:
result_mutex_binom
Out[42]:
Pairwise DISCOVER mutual exclusivity test
alternative hypothesis: observed overlap is less than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 1.0
number of significant pairs at a maximum FDR of 0.01: 3
In [43]:
result_cooc_binom = discover.pairwise_discover_test(events_binom, chrom, "greater")
In [44]:
result_cooc_binom
Out[44]:
Pairwise DISCOVER co-occurrence test
alternative hypothesis: observed overlap is greater than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 0.57484040559
number of significant pairs at a maximum FDR of 0.01: 21627