Pairwise analysis¶

In this notebook, we perform pairwise co-occurrence and mutual exclusivity analyses of PanCan12 data using the DISCOVER test.

import sys
sys.path.append("../lib")

import nbsupport.io
import nbsupport.tcga
import nbsupport.util

import numpy
import pandas

Data¶

The code below assumes that genomic data for the PANCAN12 studies have been downloaded from Firehose and imported into an HDF5 file. These steps are performed in the notebook Download PanCan12 data.

dataFile = "../data/tcga/tcga-pancan12.h5"

Below we collect the copy number and mutation matrices for the 12 PANCAN studies, and combine them into one combined matrix. Tumours are only included if both copy number and mutation data are available.

Mutations¶

mut = [pandas.read_hdf(dataFile, "/data/{}/mut".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]

cancerTypesMut = pandas.Series(
    numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in mut]),
    numpy.concatenate([matrix.columns for matrix in mut]))

mut = pandas.concat(mut, 1)

mut.fillna(0, inplace=True)

mut.columns = mut.columns.str.slice(0, 15)
cancerTypesMut.index = cancerTypesMut.index.str.slice(0, 15)

Copy number¶

cn = [pandas.read_hdf(dataFile, "/data/{}/cn".format(study)) for study in nbsupport.tcga.PANCAN12_STUDIES]

cancerTypesCN = pandas.Series(
    numpy.repeat(nbsupport.tcga.PANCAN12_STUDIES, [matrix.shape[1] for matrix in cn]),
    numpy.concatenate([matrix.columns for matrix in cn]))

cn = pandas.concat(cn, 1)

cn.columns = cn.columns.str.slice(0, 15)
cancerTypesCN.index = cancerTypesCN.index.str.slice(0, 15)

cn, cancerTypesCN, mut, cancerTypesMut = nbsupport.util.align_columns(
    cn, pandas.DataFrame(cancerTypesCN).T,
    mut, pandas.DataFrame(cancerTypesMut).T)

cancerTypesCN = cancerTypesCN.iloc[0]
cancerTypesMut = cancerTypesMut.iloc[0]

mut.shape

(18271, 3386)

cn.shape

(24174, 3386)

Background estimation¶

Background alteration probability matrices are estimated per alteration type (i.e. copy number gain and loss, and mutation) separately, and based on the alteration matrices with all genes. In estimating these background matrices, we stratify for cancer type.

from discover.util import disableStackLimit
disableStackLimit()

import discover

mutations = discover.DiscoverMatrix(mut, strata=cancerTypesMut)

gains = discover.DiscoverMatrix(cn == 2, strata=cancerTypesCN)

losses = discover.DiscoverMatrix(cn == -2, strata=cancerTypesCN)

nbsupport.io.save_discover_matrix(dataFile, "/models/mut", mutations)
nbsupport.io.save_discover_matrix(dataFile, "/models/gains", gains)
nbsupport.io.save_discover_matrix(dataFile, "/models/losses", losses)

Analysis¶

genes = pandas.read_table("../data/tcga/selected-genes.txt")

def extract_submatrix(x, rownames, suffix):
    result = x[numpy.in1d(x.rownames, rownames)]
    result.rownames = numpy.char.add(result.rownames.astype(str), "_" + suffix)
    return result

We make an alteration probability matrix by combining the relevant rows of the alteration-type specific matrices.

events = discover.row_stack([
        extract_submatrix(gains, genes.gene[genes.type == "gain"], "gain"),
        extract_submatrix(losses, genes.gene[genes.type == "loss"], "loss"),
        extract_submatrix(mutations, genes.gene[genes.type == "mut"], "mut")])

The chromosomes on which the genes are located are needed so that we can test pairs of genes only if they are located on different chromosomes.

chrom = pandas.Series(
    genes.chrom[pandas.match(numpy.char.partition(events.rownames.astype(str), "_")[:, 0], genes.gene)].values,
    events.rownames)

order_by_chrom = chrom.values.argsort()

events = events[order_by_chrom]

chrom = chrom[order_by_chrom]

nbsupport.io.save_discover_matrix(dataFile, "/models/combined", events)

We first test for mutual exclusivity.

result_mutex = discover.pairwise_discover_test(events, chrom)

result_mutex

Pairwise DISCOVER mutual exclusivity test
alternative hypothesis: observed overlap is less than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 0.919293815154
number of significant pairs at a maximum FDR of 0.01: 183

nbsupport.io.save_pairwise_result(dataFile, "/results/mutex", result_mutex)

And then for co-occurrence.

result_cooc = discover.pairwise_discover_test(events, chrom, "greater")

result_cooc

Pairwise DISCOVER co-occurrence test
alternative hypothesis: observed overlap is greater than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 1.0
number of significant pairs at a maximum FDR of 0.01: 0

nbsupport.io.save_pairwise_result(dataFile, "/results/cooc", result_cooc)

No pairs are significant at a maximum FDR of 1%, but as can be seen relaxing the significance threshold to 3% allows us to recover a co-occurrence of TP53 mutation and MYC gain.

result_cooc.significant_pairs(1).sort_values("pvalue").head(10)

Binomial test¶

Here, we repeat the co-occurrence and mutual exclusivity analyses using the Binomial test.

def estimate_binomial_background(events, strata=None):
    if strata is None:
        return numpy.repeat(events.mean(1)[:, numpy.newaxis], events.shape[1], 1)
    else:
        strata = numpy.asarray(strata)
        result = numpy.empty(events.shape)
        for x in numpy.unique(strata):
            result[:, strata == x] = estimate_binomial_background(events[:, strata == x])
        return result

events_binom = discover.DiscoverMatrix(
    pandas.DataFrame(events.events),
    pandas.DataFrame(estimate_binomial_background(events.events, strata=cancerTypesMut)))

result_mutex_binom = discover.pairwise_discover_test(events_binom, chrom)

result_mutex_binom

Pairwise DISCOVER mutual exclusivity test
alternative hypothesis: observed overlap is less than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 1.0
number of significant pairs at a maximum FDR of 0.01: 3

result_cooc_binom = discover.pairwise_discover_test(events_binom, chrom, "greater")

result_cooc_binom

Pairwise DISCOVER co-occurrence test
alternative hypothesis: observed overlap is greater than expected by chance

number of pairs tested: 77305
proportion of true null hypotheses: 0.57484040559
number of significant pairs at a maximum FDR of 0.01: 21627

	gene1	gene2	pvalue	qvalue
53164	TP53_mut	MYC_gain	8.454899e-07	0.029552
48749	WWOX_loss	CASP8_mut	5.059779e-06	0.091703
53009	TP53_mut	CCNE1_gain	4.793359e-05	0.601844
48760	WWOX_loss	DNMT3A_mut	1.440971e-04	1.000000
27630	LRRK2_mut	NFATC1_loss	1.687369e-04	1.000000
34655	IRS2_gain	LPP_mut	3.442378e-04	1.000000
72997	ZNF300_mut	AKD1_mut	3.902306e-04	1.000000
53079	TP53_mut	MECOM_gain	4.585426e-04	1.000000
7525	RERE_loss	FBXW7_mut	5.299846e-04	1.000000
48819	WWOX_loss	FBXW7_mut	5.898698e-04	1.000000