An introduction to DISCOVER

Introduction

DISCOVER is a novel statistical test for detecting co-occurrence and mutual exclusivity in cancer genomics data. Unlike traditional approaches used for these tasks such as Fisher’s exact test, DISCOVER is based on a statistical model that takes into account the overall tumour-specific alteration rates when deciding whether alterations co-occur more or less often than expected by chance. This improved model prevents spurious associations in co-occurrence detection, and increases the statistical power to detect mutual exclusivities.

This vignette introduces the DISCOVER package for mutual exclusivity and co-occurrence analysis. Two variants of the DISCOVER test are implemented by this package. The first variant tests pairs of genes for either co-occurrence or mutual exclusivity. The second is a mutual exclusivity test for gene sets larger than two. We will illustrate the use of both tests by performing a mutual exclusivity and co-occurrence analysis of somatic mutations in the TCGA breast cancer samples.

Set up

First, we load the package and the breast cancer mutation data.

In [1]: import discover

In [2]: import discover.datasets

In [3]: mut = discover.datasets.load_dataset("brca_mut")

A small fragment of the mutation matrix is shown below. It takes the form of a binary matrix, i.e. only 0 and 1 are allowed as values.

In [4]: mut.iloc[:5, :5]
Out[4]: 
        TCGA-A1-A0SB-01A-11D-A142-09  ...  TCGA-A1-A0SG-01A-11D-A142-09
A1CF                               0  ...                             0
A2M                                0  ...                             0
A2ML1                              0  ...                             0
A4GALT                             0  ...                             0
A4GNT                              0  ...                             0

[5 rows x 5 columns]

An important ingredient of the DISCOVER test is the estimation of a background matrix. This is how tumour-specific alteration rates can be used by the test. To estimate this background matrix, a DiscoverMatrix object is created and passed the mutation matrix. Depending on the size of the matrix this might take some time. This estimation step only needs to be performed once.

In [5]: events = discover.DiscoverMatrix(mut)

Note that for estimating this background matrix, it is important that the full mutation matrix is provided. Even if only a subset of genes will subsequently be used in the analysis, a whole-genome view of the mutations is required for this first step. Indeed, for this example we will only look for mutual exclusivity between genes that are mutated in at least 25 tumours.

In [6]: subset = mut.sum(1) > 25

Pairwise tests

We can now test for mutual exclusivity between all pairs of genes using the pairwise_discover_test() function. Its only required argument is an instance of the DiscoverMatrix class like we created above. In order to only analyse frequently mutated genes, we select the corresponding rows before passing the matrix to the function. Again, depending on the size of the matrix, this function might take some time.

In [7]: result_mutex = discover.pairwise_discover_test(events[subset])

We can get a quick overview of the results of this analysis simply by printing the result object.

In [8]: result_mutex
Out[8]: 
Pairwise DISCOVER mutual exclusivity test
alternative hypothesis: observed overlap is less than expected by chance
FDR estimation method: discrete Benjamini-Hochberg

number of pairs tested: 3403
proportion of true null hypotheses: 0.9289580077979672
number of significant pairs at a maximum FDR of 0.01: 24

To get the pairs of genes which are significantly mutually exclusive, we can use the significant_pairs() method. This method, takes an optional FDR threshold as argument. Below, we use the default of 1%.

In [9]: result_mutex.significant_pairs()
Out[9]: 
     gene1   gene2        pvalue        qvalue
0   ARID1A    TP53  1.707243e-04  3.720331e-03
1     CDH1    FAT3  2.367711e-04  4.229131e-03
2     CDH1   GATA3  1.013590e-05  2.693327e-04
3     CDH1  MAP3K1  1.953715e-04  4.005451e-03
4     CDH1    TP53  6.351085e-17  5.110747e-16
5    CSMD1  MAP3K1  5.207150e-04  9.568184e-03
6      DMD  MAP3K1  1.392095e-04  3.208008e-03
7     FAT3  PIK3CA  3.152159e-04  5.572551e-03
8    GATA3  MAP3K1  1.278235e-04  3.089057e-03
9    GATA3   MUC16  4.622530e-04  8.737277e-03
10   GATA3    MUC4  6.633671e-05  1.711976e-03
11   GATA3  PIK3CA  1.149757e-06  2.737986e-05
12   GATA3   SYNE1  9.833285e-05  2.462245e-03
13   GATA3    TP53  5.470815e-15  3.138120e-14
14   GATA3     TTN  1.943861e-04  4.005451e-03
15   GATA3   XIRP2  2.429499e-04  4.229131e-03
16  MAP2K4    TP53  4.315364e-05  1.111633e-03
17  MAP3K1   MUC16  4.783076e-04  8.737277e-03
18  MAP3K1   NCOR1  1.527650e-05  3.835688e-04
19  MAP3K1     NEB  2.431826e-04  4.229131e-03
20  MAP3K1    TP53  1.206969e-10  8.509467e-10
21  MAP3K1     TTN  4.145603e-05  1.111633e-03
22    MUC4   MUC5B  2.043576e-04  4.048233e-03
23  PIK3CA    TP53  1.224573e-10  8.509467e-10

Groupwise test

In the above results, we can identify a set of four genes—TP53, CDH1, GATA3, and MAP3K1—for which all pairwise combinations are found mutually exclusive. We can use the groupwise_discover_test() to asses whether this set of genes is also mutually exclusive as a group. This we can do by simply passing the subset of the DiscoverMatrix that corresponds to the given genes.

In [10]: genes = ["TP53", "CDH1", "GATA3", "MAP3K1"]

In [11]: discover.groupwise_discover_test(events[genes])
Out[11]: 7.819622483127188e-22