An introduction to DISCOVER

Sander Canisius

2021-07-27

An introduction to DISCOVER

Introduction

DISCOVER is a novel statistical test for detecting co-occurrence and mutual exclusivity in cancer genomics data. Unlike traditional approaches used for these tasks such as Fisher’s exact test, DISCOVER is based on a null model that takes into account the overall tumour-specific alteration rates when deciding whether alterations co-occur more or less often than expected by chance. This improved null model prevents spurious associations in co-occurrence detection, and increases the statistical power to detect mutual exclusivities.

This vignette introduces the DISCOVER R package for mutual exclusivity and co-occurrence analysis. Two variants of the DISCOVER test are implemented by this package. The first variant tests pairs of genes for either co-occurrence or mutual exclusivity. The second is a mutual exclusivity test for gene sets larger than two. We will illustrate the use of both tests by performing a mutual exclusivity and co-occurrence analysis of somatic mutations in the TCGA breast cancer samples.

Set up

First, we load the package and the breast cancer mutation data.

library(discover)

data("BRCA.mut")

A small fragment of the mutation matrix is shown below. It takes the form of a binary matrix, i.e. only 0 and 1 are allowed as values.

BRCA.mut[1:5, 1:5]
##        TCGA-A1-A0SB-01A-11D-A142-09 TCGA-A1-A0SD-01A-11D-A10Y-09
## A1CF                              0                            0
## A2M                               0                            0
## A2ML1                             0                            0
## A4GALT                            0                            0
## A4GNT                             0                            0
##        TCGA-A1-A0SE-01A-11D-A099-09 TCGA-A1-A0SF-01A-11D-A142-09
## A1CF                              0                            0
## A2M                               0                            0
## A2ML1                             0                            0
## A4GALT                            0                            0
## A4GNT                             0                            0
##        TCGA-A1-A0SG-01A-11D-A142-09
## A1CF                              0
## A2M                               0
## A2ML1                             0
## A4GALT                            0
## A4GNT                             0

An important ingredient of the DISCOVER test is the estimation of a background matrix. This is how tumour-specific alteration rates can be used by the test. To estimate this background matrix, a discover.matrix object is created and passed the mutation matrix. Depending on the size of the matrix this might take some time. This estimation step only needs to be performed once.

events <- discover.matrix(BRCA.mut)

Note that for estimating this background matrix, it is important that the full mutation matrix is provided. Even if only a subset of genes will subsequently be used in the analysis, a whole-genome view of the mutations is required for this first step. Indeed, for this example we will only look for mutual exclusivity between genes that are mutated in at least 25 tumours.

subset <- rowSums(BRCA.mut) > 25

Pairwise tests

We can now test for mutual exclusivity between all pairs of genes using the pairwise.discover.test function. Its only required argument is an instance of the discover.matrix class like we created above. In order to only analyse frequently mutated genes, we select the corresponding rows before passing the matrix to the function. Again, depending on the size of the matrix, this function might take some time.

result.mutex <- pairwise.discover.test(events[subset, ])

We can get a quick overview of the results of this analysis simply by printing the result object.

result.mutex
## Pairwise DISCOVER mutual exclusivity test
## alternative hypothesis: observed overlap is less than expected by chance
## FDR estimation method: discrete Benjamini-Hochberg
## 
## number of pairs tested: 3403
## proportion of true null hypotheses: 0.9289582
## number of significant pairs at a maximum FDR of 0.01 : 24

The print method for the pairwise.discover.out class optionally takes an argument with which the false discovery rate (FDR) threshold is selected. By default, a maximum FDR of 1% is used. Below, we get a summary of the test results using a somewhat more liberal threshold of 5%.

print(result.mutex, fdr.threshold=0.05)
## Pairwise DISCOVER mutual exclusivity test
## alternative hypothesis: observed overlap is less than expected by chance
## FDR estimation method: discrete Benjamini-Hochberg
## 
## number of pairs tested: 3403
## proportion of true null hypotheses: 0.9289582
## number of significant pairs at a maximum FDR of 0.05 : 80

To get the pairs of genes which are significantly mutually exclusive, we can use the as.data.frame method. This method, too, takes an optional FDR threshold as argument. Below, we use the default of 1%.

as.data.frame(result.mutex)
##     gene1  gene2      p.value      q.value
## 5    TP53   CDH1 6.352400e-17 5.111111e-16
## 14   TP53  GATA3 5.470907e-15 3.138212e-14
## 21   TP53 MAP3K1 1.206967e-10 8.509531e-10
## 24   TP53 PIK3CA 1.224563e-10 8.509531e-10
## 12 PIK3CA  GATA3 1.149762e-06 2.738010e-05
## 3   GATA3   CDH1 1.013659e-05 2.693347e-04
## 19  NCOR1 MAP3K1 1.527660e-05 3.835712e-04
## 22    TTN MAP3K1 4.145602e-05 1.111640e-03
## 17   TP53 MAP2K4 4.315501e-05 1.111640e-03
## 11   MUC4  GATA3 6.633657e-05 1.711987e-03
## 13  SYNE1  GATA3 9.833334e-05 2.462258e-03
## 9  MAP3K1  GATA3 1.278237e-04 3.089072e-03
## 7  MAP3K1    DMD 1.392087e-04 3.208025e-03
## 1    TP53 ARID1A 1.707227e-04 3.720349e-03
## 15    TTN  GATA3 1.943877e-04 4.005469e-03
## 4  MAP3K1   CDH1 1.953800e-04 4.005469e-03
## 23  MUC5B   MUC4 2.043568e-04 4.048251e-03
## 2    FAT3   CDH1 2.367769e-04 4.229152e-03
## 16  XIRP2  GATA3 2.429527e-04 4.229152e-03
## 20    NEB MAP3K1 2.431841e-04 4.229152e-03
## 8  PIK3CA   FAT3 3.152131e-04 5.572576e-03
## 10  MUC16  GATA3 4.622582e-04 8.737315e-03
## 18  MUC16 MAP3K1 4.783099e-04 8.737315e-03
## 6  MAP3K1  CSMD1 5.207149e-04 9.568223e-03

Groupwise test

In the above results, we can identify a set of four genes—TP53, CDH1, GATA3, and MAP3K1—for which all pairwise combinations are found mutually exclusive. We can use the groupwise.discover.test to asses whether this set of genes is also mutually exclusive as a group. This we can do by simply passing the subset of the discover.matrix that corresponds to the given genes.

genes <- c("TP53", "CDH1", "GATA3", "MAP3K1")
groupwise.discover.test(events[genes, ])
## Groupwise DISCOVER mutual exclusivity test
## 
## method: impurity
## P = 7.821035e-22
plot(events[genes, ])

Session info

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C           LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] discover_0.9.4
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.1.0    magrittr_2.0.1    htmltools_0.5.1.1 tools_4.1.0      
##  [5] yaml_2.2.1        stringi_1.7.3     rmarkdown_2.9     highr_0.9        
##  [9] knitr_1.33        stringr_1.4.0     digest_0.6.27     xfun_0.24        
## [13] rlang_0.4.11      evaluate_0.14