Gene selection

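In this notebook, we select the genes used in the pairwise co-occurrence and mutual exclusivity analyses. A gene is selected if it is (1) located in a recurrently altered copy number segment and included in a list of known cancer genes, or (2) included in a list of mutational driver genes. For (1), the code below also keeps segment genes that appear anywhere in the mutational driver file, as well as genes that are the only gene in their segment. For (2), only genes labelled "High Confidence Driver" are kept.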

In [1]:
import sys
sys.path.append("../lib")
In [2]:
import numpy
import pandas
In [3]:
import nbsupport.tcga
In [4]:
segments = {}

# Key each segment by its name plus the alteration type ("loss" or "gain").
# .items() works on both Python 2 and Python 3, unlike .iteritems().
for segment, genes in nbsupport.tcga.read_gistic_output("../data/tcga/del_genes.conf_95.pancan12.txt").items():
    segments["_".join([segment, "loss"])] = genes

for segment, genes in nbsupport.tcga.read_gistic_output("../data/tcga/amp_genes.conf_95.pancan12.txt").items():
    segments["_".join([segment, "gain"])] = genes
In [5]:
cancer_genes = pandas.read_table("../data/tcga/cancer-genes.tsv")
In [6]:
entrez_gene_info = pandas.read_table("../data/entrez/seq_gene.md.gz", usecols=[1, 9, 12], compression="gzip", low_memory=False)
entrez_gene_info = entrez_gene_info[entrez_gene_info.group_label == "GRCh37.p13-Primary Assembly"]
In [7]:
mut_genes = pandas.read_csv("../data/tcga/mutational-drivers.csv")
In [8]:
high_conf_drivers = mut_genes["Gene Symbol"][mut_genes["Putative Driver Category"] == "High Confidence Driver"]
In [9]:
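# Look up one chromosome per gene symbol; symbols without an Entrez match get NaN.
# (DataFrame.from_items and pandas.match have been removed from recent pandas releases.)
symbol2chrom = entrez_gene_info.drop_duplicates("feature_name").set_index("feature_name").chromosome
mut_drivers = pandas.DataFrame({
    "gene": high_conf_drivers,
    "chrom": high_conf_drivers.map(symbol2chrom),
    "type": "mut"})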
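
The chromosome lookup in the cell above can be tried on a small scale. The following is a minimal sketch on made-up rows, assuming the same column names as the Entrez table (`chromosome`, `feature_name`); the `toy_` names are hypothetical and not part of the notebook. It uses `Series.map`, which leaves `NaN` for symbols that have no match.

```python
import pandas

# Toy stand-ins for the Entrez table and the driver symbols (made-up rows).
toy_entrez = pandas.DataFrame({
    "chromosome": ["17", "12", "12"],
    "feature_name": ["TP53", "KRAS", "KRAS"],
})
toy_symbols = pandas.Series(["KRAS", "TP53", "MLL2"])

# One chromosome per symbol; a duplicated symbol keeps its first row.
toy_symbol2chrom = (toy_entrez.drop_duplicates("feature_name")
                              .set_index("feature_name")["chromosome"])

toy_mut_drivers = pandas.DataFrame({
    "gene": toy_symbols,
    "chrom": toy_symbols.map(toy_symbol2chrom),  # NaN when the symbol is absent
    "type": "mut",
})
```

Symbols that come back as `NaN`, like `MLL2` in this toy table, are the ones that need a manual chromosome assignment later on.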
In [10]:
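# Genes in a recurrent segment that are also in the mutational driver or cancer gene list,
# plus genes that are the only gene in their segment.
segment_genes = numpy.concatenate(list(segments.values()))
selected_genes = numpy.union1d(
    numpy.union1d(
        numpy.intersect1d(segment_genes, mut_genes["Gene Symbol"]),
        numpy.intersect1d(segment_genes, cancer_genes.symbol)),
    [g[0].strip("[]") for g in segments.values() if len(g) == 1])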
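
To make the selection rule concrete, here is a small sketch on placeholder data. The segment keys and gene names (`GENEA`, ...) are invented for illustration, and the dictionary only mimics the shape of `segments` (key to gene list).

```python
import numpy

# Placeholder segments in the same shape as `segments`: key -> gene list.
toy_segments = {
    "1p36_loss": ["GENEA", "GENEB", "GENEC"],
    "8q24_gain": ["GENED", "GENEE"],
    "4q35_loss": ["[GENEF]"],  # a segment holding a single, bracketed gene
}
toy_cancer_genes = ["GENEA", "GENED", "GENEX"]
toy_mut_genes = ["GENEB", "GENEY"]

toy_segment_genes = numpy.concatenate(list(toy_segments.values()))

# Segment genes that also occur in one of the two gene lists.
in_cancer_list = numpy.intersect1d(toy_segment_genes, toy_cancer_genes)
in_mut_list = numpy.intersect1d(toy_segment_genes, toy_mut_genes)

# Genes that are alone in their segment, with the brackets removed.
singletons = [genes[0].strip("[]") for genes in toy_segments.values() if len(genes) == 1]

toy_selected = numpy.union1d(numpy.union1d(in_mut_list, in_cancer_list), singletons)
```

In this toy case `GENEC` and `GENEE` are dropped because they are in neither list, while `GENEF` is kept only because it is the sole gene in its segment. Note that brackets are stripped for single-gene segments only, so a bracketed name does not take part in the two intersections.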
In [11]:
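rows = []
for gene in selected_genes:
    # First segment whose gene list (brackets stripped) contains this gene.
    segment = next(seg for seg, genes in segments.items() if gene in [g.strip("[]") for g in genes])
    # The chromosome is the text before the arm letter (p or q); the type is the suffix after "_".
    rows.append((gene, segment[:max(segment.find("p"), segment.find("q"))], segment.rsplit("_", 1)[-1]))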

cn_drivers = pandas.DataFrame(rows, columns=["gene", "chrom", "type"])
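
The slice in the loop above relies on `str.find`, which returns -1 when a letter is absent, so a key without `p` or `q` would silently lose its last character instead of raising an error. A stricter alternative is sketched below. It assumes keys of the form `<chromosome><arm><band>_<type>`, such as `9p21.3_loss`, and the helper name `split_segment_key` is hypothetical.

```python
import re

def split_segment_key(key):
    """Split a key such as "9p21.3_loss" into (chromosome, alteration type)."""
    band, alteration = key.rsplit("_", 1)
    # Chromosome number (or X/Y) followed by the arm letter p or q.
    match = re.match(r"^(\d+|X|Y)[pq]", band)
    if match is None:
        raise ValueError("unrecognised segment key: %r" % key)
    return match.group(1), alteration
```

Calling `split_segment_key("17q12_gain")` gives the chromosome and type as a tuple, and an unexpected key raises a `ValueError` rather than producing a truncated chromosome name.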

The DUX4 gene has incorrectly been assigned to chromosome 10 in the PanCan GISTIC output. In reality, it is on chromosome 4, so we remove it from the list of copy number driver genes.

In [12]:
cn_drivers = cn_drivers[cn_drivers.gene != "DUX4"]
In [13]:
drivers = pandas.concat([cn_drivers, mut_drivers]).sort_values("gene")
In [14]:
drivers.reset_index(drop=True, inplace=True)

The following genes have no chromosome annotation in the combined driver gene table, so we assign their chromosomes manually.

In [15]:
gene2chrom = {
    "AKD1": "6",
    "MLL": "11",
    "MLL2": "12",
    "MLL3": "7"
}
In [16]:
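# Fill in the missing chromosomes; .loc writes to `drivers` directly and avoids chained assignment.
for i, row in drivers[drivers.chrom.isnull()].iterrows():
    drivers.loc[i, "chrom"] = gene2chrom[row.gene]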
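
The same fill can also be written without an explicit loop. The sketch below uses a made-up table (the `toy_` names and `GENEA` are placeholders) and a boolean mask with `.loc`; it is one possible alternative, not the notebook's own method.

```python
import numpy
import pandas

toy_drivers = pandas.DataFrame({
    "gene": ["GENEA", "MLL", "MLL2"],
    "chrom": ["1", numpy.nan, numpy.nan],
    "type": ["loss", "mut", "mut"],
})
toy_gene2chrom = {"MLL": "11", "MLL2": "12"}

# Rows that still lack a chromosome.
missing = toy_drivers["chrom"].isnull()

# Look up those genes in the manual mapping and write the result back in place.
toy_drivers.loc[missing, "chrom"] = toy_drivers.loc[missing, "gene"].map(toy_gene2chrom)
```

One difference from the loop: a gene that is absent from the mapping stays `NaN` here instead of raising a `KeyError`, so a final `isnull().any()` check is worth adding if missing entries should be treated as errors.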
In [17]:
drivers.to_csv("../data/tcga/selected-genes.txt", sep="\t", index=False, na_rep="NA")
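
When the file is read back in a later step, chromosome labels mix numbers and letters (for example `1` and `X`), and `NA` is parsed as a missing value by default. The following sketch shows a round trip through an in-memory buffer on placeholder rows, assuming Python 3; reading with `dtype=str` keeps all chromosome labels as text.

```python
import io

import numpy
import pandas

toy = pandas.DataFrame({
    "gene": ["GENEA", "GENEB"],
    "chrom": ["X", numpy.nan],
    "type": ["gain", "mut"],
})

# Write with the same options as the cell above, but to memory instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, na_rep="NA")
written = buffer.getvalue()

# dtype=str keeps labels such as "1" and "X" as text; "NA" still becomes NaN.
reloaded = pandas.read_csv(io.StringIO(written), sep="\t", dtype=str)
```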