Open databases for copy number variations similar to TCGA


The Cancer Genome Atlas (TCGA) has open data for copy number variation (CNV) from at least 10k different cancer patients. They offer two types of data: CNV data from tumor samples and CNV data from normal tissue samples. Are there any other open databases that offer CNV data for at least one cancer type?


The ICGC has CNV data for many different cancer types. It hosts many data sets, both restricted and open. The DCC releases page will let you hunt through them; those that are public are easily downloaded. They also have matched expression, SNV, DNA methylation, and structural mutation data for many samples.


CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing

High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.


Genomic sequence variation

http://www.1000genomes.org/
Data collection and a catalog of human variation

dbVar and Database of Genomic Variants

Online Mendelian Inheritance in Man

http://www.omim.org/about
OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.

The Exome Aggregation Consortium (ExAC)

http://exac.broadinstitute.org/
ExAC is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 61,486 unrelated individuals sequenced as part of various disease-specific and population genetic studies. We have removed individuals affected by severe pediatric disease, so this data set should serve as a useful reference set of allele frequencies for severe disease studies. All of the raw data from these projects have been reprocessed through the same pipeline, and jointly variant-called to increase consistency across projects.

Encyclopedia Of DNA Elements (ENCODE) Project

http://encodeproject.org/
Links to ENCODE2 uniformly processed histone mark data: https://sites.google.com/site/anshulkundaje/projects/encodehistonemods
Links to other ENCODE2 uniformly processed data: http://genome.ucsc.edu/ENCODE/downloads.html
Data collection, integrative analysis, and a comprehensive catalog of
all sequence-based functional elements

Roadmap Epigenomics Project (NIH Common Fund)

International Human Epigenome Consortium (IHEC)

http://www.ihec-epigenomes.org/
Data collection and reference maps of human epigenomes for key
cellular states relevant to health and diseases

Human BodyMap

Viewable with Ensembl (http://www.ensembl.org/index.html) or the Integrative Genomics Viewer (http://www.broadinstitute.org/igv/)
Gene expression database from Illumina, from RNA-seq data

Cancer Cell Line Encyclopedia (CCLE)

http://www.broadinstitute.org/ccle/home
Array-based expression data, CNV, mutations, and perturbations over a huge collection of cell lines

FANTOM5 Project

http://fantom.gsc.riken.jp/
http://fantom.gsc.riken.jp/5/sstar/Data_source
Large collection of CAGE-based expression data across multiple species (time series and perturbations)

http://www.ebi.ac.uk/gxa/
Database supporting queries of condition-specific gene expression on
a curated subset of the Array Express Archive.

GNF Gene Expression Atlas

Viewable at BioGPS (http://biogps.org/#goto=welcome)
GNF (Genomics Institute of the Novartis Research Foundation) human and mouse gene expression array data.

http://www.proteinatlas.org/
Protein expression profiles based on immunohistochemistry for a large number of human tissues, cancers and cell lines, subcellular localization, transcript expression levels

http://www.uniprot.org/
A comprehensive, freely accessible database of protein sequence and
functional information

http://www.ebi.ac.uk/interpro/
An integrated database of protein classification, functional domains,
and annotation (including GO terms).

Protein Capture Reagents Initiative

http://commonfund.nih.gov/proteincapture/
Resource generation: renewable, monoclonal antibodies and other reagents that target the full range of proteins

Knockout Mouse Program (KOMP)

The Connectivity Map (CMAP)

http://www.broadinstitute.org/cmap/
The Connectivity Map (also known as cmap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes. You can learn more about cmap from our papers in Science and Nature Reviews Cancer.

Library of Integrated Network-based Cellular Signatures (LINCS)

https://commonfund.nih.gov/LINCS/
Data collection and analysis of molecular signatures that describe how
different types of cells respond to a variety of perturbing agents

Genomics of Drug Sensitivity in Cancer

http://www.cancerrxgene.org/
Mutation, CNV, Affy expression and drug sensitivity in

The Drug Gene Interaction database (DGIdb)

Molecular Libraries Program (MLP)

https://commonfund.nih.gov/molecularlibraries/index.aspx
Access to the large-scale screening capacity necessary to identify small molecules that can be optimized as chemical probes to study the functions of genes, cells, and biochemical pathways in health and disease

http://www.brain-map.org/
Data collection and an online public resource integrating extensive gene expression and neuroanatomical data for human and mouse, including variation of mouse gene expression by strain.

http://braincloud.jhmi.edu/
BrainCloud is a freely-available, biologist-friendly, stand-alone application for exploring the temporal dynamics and genetic control of transcription in the human prefrontal cortex across the lifespan. BrainCloud was developed through collaboration between the Lieber Institute and NIMH

The Human Connectome Project

http://www.humanconnectomeproject.org/
Data collection and integration to create a complete map of the structural and functional neural connections, within and across individuals

Geuvadis RNA sequencing project of 1000 Genomes samples

http://www.geuvadis.org/web/geuvadis
mRNA and small RNA sequencing on 465 lymphoblastoid cell line (LCL) samples from 5 populations of the 1000 Genomes Project: the CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI).

http://www.broadinstitute.org/achilles
Project Achilles is a systematic effort aimed at identifying and cataloging genetic vulnerabilities across hundreds of genomically characterized cancer cell lines. The project uses a genome-wide shRNA library to silence individual genes and identify those genes that affect cell survival. Large-scale functional screening of cancer cell lines provides a complementary approach to those studies that aim to characterize the molecular alterations (mutations, copy number alterations, etc.) of primary tumors, such as The Cancer Genome Atlas. The overall goal of the project is to link cancer genetic dependencies to their molecular characteristics in order to identify molecular targets and guide therapeutic development.

Human Ageing Genomic Resources

The Cancer Genome Atlas (TCGA)

http://cancergenome.nih.gov/
Data collection and a data repository, including cancer genome sequence data

International Cancer Genome Consortium (ICGC)

http://www.icgc.org/
Data collection and a data repository for a comprehensive description of genomic, transcriptomic and epigenomic changes of cancer

Genotype-Tissue Expression (GTEx) Project

https://commonfund.nih.gov/GTEx/
Data collection, data repository, and sample bank for human gene expression and regulation in multiple tissues, compared to genetic variation

Knockout Mouse Phenotyping Program (KOMP2)

https://commonfund.nih.gov/KOMP2/
Data collection for standardized phenotyping of a genome-wide collection of mouse knockouts

Database of Genotypes and Phenotypes (dbGaP)

http://www.ncbi.nlm.nih.gov/gap
Data repository for results from studies investigating the interaction of genotype and phenotype

NHGRI Catalog of Published GWAS

http://www.genome.gov/gwastudies/
Public catalog of published Genome-Wide Association Studies

Clinical Genomic Database

http://research.nhgri.nih.gov/CGD/
A manually curated database of conditions with known genetic causes, focusing on medically significant genetic data with available interventions.

NHGRI's Breast Cancer information core

http://www.ncbi.nlm.nih.gov/clinvar/
ClinVar is designed to provide a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar collects reports of variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported according to the HGVS standard. ClinVar then presents the data for interactive users as well as those wishing to use ClinVar in daily workflows and other local applications. ClinVar works in collaboration with interested organizations to meet the needs of the medical genetics community as efficiently and effectively as possible.

Human Gene Mutation Database (HGMD)

http://www.hgmd.cf.ac.uk/ac/
The Human Gene Mutation Database (HGMD®) represents an attempt to collate known (published) gene lesions responsible for human inherited disease

NHLBI Exome Sequencing Project (ESP) Exome Variant Server

http://evs.gs.washington.edu/EVS/
The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.

http://ghr.nlm.nih.gov/
Genetics Home Reference is the National Library of Medicine's web site for consumer information about genetic conditions and the genes or chromosomes related to those conditions.

http://www.ncbi.nlm.nih.gov/books/NBK1116/
GeneReviews are expert-authored, peer-reviewed disease descriptions presented in a standardized format and focused on clinically relevant and medically actionable information on the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions.

Global Alzheimer's Association Interactive Network (GAAIN)

http://www.gaain.org/
The Global Alzheimer’s Association Interactive Network (GAAIN) is a collaborative project that will provide researchers around the globe with access to a vast repository of Alzheimer’s disease research data and the sophisticated analytical tools and computational power needed to work with that data. Our goal is to transform the way scientists work together to answer key questions related to understanding the causes, diagnosis, treatment and prevention of Alzheimer’s and other neurodegenerative diseases.
In 2013, obtained WGS data for the largest cohort of 800 Alzheimer's patients

The Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

http://web.chargeconsortium.com/
The Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium was formed to facilitate genome-wide association study meta-analyses and replication opportunities among multiple large and well-phenotyped longitudinal cohort studies. They also have DNA methylation data alongside WGS and Exome Seq.

The NIMH Center for Collaborative Genomic Studies on Mental Disorders


Results

Comprehensive epigenomic profiling in both BLCA lines and primary tumors

In this project, we performed RNA-Seq, ChIP-Seq for histone 3 lysine 27 acetylation (H3K27ac), Assay for Transposase-Accessible Chromatin using sequencing (ATAC-Seq), and genome-wide chromatin conformation capture experiments (Hi-C) on 4 bladder cancer cell lines (Fig. 1a), two of which (RT4 and SW780) were previously annotated as luminal and two of which (SCABER and HT1376) were characterized as basal [8, 25]. Based on the RNA-Seq data generated in this study, we used a previously reported molecular subtyping approach [26] to confirm assignment to luminal and basal states. Our results confirmed RT4 and SW780 as belonging to the Luminal-papillary subtype, while SCABER and HT1376 belong to the Basal/squamous subtype (Additional file 1: Table S1). Each experiment in bladder cancer cell lines has at least two biological replicates (Additional file 2: Table S2), and we observed a high correlation between the two replicates (Additional file 3: Table S3). More importantly, we performed the same set of experiments on four patient muscle-invasive bladder tumors as well. Using the same molecular subtyping method, we determined their subtypes as follows: T1 is Luminal-papillary, T3 is Stroma-rich, and T4 and T5 are Basal/squamous.

Luminal and basal transcriptional BLCA subtypes are associated with distinct promoter and distal enhancers’ activity at the epigenetic level. a Overall design of the study. b Differential expression gene (DEG) analysis of luminal cell lines (RT4 and SW780) and basal cell lines (SCABER and HT1376) shows 427 basal-specific upregulated genes and 524 luminal-specific upregulated genes. c Heatmap of differential H3K27ac ChIP-Seq at promoters (left). Signal H3K27ac intensity profiles for each cluster of BLCA cells (right). d Genome browser signal tracks for a panel of luminal and basal genes. Shown here are the tracks of H3K27ac ChIP-Seq, ATAC-Seq, and RNA-Seq data in RT4, SW780, SCABER, and HT1376 cells. e Promoter H3K27ac and its associated RNA-Seq signals for selected luminal and basal genes shows remarkable similarity. f Integrated H3K27ac peaks at distal enhancers and RNA-Seq gene expression association model identifies putative enhancers and gene regulation. Top 10,000 most variable enhancers (left heatmap) are plotted along with their corresponding gene expression (right heatmap). g Correlations of genome-wide H3K27ac signals between the bladder cancer cell lines and tumor samples demonstrate similarity of enhancer landscape

Luminal and basal transcriptional BLCA subtypes are associated with distinct promoter and distal enhancers activity at the epigenetic level

Enrichment of H3K27ac signals has been used to predict both active promoters and distal enhancers [27, 28]. Therefore, we first performed ChIP-Seq for H3K27ac in all four cell types and four patient samples. We observed that biologic replicates following H3K27ac ChIP-seq always clustered together, indicating our results are highly reproducible (Additional file 4: Figure S1A). Further, we found that two luminal subtypes (RT4 and SW780) clustered together, while two basal (SCABER and HT1376) cell lines are grouped together as well (Additional file 4: Figure S1A). These clustering results suggest global epigenomic profiles accurately reflect cell identity. The hierarchical clustering in the cell lines based on H3K27ac signals was also mirrored by global mRNA expression by RNA-Seq data (Additional file 4: Figure S1B). We performed differential gene expression analysis on the two groups of cell types (RT4 and SW780 vs. SCABER and HT1376) and identified 427 basal-specific (Additional file 5: Table S4) and 524 luminal-specific genes (Fig. 1b, Additional file 6: Table S5).

Next, we examined promoter usage based on H3K27ac signals at known genes. We confirmed that promoter H3K27ac intensities are remarkably similar to gene expression (Fig. 1c), and clustering analysis based on promoter H3K27ac intensity was able to distinguish luminal and basal models of BLCA (Additional file 4: Figure S1C). For example, we observed that two luminal subtype BLCA cell lines RT4 and SW780 have similar H3K27ac patterns at luminal genes FOXA1, GATA3, and PPARG (Fig. 1d, e), while the two basal cell lines share similar promoter marks at genes encoding the basal/squamous markers KRT5/14. Interestingly, although based on global gene expression, HT1376 is classified as a basal/squamous subtype, it shows a similar promoter H3K27ac pattern at luminal genes (GATA3, KRT7/8/18, Fig. 1e).

H3K27ac peaks distal to gene promoter regions have been used as markers for active enhancers [27, 29]. We took the same approach here, and on average, we predicted 59,466 (40,731–78,506) enhancers in each cell line (Additional file 7: Table S6). To link the distal enhancers to their target genes, we performed a correlation-based distal-enhancer peak-gene association as described in [30] and identified the top 10,000 variable distal enhancers that show significant correlation to their linked genes (correlation ≥ 0.5, p < 0.01; a total of 58,509 satisfied our criteria; Fig. 1f and Additional file 8: Table S7). We observed that the enhancers show clear clustering according to different cell types, and their target genes show similar cell-type-specific patterns (Fig. 1f and Additional file 4: Figure S1D). Moreover, to understand the clinical relevance of our findings, we performed H3K27ac ChIP-Seq in four muscle-invasive bladder patient samples. Our results show a remarkable correlation between tumors and cell lines (Fig. 1g). In summary, we show in these cell lines and in a limited tumor cohort that epigenetic regulation is correlated with molecular subtype assignment.
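The correlation-based enhancer-gene association described above can be illustrated with a minimal Python sketch. It is not the procedure of [30]: the data, the candidate pairs, and the function names are hypothetical, and only the correlation cutoff (≥ 0.5) is applied; the p < 0.01 filter is left out to keep the example dependency-free.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant signal carries no correlation information
    return sxy / sqrt(sxx * syy)

def link_enhancers(enhancer_signal, gene_expression, candidates, min_r=0.5):
    """Return (enhancer, gene, r) for candidate pairs with r >= min_r."""
    links = []
    for enh, gene in candidates:
        r = pearson(enhancer_signal[enh], gene_expression[gene])
        if r >= min_r:
            links.append((enh, gene, round(r, 3)))
    return links

# Hypothetical toy data: one value per sample, in the same sample order.
enhancer_signal = {
    "enh1": [9.0, 8.0, 1.0, 2.0],   # distal H3K27ac signal
    "enh2": [1.0, 5.0, 2.0, 6.0],
}
gene_expression = {
    "GENE_A": [10.0, 9.0, 2.0, 1.0],  # RNA-Seq expression
    "GENE_B": [5.0, 1.0, 6.0, 2.0],
}
# Candidate pairs would normally come from genomic proximity (assumption).
candidates = [("enh1", "GENE_A"), ("enh2", "GENE_B")]

links = link_enhancers(enhancer_signal, gene_expression, candidates)
print(links)
```

In this toy input only the first pair passes the cutoff, because the second enhancer is anti-correlated with its candidate gene.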

Distinct sets of transcription factor motifs are enriched in luminal and basal BLCA-associated cis DNA regulatory regions

We performed ATAC-Seq in RT4, SW780, SCABER, and HT1376 cell lines to evaluate their open chromatin status in the genome. On average, in each cell line, we identified 32,000 open chromatin regions (Fig. 2a and Additional file 9: Table S8). Among them, 40.8% of open chromatin regions were located at promoter regions, while 59.2% were located at distal regions. Overall, > 90% of the open chromatin promoter regions overlap with H3K27ac (Additional file 4: Figure S2A, S2C–D). The overlap of distal ATAC-Seq peaks and H3K27ac is lower (Additional file 4: Figure S2A and Additional file 10: Table S9), at least partially due to the different numbers of peaks in different datasets. Genome-wide correlation of ATAC-Seq showed that HT1376 and SCABER clustered together with 80% similarity (Additional file 4: Figure S2E) compared to luminal RT4 (65%). We noted that this observation agrees with the RNA-Seq-based clustering and H3K27ac-based clustering (Additional file 4: Figure S1A and B).

Distinct sets of transcription factor motifs are enriched in luminal and basal BLCA-associated cis DNA regulatory regions. a A comprehensive and distinct set of distal ATAC-Seq signals at three clusters (luminal specific, basal specific, and shared) and corresponding H3K27ac signals. b TF motif analysis results are shown as a ranked plot (left) and motifs (right) for luminal-specific (top) and basal-squamous-specific (bottom) open chromatin enhancers. c FOXA1- and GATA3-bound open chromatin located at distal enhancers of the RT4 luminal cell line is depicted in three groups: FOXA1 only, GATA3 only, and FOXA1 and GATA3 binding sites. d Gene ontology analysis of pathways for each group of binding sites (FOXA1 only, FOXA1 and GATA3, and GATA3 only). e Observed occurrence of TF motifs (AP-1, FOX Forkhead, and GATA) at distal enhancers and promoters of the three groups. f Genome-wide open chromatin profiles of BLCA cell lines show similarity with TCGA bladder tumors [30]

Next, we performed motif analysis of these open chromatin regions (Additional file 11: Table S10). We observed that binding sites for CTCF and AP-1 complex are enriched in all cell lines (Fig. 2b and Additional file 4: Figure S2G). Further ranking of TF motifs by enrichment p-value revealed luminal open chromatin regions (shared between RT4 and SW780) were enriched with binding motifs for GRHL2, TP53, and TP63 while basal open chromatins (shared between SCABER and HT1376) were enriched for TEAD1/4 and KLF factor (Fig. 2b) binding motifs. GRHL2 [31] was previously reported to be a luminal gene, thereby validating our findings. Interestingly, binding motifs for AP-1 complex proteins FOSL1/2, JUN/JUNB, ATF3, and BATF TFs [32] were the topmost enriched motifs for both luminal and basal-squamous open chromatins. We then comprehensively mapped all the enriched TF motifs in luminal, basal-squamous and shared open chromatins of distal enhancers to examine the relationship between TFs and BLCA subtypes (Additional file 11: Table S10). We discovered that at distal enhancers, the luminal BLCA subtypes are associated with previously reported steroid hormone receptor TFs. On the other hand, basal-squamous open chromatin areas at enhancers show enrichment of previously unreported factors MADS box TF MEF2C and the homeobox TF OTX2. Not surprisingly, luminal pioneering TFs such as forkhead transcription factors (FOXA1/2/3, FOXF1, FOXK1, FOXM1), and GATA TFs (GATA3/4/6) were enriched in luminal-associated enhancers with an open chromatin conformation. More surprisingly, forkhead and GATA motifs were also identified as being associated with open chromatin at enhancer elements across cell lines (Additional file 11: Table S10). 
While FOXA1 and GATA3 are known to have low expression in basal bladder cancer cell lines and tumors, the enrichment of forkhead and GATA motifs in open chromatins across BLCA cell lines suggests compensation by Forkhead and GATA factors other than FOXA1 and GATA3. In addition, Forkhead and GATA motif enrichment across cell lines in areas of open chromatin may indicate luminal-specific TFs are poised to bind to these areas of open chromatin. Furthermore, FOXA1 and GATA3 are known to play a role in the development of urothelium [31] suggesting that their binding sites may be primed early during development. We also discovered that the stem-cell-associated pioneering TFs such as KLF factors (KLF10/14), ATF factors (ATF1/2/4/7), and NANOG were enriched in basal-associated enhancers. This is interesting because there exists a progenitor cell population within basal urothelium that can contribute to urothelial development and differentiation [33, 34].

FOXA1 and GATA3 bind at luminal open chromatins at regulatory distal enhancers to drive expression of luminal-specific genes

We hypothesized that TFs such as FOXA1 and GATA3 bind at the open chromatin region to pioneer luminal enhancers and activate associated gene expression. To test this hypothesis, we performed GATA3 ChIP-Seq in the RT4 luminal BLCA cell line and obtained FOXA1 ChIP-Seq in RT4 cells from our previously published work (Additional file 12: Table S11) [8]. As predicted, luminal TFs FOXA1 and GATA3 showed enriched binding at the open chromatin loci of luminal-associated (FOXA1, GATA3, PPARG, FGFR3, and FABP4) distal enhancers (Fig. 2c). More specifically, we discovered 1325 distal enhancers that show co-binding of both FOXA1 and GATA3 in RT4 (Fig. 2c). Similarly, FOXA1 and GATA3 showed enriched binding at open chromatin loci of luminal marker genes (FOXA1, ERBB3, KRT19, GPX2, and FABP4) promoters (Additional file 4: Figure S2F).

GO term analysis of genes proximal to these distal enhancer sites showed regulation of TGF beta production, epithelium development, regulation of transcription involved in cell fate commitment, and cell-cell adhesion biological processes (cadherin binding and adherens junction assembly) as terms associated with FOXA1. In addition, regulation of cellular component, cell size, and apical plasma membrane biological processes were terms identified with GATA3-bound genes proximal to these distal enhancers, suggesting a strong involvement of both TFs in commitment to cell fate and luminal differentiation (Fig. 2d). In regard to proximal genes associated with distal enhancers bound by both FOXA1 and GATA3, terms identified were associated with various developmental processes and the regulation of mucus secretion and fat cell differentiation, both important metabolic attributes of differentiated urothelium (Fig. 2d).

We then proceeded with the motif analysis of FOXA1 only, GATA3 only, and co-bound sites. Surprisingly, AP1-complexes were enriched specifically in all distal enhancers in addition to FOXA or GATA motifs (Fig. 2e). The order of binding of these three factors remains to be investigated. Finally, to understand the clinical relevance of our findings, we compared our four BLCA cell lines to the TCGA muscle-invasive bladder tumor ATAC-Seq data [30] and discovered that the genome-wide open chromatin profile in our cell lines is clustered with distinct clusters of tumors (Fig. 2f), suggesting that the open chromatin regions in these cell lines share similar patterns with patient tumors.

Luminal and basal subtypes of BLCA show potentially distinct 3D genome organizations

Previous studies have shown that 3D chromatin organization is associated with epigenetic activation or silencing of genes in cells [35]. For example, the majority of heterochromatin is known to be compressed in nuclei and located near the lamina-associated periphery of the nuclear envelope [35]. To obtain initial insights into the genome-wide 3D landscape of luminal and basal BLCA, we performed high-resolution Hi-C experiments on all four cell lines (at least 800 M reads each) and five bladder tumor patients (> 800 M reads each) (Additional file 4: Figure S3). We used our recently developed software, Peakachu [36], a machine learning-based chromatin loop detection approach, to predict loops at 10-kb bin resolution. First, we identified an average of 56,315 loops (range 38,271 to 69,032) in the four cell lines (prob > 0.8; Additional file 13: Table S12). Then, using the probability score output from Peakachu, we assigned subtype-specific chromatin loops as shown in the Aggregate Peak Analysis (APA, Fig. 3a and Additional file 14: Table S13) [37]. Based on our approach, we observed more potentially luminal-specific loops in RT4 and SW780 (2299) relative to the basal BLCA models SCABER and HT1376 (2144). We then compared each of these categories with loops detected in five patient samples (Fig. 3b): 30–40% of luminal-assigned and basal-assigned 3D chromatin loops identified in the cell lines were observed in these five tumor samples.
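The text does not spell out the exact rule used to assign subtype-specific loops from the Peakachu probability scores, so the following Python sketch shows one plausible rule under stated assumptions. A loop is called subtype-specific when its probability exceeds 0.8 in every cell line of one subtype and falls below a hypothetical lower cutoff (0.5) in every cell line of the other. The loop IDs, scores, and function name are illustrative only.

```python
def assign_loops(loop_probs, luminal, basal, high=0.8, low=0.5):
    """Label each loop as 'luminal', 'basal', 'shared', or 'unassigned'.

    loop_probs maps a loop ID to {cell_line: probability score}.
    The `low` cutoff and the "all lines in the group" rule are assumptions,
    not the published assignment procedure.
    """
    labels = {}
    for loop, probs in loop_probs.items():
        lum_high = all(probs.get(c, 0.0) > high for c in luminal)
        bas_high = all(probs.get(c, 0.0) > high for c in basal)
        lum_low = all(probs.get(c, 0.0) < low for c in luminal)
        bas_low = all(probs.get(c, 0.0) < low for c in basal)
        if lum_high and bas_high:
            labels[loop] = "shared"
        elif lum_high and bas_low:
            labels[loop] = "luminal"
        elif bas_high and lum_low:
            labels[loop] = "basal"
        else:
            labels[loop] = "unassigned"
    return labels

# Hypothetical probability scores for four loops.
loop_probs = {
    "loop1": {"RT4": 0.95, "SW780": 0.90, "SCABER": 0.10, "HT1376": 0.20},
    "loop2": {"RT4": 0.10, "SW780": 0.30, "SCABER": 0.92, "HT1376": 0.85},
    "loop3": {"RT4": 0.90, "SW780": 0.90, "SCABER": 0.90, "HT1376": 0.90},
    "loop4": {"RT4": 0.90, "SW780": 0.60, "SCABER": 0.10, "HT1376": 0.10},
}
labels = assign_loops(loop_probs, ["RT4", "SW780"], ["SCABER", "HT1376"])
print(labels)
```

Requiring agreement across both cell lines of a subtype is a conservative choice: a loop that is strong in only one luminal line, such as the last toy entry, stays unassigned.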

Luminal and basal subtypes of bladder cancers show potentially distinct 3D genome organizations. a Hi-C loop analysis of luminal and basal-squamous cell lines shows distinct luminal loops and basal-squamous loops. b Contacts identified in luminal and basal-squamous cell lines are shared and validated in five bladder cancer tumor samples. c Genome-browser tracks for a selected luminal gene (FOXA1) and basal gene (KRT5) that contain enhancer-promoter loops are shown here. Arcs indicate the predicted chromatin loops using Hi-C data. d The type of contacts based on the overlap of contact location at either enhancer (H3K27ac at distal region) or promoter (H3K27ac and H3K4me3 at promoter) in each cell line is shown. E-P, enhancer-promoter loops; E-E, enhancer-enhancer loops; P-P, promoter-promoter loops; E-N, enhancer-non regulatory loops; P-N, promoter-non regulatory loops; None, non-regulatory loops. e Enrichment of FOXA1 (left axis) and GATA3 (right axis) binding sites in RT4 (luminal) cells is shown here at its loop anchors

Finally, we examined enhancer and promoter loops in each category for their association with subtype-specific gene expression. Examples are shown in Fig. 3c, in which we found that the luminal gene FOXA1 and the basal gene KRT5 showed an increased number of enhancer-promoter loops in luminal and basal cell lines, respectively. Overall, we observed that 40% of the chromatin loops exist between enhancers and promoters (Fig. 3d). Furthermore, we found a significant enrichment of FOXA1 and GATA3 binding sites at these loop anchors, indicating the involvement of these pioneer factors in the regulation of the 3D genome (Fig. 3e). This finding is in agreement with previous studies reporting the enrichment of FOXA1 binding sites in enhancer-promoter loops [38].
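The loop categories used in Fig. 3d (E-P, E-E, P-P, E-N, P-N, None) follow from which regulatory element each loop anchor overlaps. A minimal Python sketch of that classification is given below; the coordinates are made up, and the tie-break (an anchor overlapping both a promoter and an enhancer counts as a promoter) is an assumption rather than something stated in the text.

```python
from collections import Counter

def overlaps(anchor, regions):
    """True if the anchor (chrom, start, end) overlaps any region."""
    chrom, start, end = anchor
    return any(c == chrom and s < end and start < e for c, s, e in regions)

def anchor_type(anchor, enhancers, promoters):
    """Return 'P', 'E', or 'N'; promoter wins ties (assumption)."""
    if overlaps(anchor, promoters):
        return "P"
    if overlaps(anchor, enhancers):
        return "E"
    return "N"

# Map the sorted pair of anchor types to the category names of Fig. 3d.
CATEGORY = {"EP": "E-P", "EE": "E-E", "PP": "P-P",
            "EN": "E-N", "NP": "P-N", "NN": "None"}

def loop_type(anchor1, anchor2, enhancers, promoters):
    pair = sorted([anchor_type(anchor1, enhancers, promoters),
                   anchor_type(anchor2, enhancers, promoters)])
    return CATEGORY["".join(pair)]

# Hypothetical regulatory elements and loops at 10-kb bins.
enhancers = [("chr1", 1000, 2000)]
promoters = [("chr1", 50000, 51000)]
loops = [
    (("chr1", 0, 10000), ("chr1", 50000, 60000)),
    (("chr1", 0, 10000), ("chr2", 0, 10000)),
    (("chr2", 0, 10000), ("chr2", 20000, 30000)),
]
types = [loop_type(a, b, enhancers, promoters) for a, b in loops]
counts = Counter(types)
print(types, counts)
```

Dividing each count by the total number of loops gives the per-category fractions of the kind summarized in the text.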

Copy number variation (CNV) and chromatin loops in bladder cancer

A hallmark of cancer is large structural variations (SVs), which include inversions, deletions, duplications, and translocations. Recently, it has been shown that CNVs and SVs can alter 3D genome structure, including the formation of new topologically associated domains (“neo-TADs”) [39] and resultant “enhancer hijacking” [40]. Neo-TADs refer to scenarios where an SV event leads to the formation of new chromatin domains, which can in turn affect the expression profiles of the genes located in those regions. In the “enhancer-hijacking” model, altered 3D genome organization results in abnormal enhancer interaction, with enhancers brought into close proximity to the wrong target gene (usually an oncogene), resulting in inappropriate target activation.

We first systematically identified copy number variations (CNVs) and SV events using the Hi-C data with HiNT [41] and the HiCBreakfinder [42] software. We identified tens of large SVs, including inversions, deletions, and translocations (Fig. 4a, b, Additional file 4: Figures S4–S5, Additional file 15: Table S14). As might be expected, we observed fewer CNVs in the patient samples than in cell lines. More importantly, we were able to reconstruct the local Hi-C map surrounding the breakpoints of the SVs. We can observe interesting enhancer-hijacking events and the formation of neo-TADs in these local Hi-C maps (Fig. 4c–h). These observations provide an important resource to further study the function of the rearranged enhancers in the context of bladder cancer.

Chromatin interactions induced by structural variation (SV) events. a, b Circos plot showing intra- and inter-chromosome SVs in SCABER (a) and SW780 (b). c A large intra-chromosomal translocation on chr9. d–h Inter-chromosomal translocations. The breakpoints were identified by the HiCBreakfinder software. We then reconstructed the local Hi-C maps across the breakpoints. RNA-Seq and H3K27ac ChIP-Seq tracks from the same cell type are shown below the Hi-C maps

Neuronal PAS Domain Protein 2 (NPAS2) is a novel luminal BLCA TF which regulates luminal gene expression and cell migration

Genome-wide open chromatin analysis of BLCA cell lines provides an ideal platform for the identification of novel transcriptional regulators of BLCA cell fate and phenotype. Here we performed motif analysis of luminal-associated, basal-associated, and shared open chromatin regions, resulting in the identification of distinct TFs in each cluster. Many of them represent known families of subtype-specific regulators, such as the GATA, FOX, and ETS families at luminal-associated ATAC-Seq peaks. We also noticed a potential novel bHLH-containing regulator, NPAS2, whose motif is enriched in the luminal-associated and shared clusters but not in basal-associated ATAC-Seq peaks (Fig. 5a). We examined its binding profile using the latest ENCODE data (HEPG2 cells) [43] and found that NPAS2 binds at the FOXA1 promoter region (Fig. 5b), but not at regulatory regions for basal marker genes. This suggests the possibility that NPAS2 may be an upstream regulator of FOXA1. We then checked the TCGA data and found that a high expression level of NPAS2 is significantly correlated with overall patient survival (Fig. 5c).

NPAS2 is a novel bladder cancer regulator. a p-values of NPAS2 motif in luminal-associated (RT4, SW780), basal-associated (SCABER, HT1376), and shared open chromatin regions. b NPAS2 ChIP-seq signal near luminal marker genes FOXA1, GATA3, and PPARG in HEPG2 cell line. c NPAS2 Kaplan-Meier curve is shown here for 2000 days with log-rank statistics and hazards ratio. d Transwell migration assay representative crystal violet staining (left) and quantification of differences in transwell migration (right) are shown following overexpression of NPAS2 in SCABER. e RT-qPCR results for basal marker genes KRT5, KRT6A, STAT3, and TFAP2C are shown here for wild-type and NPAS2 overexpressed SCABER basal cell line. f NPAS2, FOXA1/GATA3, and PPARG RT-qPCR are shown here for wildtype and FOXA1/GATA3 overexpressed SCABER basal cell line

To further determine whether NPAS2 expression influences the downstream target expression and phenotype, we overexpressed NPAS2 in the basal-squamous BLCA cell line SCABER. First, we performed trans-well migration assays and found that overexpression of NPAS2 in SCABER cells decreased cell trans-well migration (Fig. 5d). We then performed RT-qPCR experiments and found that the basal marker genes (such as KRT5, KRT6A, and TFAP2C) are significantly downregulated (Fig. 5e) following NPAS2 overexpression, suggesting NPAS2 represses the expression of a subset of basal marker genes.

Because our functional genomics analysis suggests that FOXA1 and GATA3 cooperate to regulate luminal target genes [8], we individually overexpressed FOXA1 and GATA3 in SCABER cells to test their ability to regulate NPAS2 expression. We observed increased expression of NPAS2 by both FOXA1 and GATA3 overexpression (Fig. 5f).


Discussions

Advances in single-cell technologies present new challenges and opportunities for making biological discoveries. Single-cell studies often involve large numbers of cells, which are powerful at characterizing cellular heterogeneity, but small numbers of biological samples, which are underpowered for discovering common disease genes. Recent genome-wide association analysis has shown that new discoveries are possible when association analysis is performed at cell-type resolution [55]. For cancer and genetic diseases driven by somatic mutations, being able to obtain genetic footprints at various times and conditions can enable discovery of genes responsible for disease progression and resistance to therapy.

However, it remains unclear what analytical strategies should be deployed to achieve these benefits. The problem becomes even more challenging when CNAs are considered, as CNAs affect large regions of the genome and are difficult to trace using phylogenetic methods.

In our study, we demonstrated that it is possible to achieve the benefit by reconstructing copy number evolution history as a lineage tree, i.e., MEDALT, and performing permutation-based statistical analysis, i.e., LSA, to identify fitness-associated CNAs and genes.

We have learned several important lessons in our study.

First, it is important to perform accurate lineage tracing. Although the single-copy gain and loss model that we implemented in deriving MEDALTs is limited in complexity, it already performed substantially better than conventional phylogenetic algorithms such as MP, which assumes infinite sites, and NJ, which employs naïve distance metrics, as shown in our simulation and in real data analysis. It is conceivable that further methodological development that incorporates more complex genome evolution mechanisms such as chromothripsis [56] can lead to better results.

An important goal was to represent convergent evolution, which is likely prevalent among CNAs [10, 57]. Conventional phylogenetic algorithms strictly prohibit the expression of convergent evolution by disallowing an alteration from occurring multiple times in the course of evolution [28]. Several new algorithms relaxed this limitation but were designed for analyzing point mutation data [58]. As shown in our analysis of the TNBC patients, genes identified based on convergent evolution analysis (i.e., PLSA) had an even higher fraction of known cancer genes than those identified based on cohort-level single-lineage LSA. Our result suggests that examining convergent evolution is likely a key component towards fully unleashing the power of single-cell studies.

Unlike canonical phylogenetic trees, MEDALTs are minimal spanning trees that do not contain unobserved internal ancestral nodes. Representing evolution using minimal spanning trees instead of phylogenetic trees was a deliberate choice, as it allowed us to develop polynomial-runtime solutions that scale to real datasets containing thousands of cells. It also allowed us to conveniently implement the biologically meaningful MED and enforce directionality constraints. Phylogenetic algorithms are likely effective when the number of cells is small and the alterations are simple to trace. Neither of these conditions applies to available SCCN datasets, in which CNAs evolve non-linearly in hundreds of cells. Moreover, we have shown in our simulation that, for the purpose of detecting fitness-associated alterations, our method outperformed phylogenetic approaches across a wide range of sample sizes.
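To make the spanning-tree idea concrete, the following is a minimal sketch and not the MEDALT implementation. It assumes integer copy-number profiles per cell, uses the Manhattan distance as a simple stand-in for the MED, and builds an undirected tree with Prim's algorithm from a chosen root, so the directionality constraints mentioned above are not modelled. The function names and the toy profiles are hypothetical.

```python
from typing import List, Sequence, Tuple


def manhattan(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute copy-number differences (stand-in for the MED)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minimum_spanning_tree(
    profiles: List[Sequence[int]], root: int = 0
) -> List[Tuple[int, int, int]]:
    """Prim's algorithm; returns (parent, child, distance) edges.

    Every node is an observed cell: no unobserved ancestors are created.
    """
    in_tree = [root]
    edges = []
    while len(in_tree) < len(profiles):
        best = None
        for u in in_tree:
            for v in range(len(profiles)):
                if v in in_tree:
                    continue
                d = manhattan(profiles[u], profiles[v])
                if best is None or d < best[2]:
                    best = (u, v, d)
        edges.append(best)
        in_tree.append(best[1])
    return edges


# Toy data: cell 0 is diploid and serves as the root.
cells = [[2, 2, 2], [2, 3, 2], [2, 4, 2], [1, 2, 2]]
for parent, child, dist in minimum_spanning_tree(cells):
    print(f"cell {parent} -> cell {child} (distance {dist})")
```

This naive version runs in polynomial time (cubic in the number of cells as written), which is in line with the scalability argument above; a priority queue would reduce the cost further.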

A particular challenge in developing and evaluating computational lineage tracing methods is the lack of exact ground truth. Although various experimental technologies have been developed [59, 60], we are not aware of any that can be applied to trace copy number evolution in patient samples. To circumvent this, we utilized in silico simulation that mimics several prevalent CNA mechanisms to evaluate the accuracies of the reconstructed lineages and fitness-associated alterations. We also utilized longitudinal datasets on which we knew the biological stages of the cells to evaluate the chronological accuracy of the inference results. Although these strategies are unlikely sufficient to validate all the edges and lengths in the trees, they are objective and sufficient to discriminate various approaches.

Second, it is important to control biases in statistical inference. It is challenging to detect fitness-associated genes, as CNAs often affect a large number of genes and sample sizes are often small. Passenger CNAs that occur naturally in non-functional regions, such as those near fragile sites or repeats, could easily cloud the discovery. In addition, lineage tracing algorithms are unlikely to be perfect and could introduce distinct biases. To address these challenges, we employed LSA, which randomly permutes SCCN profiles into different cells to reduce the biases introduced by background genomic variation and technical noise. We also reconstructed trees from permuted datasets to alleviate biases introduced by the lineage tracing algorithms. The evolutionarily meaningful MED metrics and constraints help our analyses to focus on biologically relevant hypotheses, given limited computational resources. These procedures appeared important for achieving accuracy. Further exploration of different ways to permute the data and to estimate the background distribution will likely lead to better results.
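The general logic of a permutation-based test can be sketched as follows. This is a generic illustration under simplifying assumptions, not the LSA procedure itself: the statistic is the mean copy number of one gene within a fixed group of cells (a hypothetical "lineage"), the null distribution is obtained by shuffling copy-number values across cells, and no trees are rebuilt. All names and values are illustrative.

```python
import random
from typing import List, Sequence


def permutation_p_value(
    values: Sequence[float],
    group_idx: Sequence[int],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """One-sided empirical p-value for an elevated group mean."""
    rng = random.Random(seed)

    def stat(vals: Sequence[float]) -> float:
        return sum(vals[i] for i in group_idx) / len(group_idx)

    observed = stat(values)
    shuffled: List[float] = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign profiles to cells at random
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: cells 0-2 form the lineage and all carry an amplification.
copy_number = [4, 4, 4, 2, 2, 2, 2, 2, 2, 2]
print(permutation_p_value(copy_number, group_idx=[0, 1, 2]))
```

Because the null distribution is built from the data themselves, background variation that is shared by all cells is reflected in the permuted statistics as well as in the observed one.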

We assessed the functional impact of the identified genes using cell-line CRISPR essentiality screen data. We confirmed that the set of fitness-associated, amplified genes discovered by our methods are significantly more essential than other control gene sets in cancer cell lines. We also nominated novel genes that appear to have prognostic values in TCGA and the METABRIC datasets. These assessment strategies likely have false positives and negatives. Further comprehensive, well-controlled and targeted experiments will likely be required to fully assess the functional impact and clinical values of these genes.

Lastly, it was exciting to observe benefits of our methods on both the scDNA-seq and the scRNA-seq data. Although RNA-derived copy number profiles may not be as accurate as those derived from DNAs, previous studies [61] suggested that they can reasonably distinguish tumor clones. Our study further revealed the value of scRNA-seq data in lineage tracing and supported the notion that genomic profiles, even approximations, are more accurate than transcriptomic profiles in determining biological timing of cells. Our results opened doors towards utilizing scRNA-seq as a platform to understand genetics underlying developmental processes and perform gene discovery.


CONCLUSION

The number of users proves that MEXPRESS, through its ease of use and unique, integrative data overview, found its place in the toolbox of many researchers. By combining a comprehensive visualization and statistical analysis in a single figure, MEXPRESS helps researchers quickly identify dysregulations and their clinical relevance in cancer. With this major, feedback-driven update, we aim to consolidate MEXPRESS’s place in the set of open source web tools available to researchers and clinicians.


Methods

Haploproficient genes and orthology analysis

The set of S. cerevisiae genes which are haploproficient in turbidostat culture was obtained using the growth data of [8] and an FDR cut-off of 0.02. This stringent FDR cut-off rigorously defines those genes for which heterozygosity confers a strong fitness advantage, but has no effect on the functional enrichment of genes identified as haploproficient. Genes defined as ‘haploproficient’ for the purposes of this study are listed in Additional file 1: Table S1. The set of chromosome maintenance-associated HP genes described in [8] overlaps, but is not coincident, with the HPGI set studied here, since the current set also includes DNA damage-response genes.

Orthology assignments were made using the InParanoid algorithm [50] and compared with the results of a BLAST [51] reciprocal best-hits search. GO enrichment searches were performed using the Babelomics 4 FatiGO tool [52]. To assess the significance of HP gene conservation, the number of HP genes having orthologs in a given Ascomycete species, given the number of S. cerevisiae HP genes, was compared against the whole-genome conserved proportion using a χ² or Fisher exact test (depending on sample size), with the null hypothesis of identical distribution. All findings of significance were reiterated using a Z test for difference of proportions. Where necessary, P values were corrected for multiple testing using the Bonferroni correction. Cell cycle and DNA damage repair pathways were obtained from the KEGG pathway database [53].
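For readers unfamiliar with these tests, a two-sided Fisher exact test on a 2×2 table and a Bonferroni correction can be sketched with the Python standard library alone. This is an illustration only: the study's own analysis scripts are not described here, the function names are hypothetical, and the counts in the example are invented.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented counts: rows = HP / non-HP genes, columns = ortholog / none.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(p, bonferroni([p, 0.01, 0.04]))
```

The small tolerance in the comparison guards against floating-point ties between equally probable tables.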

Expression data for S. cerevisiae genes were obtained from the Saccharomyces Genome Database [54] and protein expression levels from [55]. A list of human cancer genes/oncogenes was obtained from the Cancer Gene Index [17]; enrichment of HP genes amongst the orthologs was determined using a χ² test as above. CNV incidence across eight tumour types (breast invasive carcinoma, rectum adenocarcinoma, colon adenocarcinoma, kidney renal clear cell carcinoma, uterine corpus endometrioid carcinoma, glioblastoma multiforme, acute myeloid leukemia, lung adenocarcinoma, lung squamous cell carcinoma, serous cystadenocarcinoma), as measured by comparative genomic hybridisation, was obtained from the NCI Cancer Genome Atlas online data browser [17], with a copy number (log2 ratio) of magnitude >0.5 taken as the significance threshold. Details of the sampling and analysis of the tumour samples are described in [17]. A P-value for HP ortholog overrepresentation was calculated using a χ² test. The TCGA database was also used to perform a pathway search for overrepresentation of HP orthologs.
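The thresholding rule described above (a log2 copy-number ratio of magnitude greater than 0.5 counts as an alteration) can be written as a short sketch. The function names and the example ratios are hypothetical, and the way incidence is aggregated across samples is an assumption rather than the data browser's documented behaviour.

```python
from typing import Sequence


def call_cnv(log2_ratio: float, threshold: float = 0.5) -> str:
    """Classify one log2 copy-number ratio as gain, loss, or neutral."""
    if log2_ratio > threshold:
        return "gain"
    if log2_ratio < -threshold:
        return "loss"
    return "neutral"


def cnv_incidence(log2_ratios: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples in which the gene is gained or lost."""
    calls = [call_cnv(x, threshold) for x in log2_ratios]
    return sum(c != "neutral" for c in calls) / len(calls)


# Invented log2 ratios for one gene across four tumour samples.
ratios = [0.8, -0.6, 0.1, 0.0]
print([call_cnv(r) for r in ratios], cnv_incidence(ratios))
```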

Yeast strains

In total, 30 HP genes were chosen for analysis, based upon the criteria discussed in the Results above. The heterozygous deletion mutant of each gene was obtained from the heterozygous diploid deletion library (Open Biosystems), in the BY4743 (MATa/α, his3Δ1/his3Δ1, leu2Δ0/leu2Δ0, LYS2/lys2Δ0, met15Δ0/MET15, ura3Δ0/ura3Δ0) genetic background. For non-essential genes, the homozygous deletant was retrieved from the analogous homozygous diploid deletion library (Open Biosystems).

Control strains were the BY4743 WT, along with the heterozygous deletion mutant of the non-functional his3 locus; the non-HP, non-cell cycle ho/HO heterozygous deletion strain; and the heterozygous deletion mutant of the non-HP, cell cycle gene HSL1. In addition, heterozygous deletion mutants of the G1 and G2 cyclins were included in several of the experiments. A complete list of the strains used is provided in Additional file 6: Table S6.

Cell-cycle profiling

Flow cytometric analysis of the deletion strains’ cell cycle profiles was carried out following the method of [56]. Briefly, 10^7 cells in mid-exponential phase were harvested, washed, and fixed in absolute ethanol at 4 °C overnight. Fixed cells were then collected, washed, and boiled for 15 minutes in 2 mg/mL RNAse in 50 mM Tris-Cl (pH 8), and incubated at 37 °C for 2–12 hours. Cells were resuspended in protease solution (5 mg/mL pepsin, 4.5 μL/mL concentrated HCl), incubated for 15 minutes at 37 °C and resuspended in 50 mM Tris (pH 7.5). For analysis, 50 μL of cell suspension was added to 1 mL of 1 mM Sytox Green in 50 mM Tris (pH 7.5), vortexed and analysed using a CyAn flow cytometer (Beckman Coulter). FlowJo (Tree Star) analysis software was used to fit histograms to the peaks representing 1C and 2C DNA content, and thereby calculate the number of cells in the G1 and G2 phases, and infer the number in S phase from the remaining fraction of the population.

Chronological lifespan assay

Cultures were inoculated from frozen stocks, grown overnight in YPD at 30 °C, and 200 μL of each was transferred into a well of a 96-well microtiter plate (Corning). Strains were present in duplicate on each plate, with a buffer of WT in the wells around the edge of the plate, so edge effects would not impact test colony measurements. A Singer Rotor HDA colony pinning robot was used to spot four replicates of each well onto a YPD + 10 μg/mL phloxine B (Sigma) plate. Phloxine B is a fluorescein derivative taken up when the cell membrane is disrupted upon cell death [57]. Plates were incubated for 48 hours at 30 °C and photographed using an Epson 1240 Scanner. The colony images were analysed using a custom image-analysis code written in MatLab, with colony size measured by pixel count, and fraction of dead cells by the intensity of colony redness [10]. Since these parameters are independent, this allowed the dissection of the effect of cell viability upon colony growth from that of growth rate variation. The 96-well liquid cultures were incubated at 30 °C, and, every second day over a period of three weeks, the colony-pinning onto YPD + phloxine B and image analysis were repeated. For each plate, the median culture intensity for each strain was compared with the growth of the WT on that plate, and also with the strain growth and viability after the initial 48-hour period. The experiment was performed twice.

At several points throughout the 3-week period, several strains were selected at random, and viability assayed by performing serial dilutions and counting colony-forming units. These results were checked for compatibility with the microplate viability results.

Apoptosis assays

The rate of occurrence of apoptosis in the different strain populations was measured in two ways. Apoptosis was first induced by pretreating cells with 0.001% or 0.01% MMS, or with 0.0001% or 0.001% TBHP, in overnight culture, keeping a negative, non-induced WT control sample.

The translocation of phosphatidyl serine to the cell surface, a marker of apoptosis [58], was measured using an Annexin V-FITC Apoptosis Detection kit (Sigma). Cells were harvested, washed in 1.2 M sorbitol, 0.5 mM MgCl2, 35 mM K phosphate (pH 6.8) and then digested in 5.5% glusulase (Sigma) and 15 U/mL lyticase (Sigma) for 2 hours at 28 °C. Spheroplasts were harvested, washed in binding buffer (10 mM Hepes/NaOH pH 7.4, 140 mM NaCl, 2.5 mM CaCl2 in 1.2 M sorbitol buffer) and resuspended in binding buffer/sorbitol. 5 μL of FITC-labelled annexin V, and 10 μL of 10010 mg/mL propidium iodide were added to each sample, with control samples containing (1) no label, (2) FITC-annexin V only, and (3) PI only. Fluorescence was quantified using a CyAn (Beckman Coulter). Gates were fitted on the basis of the control samples, dividing a log PI versus log FITC plot into four quadrants: lower left (neither FITC nor PI-stained), viable cells; upper left (PI stain only), necrotic cells; lower right (FITC only), early apoptotic cells; and upper right (PI and FITC-stained), late apoptotic cells. FlowJo software (TreeStar) was used to count the fraction of the total cell population in each quadrant. The proportion of both necrotic and apoptotic cells for each strain was normalised to strain viability (i.e. on the basis of the proportion of cells assigned to the lower-left FITC/PI quadrant), and the ratio of necrotic:apoptotic cells calculated. Ratios for each strain were normalised to the WT value, and the standard deviation across all samples calculated. Strains having a necrosis:apoptosis ratio further than 1.5x this standard deviation from WT levels were deemed to demonstrate abnormal apoptosis rates.
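The normalisation and flagging steps at the end of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: early and late apoptotic quadrants are pooled into one apoptotic fraction, the sample standard deviation is used, and all names and quadrant percentages are invented.

```python
from statistics import stdev
from typing import Dict, List

Quadrants = Dict[str, float]


def necrosis_apoptosis_ratio(q: Quadrants) -> float:
    """Necrotic:apoptotic ratio after normalising both to viability."""
    necrotic = q["necrotic"] / q["viable"]
    apoptotic = (q["early_apoptotic"] + q["late_apoptotic"]) / q["viable"]
    return necrotic / apoptotic


def flag_abnormal(
    strains: Dict[str, Quadrants], wt: str = "WT", k: float = 1.5
) -> List[str]:
    """Strains whose WT-normalised ratio lies more than k SDs from WT."""
    ratios = {s: necrosis_apoptosis_ratio(q) for s, q in strains.items()}
    relative = {s: r / ratios[wt] for s, r in ratios.items()}
    sd = stdev(list(relative.values()))
    return sorted(
        s for s, r in relative.items() if s != wt and abs(r - 1.0) > k * sd
    )


# Invented quadrant percentages for WT and three deletion strains.
base = {"viable": 80, "necrotic": 5, "early_apoptotic": 5, "late_apoptotic": 10}
data = {
    "WT": base,
    "A": dict(base),
    "B": dict(base),
    "C": {"viable": 65, "necrotic": 30, "early_apoptotic": 2, "late_apoptotic": 3},
}
print(flag_abnormal(data))
```

Because both fractions are divided by the same viability value, the normalisation cancels in the ratio; it is kept here only to mirror the order of steps in the text.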

Growth rate and drug sensitivity assays

Growth and drug sensitivity assays were performed both on solid media and in liquid cultures. For solid assays, the required drug concentration was added to YPD-agar containing 10 μg/mL phloxine B. Overnight cultures of the strains were spotted onto the (drug-containing) plates using a Singer Rotor, as above. Plates were incubated at 30 °C, photographed at 24 and 48 hours, and analysed using the image-processing code described above. Strain growth and viability were compared both with WT growth on the same plate, and with growth on YPD-agar (or YPD-agar plus DMSO, where the drug is DMSO-soluble). The ratio of viability and size with and without drug was calculated for every strain on a plate, and the standard deviation of all ratios calculated. Strains having a drug:untreated ratio more than two standard deviations above or below that of the WT were deemed to be resistant or sensitive, respectively.

Assays in liquid culture were performed by transferring 5 μL of overnight culture into each well of a 96-well microtitre plate, containing 200 μL of YPD plus the required concentration of drug. Absorbance was measured for 30 hours at 30 °C using a BMG Optima platereader, the maximum growth rate was calculated using a curve-fitting script written in R, and the growth rate for each strain was compared with that of the WT in the same plate, and with growth in YPD/YPD + DMSO.
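One common way to estimate a maximum growth rate from an absorbance time course is to take the steepest slope of ln(OD) over a sliding window. The sketch below illustrates that approach in Python; it is not the R curve-fitting script used in the study, whose model is not described here. The window size, function names, and synthetic data are assumptions.

```python
from math import exp, log
from typing import Sequence


def slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def max_growth_rate(
    times_h: Sequence[float], od: Sequence[float], window: int = 4
) -> float:
    """Largest slope of ln(OD) versus time over any window of points."""
    ln_od = [log(v) for v in od]
    return max(
        slope(times_h[i:i + window], ln_od[i:i + window])
        for i in range(len(od) - window + 1)
    )


# Synthetic curve: exponential growth at 0.4 per hour, then a plateau.
times = list(range(10))
od_values = [0.05 * exp(0.4 * min(t, 5)) for t in times]
print(max_growth_rate(times, od_values))
```

Comparing each strain's rate with that of the WT on the same plate, as described above, then reduces to taking the ratio or difference of two such estimates.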




Methods

Haploproficient genes and orthology analysis

The set of S.cerevisiae genes which are haploproficient in turbidostat culture was obtained using the growth data of [8] and an FDR cutoff of 0.02. This stringent FDR cut-off rigorously defines those genes for which heterozygosity confers a strong fitness advantage, but has no effect on the functional enrichment of genes identified as haploproficient. Genes defined as ‘haploproficient’ for the purposes of this study are listed in Additional file 1: Table S1. The set of chromosome maintenance-associated HP genes described in [8] overlaps, but is not coincident, with the HPGI set studied here, since the current set also includes DNA damage-response genes.

Orthology assignments were made using the InParanoid algorithm [50] and compared with the results of a BLAST [51] reciprocal best-hits search. GO enrichment searches were performed using the Babelomics 4 FatiGO tool [52]. To assess the significance of HP gene conservation, the number of HP genes having orthologs in a given Ascomycete species, given the number of S. cerevisiae HP genes, was compared against the whole-genome conserved proportion using a χ 2 or Fisher exact test (depending on sample size), with the null hypothesis of identical distribution. All findings of significance were reiterated using a Z test for difference of proportions. Where necessary, P values were corrected for multiple testing using the Bonferroni correction. Cell cycle and DNA damage repair pathways were obtained from the KEGG pathway database [53].

Expression data for S.cerevisae genes was obtained from the Saccharomyces Genome Database [54] and protein expression levels from [55]. A list of human cancer genes/oncogenes was obtained from the Cancer Gene Index [17] enrichment of HP genes amongst the orthologs was determined using a χ 2 test as above. CNV incidence across eight tumour types (breast invasive carcinoma, rectum adencarcinoma colon adenocarcinoma, kidney renal cell clear carcinoma, uterine corpus endometrioid carcinoma, glioblastoma multiforme, acute myeloid leukemia, lung adenocarcinoma, lung squamous cell carcinoma, serous cystadenocarcinoma) as measured by comparative genomic hybridisation, was obtained from the NCI Cancer Genome Atlas online data browser [17] with a copy number (log2 ratio) of magnitude Ϡ.5 taken as the significance threshold. Details of the sampling and analysis of the tumour samples are described in [17]. A P-value for HP ortholog overrepresentation was calculated using a χ 2 test .The TGCA database was also used to perform a pathway search for overrepresentation of HP orthologs.

Yeast strains

In total, 30 HP genes were chosen for analysis, based upon the criteria discussed in the Results above. The heterozygous deletion mutant of each gene was obtained from the heterozygous diploid deletion library (Open Biosystems), in the BY4743 (MATa/α, his3Δ1/his3Δ1, leu2Δ0/leu2Δ0, LYS2/lys2Δ0, met15Δ0/MET15, ura3Δ0/ura3Δ0) genetic background. For non-essential genes, the homozygous deletant was retrieved from the analogous homozygous diploid deletion library (Open Biosystems).

Control strains were the BY4743 WT, along with the heterozygous deletion mutant of the non-functional his3 locus; the non-HP, non-cell cycle ho/HO heterozygous deletion strain; and the heterozygous deletion mutant of the non-HP, cell cycle gene HSL1. In addition, heterozygous deletion mutants of the G1 and G2 cyclins were included in several of the experiments. A complete list of the strains used is provided in Additional file 6: Table S6.

Cell-cycle profiling

Flow cytometric analysis of the deletion strains’ cell cycle profiles was carried out following the method of [56]. Briefly, 10⁷ cells in mid-exponential phase were harvested, washed, and fixed in absolute ethanol at 4 °C overnight. Fixed cells were then collected, washed, and boiled for 15 minutes in 2 mg/mL RNAse in 50 mM Tris-Cl (pH 8), and incubated at 37 °C for 2� hours. Cells were resuspended in protease solution (5 mg/mL pepsin, 4.5 μL/mL concentrated HCl), incubated for 15 minutes at 37 °C and resuspended in 50 mM Tris (pH 7.5). For analysis, 50 μL of cell suspension was added to 1 mL of 1 μM Sytox Green in 50 mM Tris (pH 7.5), vortexed and analysed using a CyAn flow cytometer (Beckman Coulter). FlowJo (Tree Star) analysis software was used to fit histograms to the peaks representing 1C and 2C DNA content, and thereby calculate the number of cells in the G1 and G2 phases, and infer the number in S phase from the remaining fraction of the population.
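The final step, inferring S phase from what is left after the 1C and 2C peaks are fitted, amounts to simple arithmetic. The sketch below assumes the fitted peak counts are already available (in the study they came from FlowJo); the function name and numbers are hypothetical.

```python
def cell_cycle_fractions(total_events, g1_events, g2_events):
    """Return G1, S and G2 fractions, taking S as the remainder of the population."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    s_events = total_events - g1_events - g2_events
    if min(g1_events, g2_events, s_events) < 0:
        raise ValueError("peak counts are inconsistent with the total")
    return {
        "G1": g1_events / total_events,
        "S": s_events / total_events,
        "G2": g2_events / total_events,
    }


# Invented counts: 10,000 events, 4,500 under the 1C peak and 3,500 under the 2C peak.
profile = cell_cycle_fractions(10_000, 4_500, 3_500)
```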

Chronological lifespan assay

Cultures were inoculated from frozen stocks, grown overnight in YPD at 30 °C, and 200 μL of each was transferred into a well of a 96-well microtiter plate (Corning). Strains were present in duplicate on each plate, with a buffer of WT in the wells around the edge of the plate, so that edge effects would not affect test colony measurements. A Singer Rotor HDA colony pinning robot was used to spot four replicates of each well onto a YPD + 10 μg/mL phloxine B (Sigma) plate. Phloxine B is a fluorescein derivative taken up when the cell membrane is disrupted upon cell death [57]. Plates were incubated for 48 hours at 30 °C and photographed using an Epson 1240 Scanner. The colony images were analysed using custom image-analysis code written in MATLAB, with colony size measured by pixel count, and fraction of dead cells by the intensity of colony redness [10]. Since these parameters are independent, this allowed the effect of cell viability upon colony growth to be separated from that of growth rate variation. The 96-well liquid cultures were incubated at 30 °C, and, every second day over a period of three weeks, the colony pinning onto YPD + phloxine B and the image analysis were repeated. For each plate, the median culture intensity for each strain was compared with the growth of the WT on that plate, and also with the strain growth and viability after the initial 48-hour period. The experiment was performed twice.

At several points throughout the 3-week period, several strains were selected at random, and viability assayed by performing serial dilutions and counting colony-forming units. These results were checked for compatibility with the microplate viability results.

Apoptosis assays

The rate of occurrence of apoptosis in the different strain populations was measured in two ways. Apoptosis was first induced by pretreating cells with 0.001% or 0.01% MMS, or with 0.0001% or 0.001% TBHP, in overnight culture, keeping a negative, non-induced WT control sample.

The translocation of phosphatidylserine to the cell surface, a marker of apoptosis [58], was measured using an Annexin V-FITC Apoptosis Detection kit (Sigma). Cells were harvested, washed in 1.2 M sorbitol, 0.5 mM MgCl2, 35 mM potassium phosphate (pH 6.8) and then digested in 5.5% glusulase (Sigma) and 15 U/mL lyticase (Sigma) for 2 hours at 28 °C. Spheroplasts were harvested, washed in binding buffer (10 mM Hepes/NaOH pH 7.4, 140 mM NaCl, 2.5 mM CaCl2 in 1.2 M sorbitol buffer) and resuspended in binding buffer/sorbitol. 5 μL of FITC-labelled annexin V and 10 μL of 10010 mg/mL propidium iodide (PI) were added to each sample, with control samples containing (1) no label, (2) FITC-annexin V only, and (3) PI only. Fluorescence was quantified using a CyAn flow cytometer (Beckman Coulter). Gates were fitted on the basis of the control samples, dividing a log PI versus log FITC plot into four quadrants: lower left (neither FITC- nor PI-stained) – viable cells; upper left (PI stain only) – necrotic cells; lower right (FITC only) – early apoptotic cells; and upper right (PI- and FITC-stained) – late apoptotic cells. FlowJo software (Tree Star) was used to count the fraction of the total cell population in each quadrant. The proportion of both necrotic and apoptotic cells for each strain was normalised to strain viability (i.e. on the basis of the proportion of cells assigned to the lower-left FITC/PI quadrant), and the ratio of necrotic:apoptotic cells calculated. Ratios for each strain were normalised to the WT value, and the standard deviation across all samples calculated. Strains having a necrosis:apoptosis ratio further than 1.5× this standard deviation from WT levels were deemed to demonstrate abnormal apoptosis rates.
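The gating and flagging logic can be sketched as below. Several choices here are assumptions rather than details given in the text: early and late apoptotic cells are pooled, the sample standard deviation is used, and all names and numbers are hypothetical. Because the necrotic and apoptotic proportions are both divided by the same viable fraction, that normalisation cancels in their ratio, so the sketch works from raw quadrant counts.

```python
import statistics


def classify_event(fitc, pi, fitc_gate, pi_gate):
    """Assign one event to a quadrant of the PI versus FITC plot."""
    fitc_pos = fitc > fitc_gate
    pi_pos = pi > pi_gate
    if fitc_pos and pi_pos:
        return "late_apoptotic"   # upper right
    if fitc_pos:
        return "early_apoptotic"  # lower right
    if pi_pos:
        return "necrotic"         # upper left
    return "viable"               # lower left


def necrosis_apoptosis_ratio(counts):
    """Necrotic:apoptotic ratio from quadrant counts (early and late apoptotic pooled)."""
    apoptotic = counts["early_apoptotic"] + counts["late_apoptotic"]
    if apoptotic == 0:
        raise ValueError("no apoptotic events; ratio is undefined")
    return counts["necrotic"] / apoptotic


def flag_abnormal(ratios, wt="WT", k=1.5):
    """Strains whose WT-normalised ratio lies more than k standard deviations from WT."""
    normalised = {strain: r / ratios[wt] for strain, r in ratios.items()}
    sd = statistics.stdev(list(normalised.values()))
    return {s for s, r in normalised.items() if s != wt and abs(r - 1.0) > k * sd}


# Invented ratios for a WT control and three deletion strains.
flagged = flag_abnormal({"WT": 2.0, "a": 2.2, "b": 1.8, "c": 6.0})
```

One consequence of taking the standard deviation across all samples is that a single extreme strain widens the cutoff for every other strain on the same run.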

Growth rate and drug sensitivity assays

Growth and drug sensitivity assays were performed both on solid media and in liquid cultures. For solid assays, the required drug concentration was added to YPD-agar containing 10 μg/mL phloxine B. Overnight cultures of the strains were spotted onto the (drug-containing) plates using a Singer Rotor, as above. Plates were incubated at 30 °C, photographed at 24 and 48 hours, and analysed using the image-processing code described above. Strain growth and viability were compared both with WT growth on the same plate, and with growth on YPD-agar (or YPD-agar plus DMSO, where the drug is DMSO-soluble). The ratio of viability and size with and without drug was calculated for every strain on a plate, and the standard deviation of all ratios calculated. Strains whose drug:untreated ratio was more than two standard deviations above or below that of the WT were deemed to be resistant or sensitive, respectively.
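The resistant/sensitive call can be sketched as follows, assuming one measurement per strain (colony size or viability) with and without drug on the same plate. The function name, the use of the sample standard deviation, and the example values are assumptions for illustration.

```python
import statistics


def classify_drug_response(treated, untreated, wt="WT", k=2.0):
    """Call each strain resistant, sensitive or unchanged relative to WT.

    `treated` and `untreated` map strain name to a measurement taken
    with and without drug; the cutoff is k standard deviations of all
    drug:untreated ratios on the plate.
    """
    ratios = {strain: treated[strain] / untreated[strain] for strain in treated}
    sd = statistics.stdev(list(ratios.values()))
    upper = ratios[wt] + k * sd
    lower = ratios[wt] - k * sd
    calls = {}
    for strain, ratio in ratios.items():
        if strain == wt:
            continue
        if ratio > upper:
            calls[strain] = "resistant"
        elif ratio < lower:
            calls[strain] = "sensitive"
        else:
            calls[strain] = "unchanged"
    return calls


# Invented colony sizes (pixels) on a drug plate and on YPD alone.
untreated = {s: 100 for s in ("WT", "a", "b", "c", "d", "e")}
treated = {"WT": 80, "a": 82, "b": 78, "c": 79, "d": 81, "e": 10}
calls = classify_drug_response(treated, untreated)
```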

Assays in liquid culture were performed by transferring 5 μL of overnight culture into each well of a 96-well microtitre plate containing 200 μL of YPD plus the required concentration of drug. Absorbance was measured for 30 hours at 30 °C using a BMG Optima plate reader, the maximum growth rate was calculated using a curve-fitting script written in R, and the growth rate for each strain was compared with that of the WT in the same plate, and with growth in YPD/YPD + DMSO.
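The authors' R curve-fitting script is not reproduced in the text. As a stand-in, the sketch below estimates the maximum growth rate as the steepest slope of ln(absorbance) against time over a sliding window, which is one common approach and not necessarily the one used in the study. It assumes blank-subtracted, strictly positive absorbance readings at distinct time points; the function name and window size are hypothetical.

```python
import math


def max_growth_rate(times, absorbances, window=5):
    """Largest least-squares slope of ln(absorbance) vs time over any window of points."""
    if len(times) != len(absorbances):
        raise ValueError("times and absorbances must have the same length")
    if len(times) < window or window < 2:
        raise ValueError("need at least `window` points and window >= 2")
    logs = [math.log(a) for a in absorbances]
    best = float("-inf")
    for i in range(len(times) - window + 1):
        t = times[i:i + window]
        y = logs[i:i + window]
        t_mean = sum(t) / window
        y_mean = sum(y) / window
        sxx = sum((ti - t_mean) ** 2 for ti in t)
        sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
        best = max(best, sxy / sxx)
    return best


# Synthetic curve growing exponentially at 0.4 per hour, sampled hourly.
hours = list(range(10))
od = [0.01 * math.exp(0.4 * h) for h in hours]
rate = max_growth_rate(hours, od)
```

Dividing a strain's rate by the WT rate from the same plate gives the relative growth rate used for comparison across plates.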


2. Methods

This section proposes an expanded graph database model that includes the gene expression, miRNA expression, DNA methylation, copy number gain and loss information, tissue slide information, and mutation data from TCGA. It also outlines the steps performed to create the proposed graph database model.

2.1. Data

For this study, we have specifically added copy number information, miRNA expression, and tissue slide image information to the previously stored clinical information, gene expression (log2 counts per million), hyper- and hypomethylation information, and missense mutation data from the Genomic Data Commons (GDC) for breast cancer (BRCA), prostate adenocarcinoma (PRAD), and pancreatic adenocarcinoma (PAAD). Table 1 shows the summary information about the data set used for this study.