Latest commit: gwaybio · Merge pull request #41 from gwaygenomics/final-pipeline · 84c6079 · Jul 29, 2017 · 157 Commits
Name	Last commit message	Last commit date
config	adding kappa to parameter sweep	Jul 24, 2017
data	add encoded clinical data	Jul 29, 2017
figures	add weight exploration figures and notebook	Jul 29, 2017
models	remove original vae pancan script and old hdf5 models	Jul 29, 2017
param_sweep	add parameter sweep results	Jul 29, 2017
results	add git lfs results files	Jul 29, 2017
scripts	add nbconvert distance script	Jul 29, 2017
.gitattributes	add git lfs results files	Jul 29, 2017
.gitignore	ignore pycache	Jul 29, 2017
LICENSE.md	add license file	Jul 17, 2017
README.md	update readme for pipeline instructions	Jul 29, 2017
download_data.sh	add copy number download instructions	Jul 14, 2017
environment.yml	add pmacs cluster env	Jul 25, 2017
explore_weights.ipynb	save background genes as txt file	Jul 29, 2017
get_distance.ipynb	fix reference to GBM in distance script	Jul 29, 2017
pancan_vae_keras_onehidden_warmup_batchnorm.ipynb	update keras model removing tsne	Jul 29, 2017
param_sweep.sh	add note in parameter sweep	Jul 29, 2017
parameter_sweep.md	add parameter sweep markdown explanation	Jul 25, 2017
process_data.ipynb	updating processing data notebook	Jul 14, 2017
run_pipeline.sh	adding pipeline script	Jul 29, 2017
subtraction.ipynb	update cancer-type mean subtraction notebook	Jul 28, 2017
tsne_vae.ipynb	add tsne notebook	Jul 29, 2017

Variational Autoencoder - Pan Cancer

Gregory Way and Casey Greene 2017

This repository stores scripts to train, evaluate, and extract knowledge from a variational autoencoder trained on 33 cancer types from The Cancer Genome Atlas (TCGA).

The Data

TCGA has collected many kinds of genomic measurements from over 10,000 tumors spanning 33 cancer types. In this repository, we extract cancer signatures from gene expression data (RNA-seq).

The RNA-seq data serve as a measurement describing the high-dimensional state of each tumor. As a highly heterogeneous disease, cancer exists in several different combinations of states. Our goal is to extract these states using high-capacity models capable of identifying common signatures in gene expression data across cancer types.

The Model

We present a variational autoencoder (VAE) applied to cancer gene expression data. A VAE is a deep generative model introduced by Kingma and Welling in 2013. The model has two direct benefits for modeling cancer gene expression data:

Automatically engineering non-linear features
Learning the reduced-dimension manifold of cancer expression space

Because the VAE is a generative model, its reduced-dimension features can be sampled to simulate data. The manifold can also be interpolated to interrogate trajectories and transitions between states.
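
As a rough illustration of sampling and interpolation in the encoded space, the sketch below uses only NumPy. It assumes a standard normal prior over 100 latent features; the `interpolate` helper and all variable names are hypothetical and do not come from the repository. In practice, each latent point would be passed through the trained decoder to obtain simulated expression values.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100  # matches the 100 encoded features described in this README

# Sample new points from a standard normal prior over the latent space (assumed).
z_samples = rng.standard_normal(size=(5, latent_dim))


def interpolate(z_start, z_end, steps=5):
    """Return evenly spaced points on the straight line from z_start to z_end."""
    alphas = np.linspace(0.0, 1.0, steps)[:, np.newaxis]
    return (1.0 - alphas) * z_start + alphas * z_end


# A path between two latent points; decoding each row would trace a
# trajectory between the two corresponding states.
path = interpolate(z_samples[0], z_samples[1], steps=5)
print(path.shape)  # (5, 100)
```

Linear interpolation is the simplest choice here; other paths through the manifold are possible.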

VAEs have typically been applied to image data, where they have demonstrated remarkable generative capacity and modeling flexibility. VAEs differ from deterministic autoencoders in that they constrain the feature activations of each sample to be normally distributed. This constraint not only regularizes the model but also provides an interpretable manifold.
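
The normality constraint is commonly imposed through a Kullback-Leibler (KL) divergence penalty between each sample's encoded distribution and a standard normal. The sketch below computes that penalty with NumPy. It assumes the encoder outputs a mean and a log-variance per feature, which is a common parameterization but is not confirmed by this README; the function name is hypothetical.

```python
import numpy as np


def kl_to_standard_normal(z_mean, z_log_var):
    """KL(N(mean, exp(log_var)) || N(0, 1)), summed over features, per sample."""
    return -0.5 * np.sum(
        1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1
    )


# When the encoding already equals the prior, the penalty vanishes.
at_prior = kl_to_standard_normal(np.zeros((3, 100)), np.zeros((3, 100)))

# Shifting every mean by 1 adds 0.5 per feature to the penalty.
shifted = kl_to_standard_normal(np.ones((3, 100)), np.zeros((3, 100)))
```

The penalty grows as encodings drift away from the prior, which is the regularizing effect described above.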

Below is a t-SNE visualization of the VAE encoded features (p = 100) for all tumors.

Training

The current model training is explained in this notebook

For a complete pipeline with reproducibility instructions, refer to run_pipeline.sh. Note that scripts originally written in Jupyter notebooks were ported to the scripts folder for pipeline purposes with:

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb

Architecture

We select the 5,000 most variably expressed genes by median absolute deviation. We compress each sample's 5,000-dimensional gene expression vector into two vectors of length 100: one representing the mean and the other the variance. Together, these vectors define a distribution that can be sampled to generate data from an approximation of the data-generating function. The sampled hidden layer is then reconstructed back to the original dimensions. We use batch normalization and ReLU activation layers in the compression steps to prevent dead nodes and positive weights. We use a sigmoid activation in the decoder. We use the Keras library with a TensorFlow backend for training.
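
Two of the steps above, gene selection by median absolute deviation (MAD) and sampling from the encoded mean and variance, can be sketched in plain NumPy. This is an illustration on toy data, not the repository's Keras code: the function names are hypothetical, the matrix sizes are shrunk for readability, and the use of a log-variance (rather than a variance) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)


def top_mad_columns(x, n_keep):
    """Indices of the n_keep columns with the largest median absolute deviation."""
    med = np.median(x, axis=0)
    mad = np.median(np.abs(x - med), axis=0)
    return np.sort(np.argsort(mad)[::-1][:n_keep])


def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + sigma * epsilon, epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon


# Toy stand-in for the expression matrix: 30 samples by 200 genes, with a
# different spread per gene so that the MAD ranking is meaningful.
expression = rng.random((30, 200)) * rng.random(200)
keep = top_mad_columns(expression, n_keep=50)  # the README keeps 5,000 genes
selected = expression[:, keep]

# Stand-ins for the encoder outputs: a mean and a log-variance vector per sample.
z_mean = np.zeros((30, 100))
z_log_var = np.zeros((30, 100))
z = sample_latent(z_mean, z_log_var, rng)
print(selected.shape, z.shape)  # (30, 50) (30, 100)
```

Writing the sample as a deterministic function of the mean, the variance, and external noise is what lets gradients flow through the sampling step during training.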

Parameter sweep

To select optimal parameters for the model, we ran a parameter search over a small grid. See parameter_sweep.md for more details. Overall, we selected a learning rate of 0.0005, a batch size of 50, and 100 epochs. With these parameters, training behaved similarly on the training set and on a 10% test set at each epoch.
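
A small grid like the one described can be enumerated with the standard library. In the sketch below, only the selected values (0.0005, 50, and 100) come from this README; the other grid values and the `expand_grid` helper are hypothetical, and the actual grid is documented in parameter_sweep.md.

```python
from itertools import product

# Hypothetical grid: only 0.0005, 50, and 100 are taken from the README.
grid = {
    "learning_rate": [0.0005, 0.001],
    "batch_size": [50, 100],
    "epochs": [50, 100],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(grid)
print(len(combos))  # 8
```

Each resulting dictionary could then be passed to a single training run, for example as command-line arguments to a training script.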

Model Evaluation

After training with optimal hyperparameters, the unsupervised model can be interpreted. For example, the distribution of activations across nodes can be visualized; the first 10 nodes (of 100) can be shown by their sample activation patterns.

In this scenario, each node activation pattern contributes uniquely to each tumor and may represent specific gene expression signatures of biological significance.
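
One way to inspect activation distributions before plotting is to bin them per node. The sketch below does this with NumPy on random stand-in data; the `encoded` matrix, its size, and the `node_histograms` helper are hypothetical, and real activations would come from the trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoded matrix: rows are tumors, columns are the 100 nodes.
encoded = rng.standard_normal(size=(200, 100))


def node_histograms(activations, bins=20):
    """Histogram counts per node (one row each) on a shared set of bin edges."""
    edges = np.linspace(activations.min(), activations.max(), bins + 1)
    counts = np.stack(
        [np.histogram(activations[:, j], bins=edges)[0]
         for j in range(activations.shape[1])]
    )
    return counts, edges


first_nodes = encoded[:, :10]  # the first 10 nodes, as in the text above
counts, edges = node_histograms(first_nodes)
print(counts.shape)  # (10, 20)
```

Sharing bin edges across nodes keeps the distributions directly comparable when they are plotted side by side.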
