Summary

This script illustrates a simple differential gene expression (DGE) analysis with DESeq2:

Importing Salmon counts into Summarized Experiment R object
Reading Summarized Experiment object into DESeq DataSet object
Exploring data before the differential expression analysis
Detecting differentially expressed genes with DESeq2
Exporting and exploring the results

The results are exported into a text file and visualized using heatmaps and volcano-plots.

Please note that this is only an introduction into DGE analysis.
A detailed tutorial for DGE analysis with DESEq2 package could be found here:

http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Start section

# Packages should be installed only once. 
# They have been pre-installed. 
# Hence these lines are commented. 

# CRAN packages
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("pheatmap")
# install.packages("BiocManager")

# Bioconductor packages
# BiocManager::install("tximeta")
# BiocManager::install("DESeq2")
# BiocManager::install("org.Hs.eg.db")
# BiocManager::install('EnhancedVolcano')

# Time stamp
Sys.time()

## [1] "2024-06-26 16:02:43 BST"

# Clean environment
rm(list=ls())
graphics.off()

1. Read Salmon counts with tximeta

tximeta stands for “Transcripts Import with Metadata”. This package allows a user to import Salmon counts into a data structure called Summarized Experiment (SE). Importantly, tximeta can detect the version of human genome resources (genome, transcriptome and annotation) used by Salmon during the transcripts count. Of course, tximeta can only recognize commonly used human (or mouse) genome resources from Gencode, Ensembl or Refseq. For other species or customized genome resources you may use tximport package, or even construct the Summarized Experiment object manually.

Luckily, we used a compatible Gencode version of resources (Human Gencode v45) during the Salmon count. So, we may use tximeta for data import. In addition to recognizing the transcripts names, tximeta will be able to aggregate the transcripts into genes, and even will add the genes/transcripts genomic coordinates to the produced Summarized Experiment (making a Ranged Summarized Experiment).

You may find detailed tutorials for using Summarised Experiment and tximeta here:

1.1 Read Salmon transcript counts

Note that after loading tximeta we will to check its version. This is needed because older versions of tximeta may be out of sync with recent versions of genomic resources. tximeta v.1.20.+ is already OK for the GENCODE v.45

The first time when tximeta calls for GENCODE v45, it downloads the resources from the internet and performs some necessary configuration (e.g. asks user where to store the files etc). This needs to be done only once. To save time, all required resources had already been downloaded to the VM.

# Load tximeta library
library(tximeta)
packageVersion("tximeta")

## [1] '1.20.3'

# Set working folder 
# Note that when rendering RMD, setwd() may not persist beyond the current chunk:
# https://bookdown.org/yihui/rmarkdown-cookbook/working-directory.html#working-directory

#base_folder="/home/ubuntu/Documents/bulk-rnaseq/short-reads"
base_folder <- "C:\\Users\\alexe\\OneDrive - Cranfield University\\External\\ebi\\short_reads"
data_folder <- file.path(base_folder, "data", "salmon_counts")
setwd(data_folder)

# Read samples information
coldata <- read.table("samples.txt", header=T, sep = "\t")
coldata

##                 files names        SRA     Depth     CCLE            condition
## 1 SRR8615612/quant.sf   AC1 SRR8615612  83450360 NCIH1373       Adenocarcinoma
## 2 SRR8616105/quant.sf   AC2 SRR8616105  84706828   NCIH23       Adenocarcinoma
## 3 SRR8615805/quant.sf   AC3 SRR8615805 109186348  NCIH358       Adenocarcinoma
## 4 SRR8616002/quant.sf  SCC1 SRR8616002  79854702   DMS114 Small_Cell_Carcinoma
## 5 SRR8615428/quant.sf  SCC2 SRR8615428  77000229  CORL311 Small_Cell_Carcinoma
## 6 SRR8615609/quant.sf  SCC3 SRR8615609  60897689 NCIH1836 Small_Cell_Carcinoma

# Read Salmon counts to Summarized Experiment (SE) object
se <- tximeta(coldata) 

# Show summary of the SE object:
# it should be RangedSummarizedExperiment 
# with 252,048 transcripts and 6 samples
se

## class: RangedSummarizedExperiment 
## dim: 252048 6 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(252048): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2 ENST00000387461.2
## rowData names(3): tx_id gene_id tx_name
## colnames(6): AC1 AC2 ... SCC2 SCC3
## colData names(5): names SRA Depth CCLE condition

# Load SummarizedExperiment library:
# it provides core functions for handling SE objects 
library(SummarizedExperiment)

# Information about the genome
seqinfo(se)

## Seqinfo object with 25 sequences (1 circular) from hg38 genome:
##   seqnames seqlengths isCircular genome
##   chr1      248956422      FALSE   hg38
##   chr2      242193529      FALSE   hg38
##   chr3      198295559      FALSE   hg38
##   chr4      190214555      FALSE   hg38
##   chr5      181538259      FALSE   hg38
##   ...             ...        ...    ...
##   chr21      46709983      FALSE   hg38
##   chr22      50818468      FALSE   hg38
##   chrX      156040895      FALSE   hg38
##   chrY       57227415      FALSE   hg38
##   chrM          16569       TRUE   hg38

# Information about the samples
colData(se)

## DataFrame with 6 rows and 5 columns
##            names         SRA     Depth        CCLE            condition
##      <character> <character> <integer> <character>          <character>
## AC1          AC1  SRR8615612  83450360    NCIH1373       Adenocarcinoma
## AC2          AC2  SRR8616105  84706828      NCIH23       Adenocarcinoma
## AC3          AC3  SRR8615805 109186348     NCIH358       Adenocarcinoma
## SCC1        SCC1  SRR8616002  79854702      DMS114 Small_Cell_Carcinoma
## SCC2        SCC2  SRR8615428  77000229     CORL311 Small_Cell_Carcinoma
## SCC3        SCC3  SRR8615609  60897689    NCIH1836 Small_Cell_Carcinoma

# Assays: matrices with the counts
# (and other data relevant to counts)
assayNames(se)

## [1] "counts"    "abundance" "length"

# Information about transcripts:
# note that SE uses Ensembl transcript IDs (ENST...),
# also note the genomic coordinates (given in Rle/IRanges formats)
rowRanges(se)

## GRanges object with 252048 ranges and 3 metadata columns:
##                     seqnames      ranges strand |     tx_id           gene_id           tx_name
##                        <Rle>   <IRanges>  <Rle> | <integer>   <CharacterList>       <character>
##   ENST00000456328.2     chr1 11869-14409      + |         1 ENSG00000290825.1 ENST00000456328.2
##   ENST00000450305.2     chr1 12010-13670      + |         2 ENSG00000223972.6 ENST00000450305.2
##   ENST00000488147.2     chr1 14696-24886      - |     11638 ENSG00000227232.6 ENST00000488147.2
##   ENST00000619216.1     chr1 17369-17436      - |     11639 ENSG00000278267.1 ENST00000619216.1
##   ENST00000473358.1     chr1 29554-31097      + |         3 ENSG00000243485.5 ENST00000473358.1
##                 ...      ...         ...    ... .       ...               ...               ...
##   ENST00000361681.2     chrM 14149-14673      - |    252928 ENSG00000198695.2 ENST00000361681.2
##   ENST00000387459.1     chrM 14674-14742      - |    252929 ENSG00000210194.1 ENST00000387459.1
##   ENST00000361789.2     chrM 14747-15887      + |    252920 ENSG00000198727.2 ENST00000361789.2
##   ENST00000387460.2     chrM 15888-15953      + |    252921 ENSG00000210195.2 ENST00000387460.2
##   ENST00000387461.2     chrM 15956-16023      - |    252930 ENSG00000210196.2 ENST00000387461.2
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

1.2 Aggregate transcripts to genes

# Summarize to gene level 
# Use most abundant transcript to select the most relevant gene coordinates
gse <- summarizeToGene(se, assignRanges="abundant")

# Explore the result
# Note that GSE already uses genes IDs 
# (ENSG..., 62,729 genes)
gse

## class: RangedSummarizedExperiment 
## dim: 62749 6 
## metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
## assays(3): counts abundance length
## rownames(62749): ENSG00000000003.16 ENSG00000000005.6 ... ENSG00000293559.1 ENSG00000293560.1
## rowData names(5): gene_id tx_ids isoform max_prop iso_prop
## colnames(6): AC1 AC2 ... SCC2 SCC3
## colData names(5): names SRA Depth CCLE condition

rowRanges(gse)

## GRanges object with 62749 ranges and 5 metadata columns:
##                      seqnames              ranges strand |            gene_id                                                     tx_ids            isoform  max_prop                             iso_prop
##                         <Rle>           <IRanges>  <Rle> |        <character>                                            <CharacterList>        <character> <numeric>                        <NumericList>
##   ENSG00000000003.16     chrX 100627108-100636806      - | ENSG00000000003.16  ENST00000373020.9,ENST00000612152.4,ENST00000496771.5,...  ENST00000373020.9  0.687237    0.6872372,0.0025714,0.2824258,...
##    ENSG00000000005.6     chrX 100593624-100597531      + |  ENSG00000000005.6                        ENST00000373031.5,ENST00000485971.1  ENST00000485971.1  1.000000                                  0,1
##   ENSG00000000419.14    chr20   50934870-50958532      - | ENSG00000000419.14 ENST00000466152.5,ENST00000371582.8,ENST00000371588.10,... ENST00000371588.10  0.758189    0.0161384,0.0665779,0.0422139,...
##   ENSG00000000457.14     chr1 169853074-169893959      - | ENSG00000000457.14 ENST00000367771.11,ENST00000367770.5,ENST00000367772.8,...  ENST00000367772.8  0.515659       0.213536,0.000000,0.515659,...
##   ENSG00000000460.17     chr1 169795043-169804347      + | ENSG00000000460.17  ENST00000498289.5,ENST00000472795.5,ENST00000359326.9,...  ENST00000496973.5  0.383497 0.00838816,0.01530056,0.15528023,...
##                  ...      ...                 ...    ... .                ...                                                        ...                ...       ...                                  ...
##    ENSG00000293556.1     chr7 149776435-149783291      + |  ENSG00000293556.1                        ENST00000486824.3,ENST00000493502.3  ENST00000486824.3  0.921758                  0.9217577,0.0782423
##    ENSG00000293557.1     chr7 149824562-149833978      + |  ENSG00000293557.1  ENST00000461331.5,ENST00000488835.5,ENST00000492965.5,...  ENST00000472850.3  0.411791       0.000000,0.173459,0.342876,...
##    ENSG00000293558.1     chr7 149810915-149812635      + |  ENSG00000293558.1                                          ENST00000481772.3  ENST00000481772.3  1.000000                                    1
##    ENSG00000293559.1     chr7   99238911-99261636      + |  ENSG00000293559.1                                          ENST00000453194.2  ENST00000453194.2  1.000000                                    1
##    ENSG00000293560.1     chr7   99297912-99311130      + |  ENSG00000293560.1                        ENST00000452010.2,ENST00000429079.1  ENST00000452010.2  1.000000                                  1,0
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

# Clean-up 
# (delete interim data that are not any longer needed)
rm(coldata,se,data_folder)

2. Adding gene annotations

2.1 org.Hs.eg.db

Bioconductor provides a large collection of Annotation Data Packages:
- https://bioconductor.org/packages/release/data/annotation/

Many annotations for human genes are provided via org.Hs.eg.db data package:
- https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

Information from org.Hs.eg.db can be easily added to Summarised Experiment objects.

# Load database with human genes annotations 
library(org.Hs.eg.db)

# Explore the content of org.Hs.eg.db
org.Hs.eg.db

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2023-Sep11
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2023-07-27
## | GOEGSOURCEDATE: 2023-Sep11
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: 
## | GPSOURCEDATE: 2023-Aug20
## | ENSOURCEDATE: 2023-May10
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Mon Sep 18 16:12:39 2023

# Add gene SYMBOLs to our GSE object
gse <- addIds(gse, "SYMBOL", gene=TRUE)

# Explore the result
# (the SYMBOLs column is added to the right side of the table)
rowRanges(gse)

## GRanges object with 62749 ranges and 6 metadata columns:
##                      seqnames              ranges strand |            gene_id                                                     tx_ids            isoform  max_prop                             iso_prop      SYMBOL
##                         <Rle>           <IRanges>  <Rle> |        <character>                                            <CharacterList>        <character> <numeric>                        <NumericList> <character>
##   ENSG00000000003.16     chrX 100627108-100636806      - | ENSG00000000003.16  ENST00000373020.9,ENST00000612152.4,ENST00000496771.5,...  ENST00000373020.9  0.687237    0.6872372,0.0025714,0.2824258,...      TSPAN6
##    ENSG00000000005.6     chrX 100593624-100597531      + |  ENSG00000000005.6                        ENST00000373031.5,ENST00000485971.1  ENST00000485971.1  1.000000                                  0,1        TNMD
##   ENSG00000000419.14    chr20   50934870-50958532      - | ENSG00000000419.14 ENST00000466152.5,ENST00000371582.8,ENST00000371588.10,... ENST00000371588.10  0.758189    0.0161384,0.0665779,0.0422139,...        DPM1
##   ENSG00000000457.14     chr1 169853074-169893959      - | ENSG00000000457.14 ENST00000367771.11,ENST00000367770.5,ENST00000367772.8,...  ENST00000367772.8  0.515659       0.213536,0.000000,0.515659,...       SCYL3
##   ENSG00000000460.17     chr1 169795043-169804347      + | ENSG00000000460.17  ENST00000498289.5,ENST00000472795.5,ENST00000359326.9,...  ENST00000496973.5  0.383497 0.00838816,0.01530056,0.15528023,...       FIRRM
##                  ...      ...                 ...    ... .                ...                                                        ...                ...       ...                                  ...         ...
##    ENSG00000293556.1     chr7 149776435-149783291      + |  ENSG00000293556.1                        ENST00000486824.3,ENST00000493502.3  ENST00000486824.3  0.921758                  0.9217577,0.0782423        <NA>
##    ENSG00000293557.1     chr7 149824562-149833978      + |  ENSG00000293557.1  ENST00000461331.5,ENST00000488835.5,ENST00000492965.5,...  ENST00000472850.3  0.411791       0.000000,0.173459,0.342876,...        <NA>
##    ENSG00000293558.1     chr7 149810915-149812635      + |  ENSG00000293558.1                                          ENST00000481772.3  ENST00000481772.3  1.000000                                    1        <NA>
##    ENSG00000293559.1     chr7   99238911-99261636      + |  ENSG00000293559.1                                          ENST00000453194.2  ENST00000453194.2  1.000000                                    1        <NA>
##    ENSG00000293560.1     chr7   99297912-99311130      + |  ENSG00000293560.1                        ENST00000452010.2,ENST00000429079.1  ENST00000452010.2  1.000000                                  1,0        <NA>
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

2.2 Gencode GTF

Unfortunately, many ENSEMBL genes still don’t have corresponding SYMBOLS, as could be seen by NA-s at the end of our rowRanges.

# How many Ensembl genes have no SYMBOLs yet?
sum(is.na(rowRanges(gse)$SYMBOL))

## [1] 26279

# The total number of the genes in our GSE:
length(rowRanges(gse))

## [1] 62749

To avoid potential problems caused by the missed SYMBOLS, you may get the “gene names” directly from the GTF file provided by Gencode. Gencode uses SYMBOLs to name the gens where possible, and uses other identifiers (e.g. Ensembl IDs) if the SYMBOL is not available.

IMPORTANT: use the same version of Gencode that was used for Salmon counts!

# Library containing function for reading GFF files
library(rtracklayer)

# GenCode 45 annotations (GTF) file name
gtf_file <- file.path(base_folder, "resources","gencode_45","gencode.v45.primary_assembly.annotation.gtf.gz")

# Read GTF file to a data frame (this may take a couple of minutes ...)
gtf.df <- readGFF(gtf_file)

# Explore annotations imported from the GTF file
head(gtf.df)

##   seqid source       type start   end score strand phase           gene_id                          gene_type gene_name level                 tag     transcript_id transcript_type transcript_name transcript_support_level    havana_transcript exon_number           exon_id    hgnc_id          havana_gene  ont protein_id ccdsid artif_dupl
## 1  chr1 HAVANA       gene 11869 14409    NA      +    NA ENSG00000290825.1                             lncRNA   DDX11L2     2 overlaps_pseudogene              <NA>            <NA>            <NA>                     <NA>                 <NA>        <NA>              <NA>       <NA>                 <NA> <NA>       <NA>   <NA>       <NA>
## 2  chr1 HAVANA transcript 11869 14409    NA      +    NA ENSG00000290825.1                             lncRNA   DDX11L2     2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202                        1 OTTHUMT00000362751.1        <NA>              <NA>       <NA>                 <NA> <NA>       <NA>   <NA>       <NA>
## 3  chr1 HAVANA       exon 11869 12227    NA      +    NA ENSG00000290825.1                             lncRNA   DDX11L2     2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202                        1 OTTHUMT00000362751.1           1 ENSE00002234944.1       <NA>                 <NA> <NA>       <NA>   <NA>       <NA>
## 4  chr1 HAVANA       exon 12613 12721    NA      +    NA ENSG00000290825.1                             lncRNA   DDX11L2     2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202                        1 OTTHUMT00000362751.1           2 ENSE00003582793.1       <NA>                 <NA> <NA>       <NA>   <NA>       <NA>
## 5  chr1 HAVANA       exon 13221 14409    NA      +    NA ENSG00000290825.1                             lncRNA   DDX11L2     2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202                        1 OTTHUMT00000362751.1           3 ENSE00002312635.1       <NA>                 <NA> <NA>       <NA>   <NA>       <NA>
## 6  chr1 HAVANA       gene 12010 13670    NA      +    NA ENSG00000223972.6 transcribed_unprocessed_pseudogene   DDX11L1     2                <NA>              <NA>            <NA>            <NA>                     <NA>                 <NA>        <NA>              <NA> HGNC:37102 OTTHUMG00000000961.2 <NA>       <NA>   <NA>       <NA>

# Make table linking Ensembl gene ID to the Gencode "gene_name"
library(dplyr)
geneId2Name.df <- gtf.df %>% 
  select(gene_id, gene_name) %>% 
  distinct()

# Explore result
head(geneId2Name.df)

##             gene_id   gene_name
## 1 ENSG00000290825.1     DDX11L2
## 2 ENSG00000223972.6     DDX11L1
## 3 ENSG00000227232.6      WASH7P
## 4 ENSG00000278267.1   MIR6859-1
## 5 ENSG00000243485.5 MIR1302-2HG
## 6 ENSG00000284332.1   MIR1302-2

tail(geneId2Name.df)

##                 gene_id       gene_name
## 63236 ENSG00000278573.1 ENSG00000278573
## 63237 ENSG00000276017.1 ENSG00000276017
## 63238 ENSG00000278817.1 ENSG00000278817
## 63239 ENSG00000277196.4 ENSG00000277196
## 63240 ENSG00000278625.1              U6
## 63241 ENSG00000277374.1              U1

# The following 3 lines of code add the Gencode "gene_names" to our GSE object
rownames(geneId2Name.df) <- geneId2Name.df$gene_id
geneId2Name.df <- geneId2Name.df[rownames(gse),]
rowRanges(gse)$gene_name <- geneId2Name.df$gene_name

# Check result
# (column with "gene name" is added to the right side of the table)
rowRanges(gse)

## GRanges object with 62749 ranges and 7 metadata columns:
##                      seqnames              ranges strand |            gene_id                                                     tx_ids            isoform  max_prop                             iso_prop      SYMBOL       gene_name
##                         <Rle>           <IRanges>  <Rle> |        <character>                                            <CharacterList>        <character> <numeric>                        <NumericList> <character>     <character>
##   ENSG00000000003.16     chrX 100627108-100636806      - | ENSG00000000003.16  ENST00000373020.9,ENST00000612152.4,ENST00000496771.5,...  ENST00000373020.9  0.687237    0.6872372,0.0025714,0.2824258,...      TSPAN6          TSPAN6
##    ENSG00000000005.6     chrX 100593624-100597531      + |  ENSG00000000005.6                        ENST00000373031.5,ENST00000485971.1  ENST00000485971.1  1.000000                                  0,1        TNMD            TNMD
##   ENSG00000000419.14    chr20   50934870-50958532      - | ENSG00000000419.14 ENST00000466152.5,ENST00000371582.8,ENST00000371588.10,... ENST00000371588.10  0.758189    0.0161384,0.0665779,0.0422139,...        DPM1            DPM1
##   ENSG00000000457.14     chr1 169853074-169893959      - | ENSG00000000457.14 ENST00000367771.11,ENST00000367770.5,ENST00000367772.8,...  ENST00000367772.8  0.515659       0.213536,0.000000,0.515659,...       SCYL3           SCYL3
##   ENSG00000000460.17     chr1 169795043-169804347      + | ENSG00000000460.17  ENST00000498289.5,ENST00000472795.5,ENST00000359326.9,...  ENST00000496973.5  0.383497 0.00838816,0.01530056,0.15528023,...       FIRRM           FIRRM
##                  ...      ...                 ...    ... .                ...                                                        ...                ...       ...                                  ...         ...             ...
##    ENSG00000293556.1     chr7 149776435-149783291      + |  ENSG00000293556.1                        ENST00000486824.3,ENST00000493502.3  ENST00000486824.3  0.921758                  0.9217577,0.0782423        <NA> ENSG00000293556
##    ENSG00000293557.1     chr7 149824562-149833978      + |  ENSG00000293557.1  ENST00000461331.5,ENST00000488835.5,ENST00000492965.5,...  ENST00000472850.3  0.411791       0.000000,0.173459,0.342876,...        <NA> ENSG00000293557
##    ENSG00000293558.1     chr7 149810915-149812635      + |  ENSG00000293558.1                                          ENST00000481772.3  ENST00000481772.3  1.000000                                    1        <NA> ENSG00000293558
##    ENSG00000293559.1     chr7   99238911-99261636      + |  ENSG00000293559.1                                          ENST00000453194.2  ENST00000453194.2  1.000000                                    1        <NA> ENSG00000293559
##    ENSG00000293560.1     chr7   99297912-99311130      + |  ENSG00000293560.1                        ENST00000452010.2,ENST00000429079.1  ENST00000452010.2  1.000000                                  1,0        <NA> ENSG00000293560
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

# Clean-up
rm(gtf_file, gtf.df, geneId2Name.df)

3. Make DESeq2 object

At this point we have imported Salmon counts to the Summarized Experiment object, and have added some annotations (such as SYMBOLS and “gene names”). Now we are prepared to import data into DESeq DataSet.

3.1 Reading data

# Load DESeq2 library
library(DESeq2)

# Read data
dds <- DESeqDataSet(gse, design = ~ condition)

# Explore result
dds

## class: DESeqDataSet 
## dim: 62749 6 
## metadata(8): tximetaInfo quantInfo ... assignRanges version
## assays(3): counts abundance avgTxLength
## rownames(62749): ENSG00000000003.16 ENSG00000000005.6 ... ENSG00000293559.1 ENSG00000293560.1
## rowData names(7): gene_id tx_ids ... SYMBOL gene_name
## colnames(6): AC1 AC2 ... SCC2 SCC3
## colData names(5): names SRA Depth CCLE condition

# Clean-up
rm(gse)

3.2 DESeq2 design formula

Note the design element of the DESeq object: it describes what analysis we want to perform. In this example we will consider a very simple design: comparing gene expressions between the two types of lung cancer given in the condition column:

design = ~ condition

More complex designs might include covariates (e.g. sex or batch). In such cases, the covariate data should be provided in the samples table, and then included into the design formula. If the design includes covariates, then the main variable of interest (e.g. condition in our example) MUST be kept at the end of the DESeq2 formula. For instance, a design with batch-correction and adjustment for sex could look like this:

design = ~ batch + sex + condition

In other packages this order could be different.

Discussing more complex designs is beyond this introductory course. However, information about advanced use of DESeq2 could be found in the tutorial mentioned above:
- http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

4. Removing low-expressed genes

The low-expressed genes are not informative. At the same time, they may complicate downstream tasks (e.g. make calculations slower, or inflate multiple testing correction). Manual removal of low-expressed genes is optional for DESeq2 analysis, because DESeq2 automatically takes care of them later. However, it is not optional for edgeR analysis.

Conventionally, you may exclude genes with median count per sample less than 10. Alternatively, you may explore the genes abundances and select a customized threshold.

# Calculate median counts 
median_counts <- apply(counts(dds),1,median)

# Explore the median counts distribution
# (vertical line shows arbitrary threshold of count=10)
hist(log2(median_counts), 
     main="Median counts per gene (Log2)",
     xlab="Log2(median count)")
abline(v=log2(10), col="red", lty=2)

# Select the low expressed genes
# using an arbitrary threshold of median count 10 
low_expressed_genes <- median_counts < 10
sum(low_expressed_genes)

## [1] 42184

# Remove the low expressed genes
dds <- dds[!low_expressed_genes, ]
dds

## class: DESeqDataSet 
## dim: 20565 6 
## metadata(8): tximetaInfo quantInfo ... assignRanges version
## assays(3): counts abundance avgTxLength
## rownames(20565): ENSG00000000003.16 ENSG00000000419.14 ... ENSG00000293552.1 ENSG00000293557.1
## rowData names(7): gene_id tx_ids ... SYMBOL gene_name
## colnames(6): AC1 AC2 ... SCC2 SCC3
## colData names(5): names SRA Depth CCLE condition

# Clean-up
rm(low_expressed_genes, median_counts)

5. Explore data before analysis

5.1 Apply Variance-Stabilizing transformation

It is recommended to normalize data for exploration. Here we use Variance-Stabilizing-Transformation (VST), provided by DESeq2 (an alternative is Rlog).

Importantly, we apply VST/Rlog transformation ONLY to the exploratory data! The actual DGE analysis is performed later starting from the raw data.

The boxplots below show that there is no gross outliers in our data.

# Distributions of raw counts
boxplot(log2(assay(dds)), main="Raw Data", ylab="log2(expr)")

# VST (or Rlog): for exploring and visualizing ONLY !
vst <- varianceStabilizingTransformation(dds)
#rlg <- rlog(dds)

# Counts after VST
boxplot(log2(assay(vst)), main="VST Data", ylab="log2(expr)")

5.2 Heatmap

To speed-up computation, we only use 1000 most variable genes for this step. The heatmap below suggests that 1000 most variable genes could separate lung adenocarcinoma and small-cells lung cancers in our dataset.

# Select 1000 most variable genes
variable_genes <- sort(apply(assay(vst),1,var), decreasing=T)[1:1000]
selected_data <- assay(vst)[names(variable_genes),]

# Plot heatmap
heatmap(selected_data, labRow=NA)

# Clean-up
rm(variable_genes, selected_data)

5.3 PCA

plotPCA is a built-in DESeq2 function. The PCA plot confirms that lung adenocarcinomas and small-cell lung cancers have very distinct gene expression pasterns.

plotPCA(vst)

# Clean-up: 
# Note that we don't need VST data any longer
# the actual DGE analysis will start with raw data
rm(vst)

6. Differential gene expression analysis

After the exploration suggested that there is no gross outliers in the data, and that transcriptional profiles are different between the studied groups, we will look for the genes differentially expressed between these groups.

6.1 Core steps of DESeq analysis

The core steps of DESeq analysis includes fitting the negative binomial model and dispersion adjustments are all done with a single DESeq command.

The results of calculations are saved as a DeSEq DataSet object with additional fields containing the calculated statistical metrics. For simplicity, in the code below the result is saved under the same name as the initial DeSEq DataSet object. However, you may note some new “assays” within the new object, and new columns added to the rowData table.

# Learn DESeq command parameters
#?DESeq

# Add information about differential expression to DESeq-DataSet object
dds <- DESeq(dds)

# Check result (there are 7 assays instead of 3)
dds

## class: DESeqDataSet 
## dim: 20565 6 
## metadata(8): tximetaInfo quantInfo ... assignRanges version
## assays(7): counts abundance ... H cooks
## rownames(20565): ENSG00000000003.16 ENSG00000000419.14 ... ENSG00000293552.1 ENSG00000293557.1
## rowData names(29): gene_id tx_ids ... deviance maxCooks
## colnames(6): AC1 AC2 ... SCC2 SCC3
## colData names(5): names SRA Depth CCLE condition

# Row data: many statistical metrics added for for each gene
colnames(rowData(dds))

##  [1] "gene_id"                                                        "tx_ids"                                                         "isoform"                                                        "max_prop"                                                       "iso_prop"                                                       "SYMBOL"                                                         "gene_name"                                                      "baseMean"                                                       "baseVar"                                                        "allZero"                                                        "dispGeneEst"                                                    "dispGeneIter"                                                   "dispFit"                                                        "dispersion"                                                     "dispIter"                                                      
## [16] "dispOutlier"                                                    "dispMAP"                                                        "Intercept"                                                      "condition_Small_Cell_Carcinoma_vs_Adenocarcinoma"               "SE_Intercept"                                                   "SE_condition_Small_Cell_Carcinoma_vs_Adenocarcinoma"            "WaldStatistic_Intercept"                                        "WaldStatistic_condition_Small_Cell_Carcinoma_vs_Adenocarcinoma" "WaldPvalue_Intercept"                                           "WaldPvalue_condition_Small_Cell_Carcinoma_vs_Adenocarcinoma"    "betaConv"                                                       "betaIter"                                                       "deviance"                                                       "maxCooks"

6.2 Variance (dispersion) adjustments

The plot below illustrates the dispersion adjustments: the black dots show observed dispersions, the red line shows the trend (average dispersion depending on the level of gene expression), and the blue dots show the adjusted dispersions used for DGE calculation.

plotDispEsts(dds, main="Dispersion estimates and adjustments")

6.3 Setting FDR thresholds and extracting results

The results of DGE analysis should be extracted from DDESeq Data Set object results() function, which calculates p-values and FDR for a given fold change.

Importantly, the defaults are to calculate FDR<0.1 for any fold change. These are very relaxed thresholds, which could be appropriate only for a very small number of samples per group (a typical case just several years ago, when the high costs of sequencing were prohibitive for larger group sizes).

6.3.1 Default settings

# Extract results for default FDR and log-fold-change
FDR_0.1_any_FC <- results(dds)

# --------- Explore results --------- #

# Summary: numbers of detected genes and
# genes excluded excluded as outliers. 
# Note that DESeq selected 13 as the threshold for excluding low expressed genes
summary(FDR_0.1_any_FC)

## 
## out of 20565 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1131, 5.5%
## LFC < 0 (down)     : 1097, 5.3%
## outliers [1]       : 947, 4.6%
## low counts [2]     : 399, 1.9%
## (mean count < 13)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

# Head and tail of the results table
FDR_0.1_any_FC

## log2 fold change (MLE): condition Small Cell Carcinoma vs Adenocarcinoma 
## Wald test p-value: condition Small Cell Carcinoma vs Adenocarcinoma 
## DataFrame with 20565 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##                    <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.16   2490.54      -0.569151  0.740499 -0.768604 0.4421282  0.770363
## ENSG00000000419.14   6779.61       1.632398  0.835378  1.954083 0.0506914  0.265677
## ENSG00000000457.14   1145.30      -0.473832  0.346140 -1.368903 0.1710295  0.505487
## ENSG00000000460.17   1711.59      -0.313666  0.523161 -0.599559 0.5488003  0.832673
## ENSG00000000971.17   1095.35      -1.317647  2.109042 -0.624761 0.5321278  0.823030
## ...                      ...            ...       ...       ...       ...       ...
## ENSG00000293546.1    79.8270      0.2848225  0.727473  0.391523 0.6954107  0.898247
## ENSG00000293548.1    42.8602     -0.0795315  0.525817 -0.151253 0.8797759  0.963976
## ENSG00000293550.1   402.8264      1.4794925  0.821960  1.799957 0.0718675  0.323623
## ENSG00000293552.1   101.8336     -0.8594089  0.753021 -1.141281 0.2537530  0.605825
## ENSG00000293557.1    67.2056      1.4346120  2.442988  0.587237 0.5570448  0.837506

# Explaining column names of the results table
as.data.frame(mcols(FDR_0.1_any_FC, use.names=T))

##                        type                                                              description
## baseMean       intermediate                                mean of normalized counts for all samples
## log2FoldChange      results log2 fold change (MLE): condition Small Cell Carcinoma vs Adenocarcinoma
## lfcSE               results         standard error: condition Small Cell Carcinoma vs Adenocarcinoma
## stat                results         Wald statistic: condition Small Cell Carcinoma vs Adenocarcinoma
## pvalue              results      Wald test p-value: condition Small Cell Carcinoma vs Adenocarcinoma
## padj                results                                                     BH adjusted p-values

MA plot

MA plot shows fold change depending on the average gene expression. The blue dots show genes,for which changes are considered significant under the selected thresholds.

plotMA(dds, main="MA plot: FDR<0.1 for any fold change")

6.3.2 Customised settings

The more conventional thresholds would be FDR<0.05 for at least 2x fold change. These, or even stricter thresholds, are usually used nowadays when it is affordable to sequence tens (or hundreds) of samples per group.

Note that the number of differentially expressed genes has dropped under the stricter thresholds.

# Calculate results for 2-fold change and FDR<0.05 
# Note that in log2-scale the 2-fold change is equivalent to the lfcThreshold=1
FDR_0.05_2x_change <- results(dds, lfcThreshold=1, alpha=0.05)

# Explore results
summary(FDR_0.05_2x_change)

## 
## out of 20565 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 1.00 (up)    : 405, 2%
## LFC < -1.00 (down) : 379, 1.8%
## outliers [1]       : 947, 4.6%
## low counts [2]     : 0, 0%
## (mean count < 8)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

FDR_0.05_2x_change

## log2 fold change (MLE): condition Small Cell Carcinoma vs Adenocarcinoma 
## Wald test p-value: condition Small Cell Carcinoma vs Adenocarcinoma 
## DataFrame with 20565 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##                    <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.16   2490.54      -0.569151  0.740499  0.000000  1.000000         1
## ENSG00000000419.14   6779.61       1.632398  0.835378  0.757020  0.449038         1
## ENSG00000000457.14   1145.30      -0.473832  0.346140  0.000000  1.000000         1
## ENSG00000000460.17   1711.59      -0.313666  0.523161  0.000000  1.000000         1
## ENSG00000000971.17   1095.35      -1.317647  2.109042 -0.150612  0.880282         1
## ...                      ...            ...       ...       ...       ...       ...
## ENSG00000293546.1    79.8270      0.2848225  0.727473  0.000000  1.000000         1
## ENSG00000293548.1    42.8602     -0.0795315  0.525817  0.000000  1.000000         1
## ENSG00000293550.1   402.8264      1.4794925  0.821960  0.583353  0.559656         1
## ENSG00000293552.1   101.8336     -0.8594089  0.753021  0.000000  1.000000         1
## ENSG00000293557.1    67.2056      1.4346120  2.442988  0.177902  0.858800         1

# Make MA-plot
plotMA(dds, lfcThreshold=1, alpha=0.05, main="MA plot: FDR<0.05 for at least 2x change")

7. Export and explore results

7.1 Save to a data-frame

Typical steps after selecting the desired DESeq thresholds include:

exporting results to a conventional data-frame
adding gene names/symbols to the gene ID-s
removing genes with missed p-values (DESeq does not calculate adjusted p-values for the genes, which it considers insufficiently expressed)
exporting results to a text file for downstream analysis with third-party tools (e.g. enrichment analysis, pathway fitting or biomarker development)

# Extract results to a data frame
result.df <- as.data.frame(FDR_0.05_2x_change)
dim(result.df)

## [1] 20565     6

head(result.df)

##                    baseMean log2FoldChange     lfcSE       stat    pvalue padj
## ENSG00000000003.16 2490.543     -0.5691506 0.7404988  0.0000000 1.0000000    1
## ENSG00000000419.14 6779.605      1.6323983 0.8353780  0.7570205 0.4490376    1
## ENSG00000000457.14 1145.300     -0.4738320 0.3461398  0.0000000 1.0000000    1
## ENSG00000000460.17 1711.593     -0.3136660 0.5231614  0.0000000 1.0000000    1
## ENSG00000000971.17 1095.345     -1.3176474 2.1090420 -0.1506122 0.8802816    1
## ENSG00000001036.14 4424.283     -2.0144005 0.7382122 -1.3741314 0.1694009    1

tail(result.df)

##                    baseMean log2FoldChange     lfcSE       stat    pvalue padj
## ENSG00000293541.1  75.51924     -3.2150821 1.3931679 -1.5899606 0.1118437    1
## ENSG00000293546.1  79.82702      0.2848225 0.7274733  0.0000000 1.0000000    1
## ENSG00000293548.1  42.86017     -0.0795315 0.5258167  0.0000000 1.0000000    1
## ENSG00000293550.1 402.82636      1.4794925 0.8219601  0.5833526 0.5596560    1
## ENSG00000293552.1 101.83360     -0.8594089 0.7530214  0.0000000 1.0000000    1
## ENSG00000293557.1  67.20561      1.4346120 2.4429881  0.1779018 0.8588001    1

# Get gene names from DDS object
gene_names.df <- as.data.frame(rowData(dds)[,c("gene_id","SYMBOL","gene_name")])
dim(gene_names.df)

## [1] 20565     3

head(gene_names.df)

##                               gene_id SYMBOL gene_name
## ENSG00000000003.16 ENSG00000000003.16 TSPAN6    TSPAN6
## ENSG00000000419.14 ENSG00000000419.14   DPM1      DPM1
## ENSG00000000457.14 ENSG00000000457.14  SCYL3     SCYL3
## ENSG00000000460.17 ENSG00000000460.17  FIRRM     FIRRM
## ENSG00000000971.17 ENSG00000000971.17    CFH       CFH
## ENSG00000001036.14 ENSG00000001036.14  FUCA2     FUCA2

tail(gene_names.df)

##                             gene_id SYMBOL       gene_name
## ENSG00000293541.1 ENSG00000293541.1   <NA> ENSG00000293541
## ENSG00000293546.1 ENSG00000293546.1   <NA>         ALG1L9P
## ENSG00000293548.1 ENSG00000293548.1   <NA> ENSG00000293548
## ENSG00000293550.1 ENSG00000293550.1   <NA> ENSG00000293550
## ENSG00000293552.1 ENSG00000293552.1   <NA> ENSG00000293552
## ENSG00000293557.1 ENSG00000293557.1   <NA> ENSG00000293557

# Add gene names to the data frame with results
all(rownames(gene_names.df) == rownames(result.df))

## [1] TRUE

result.df <- data.frame(gene_names.df,result.df)
dim(result.df)

## [1] 20565     9

head(result.df)

##                               gene_id SYMBOL gene_name baseMean log2FoldChange     lfcSE       stat    pvalue padj
## ENSG00000000003.16 ENSG00000000003.16 TSPAN6    TSPAN6 2490.543     -0.5691506 0.7404988  0.0000000 1.0000000    1
## ENSG00000000419.14 ENSG00000000419.14   DPM1      DPM1 6779.605      1.6323983 0.8353780  0.7570205 0.4490376    1
## ENSG00000000457.14 ENSG00000000457.14  SCYL3     SCYL3 1145.300     -0.4738320 0.3461398  0.0000000 1.0000000    1
## ENSG00000000460.17 ENSG00000000460.17  FIRRM     FIRRM 1711.593     -0.3136660 0.5231614  0.0000000 1.0000000    1
## ENSG00000000971.17 ENSG00000000971.17    CFH       CFH 1095.345     -1.3176474 2.1090420 -0.1506122 0.8802816    1
## ENSG00000001036.14 ENSG00000001036.14  FUCA2     FUCA2 4424.283     -2.0144005 0.7382122 -1.3741314 0.1694009    1

tail(result.df)

##                             gene_id SYMBOL       gene_name  baseMean log2FoldChange     lfcSE       stat    pvalue padj
## ENSG00000293541.1 ENSG00000293541.1   <NA> ENSG00000293541  75.51924     -3.2150821 1.3931679 -1.5899606 0.1118437    1
## ENSG00000293546.1 ENSG00000293546.1   <NA>         ALG1L9P  79.82702      0.2848225 0.7274733  0.0000000 1.0000000    1
## ENSG00000293548.1 ENSG00000293548.1   <NA> ENSG00000293548  42.86017     -0.0795315 0.5258167  0.0000000 1.0000000    1
## ENSG00000293550.1 ENSG00000293550.1   <NA> ENSG00000293550 402.82636      1.4794925 0.8219601  0.5833526 0.5596560    1
## ENSG00000293552.1 ENSG00000293552.1   <NA> ENSG00000293552 101.83360     -0.8594089 0.7530214  0.0000000 1.0000000    1
## ENSG00000293557.1 ENSG00000293557.1   <NA> ENSG00000293557  67.20561      1.4346120 2.4429881  0.1779018 0.8588001    1

# Remove genes with no p-value 
# (this includes the low-expressed genes)
excluded_genes <- is.na(result.df$padj)
sum(excluded_genes)

## [1] 947

result.df <- result.df[!excluded_genes,]
dim(result.df)

## [1] 19618     9

# Select columns to save and order by p-value (and fold change)
result.df <- result.df %>% 
  select(gene_id, gene_name, baseMean, log2FoldChange,  padj) %>% 
  arrange(padj, desc(log2FoldChange))
dim(result.df)

## [1] 19618     5

head(result.df)

##                               gene_id gene_name  baseMean log2FoldChange         padj
## ENSG00000130294.18 ENSG00000130294.18     KIF1A 42564.292      11.143601 2.609015e-57
## ENSG00000075340.23 ENSG00000075340.23      ADD2  6060.279       9.964503 5.089675e-41
## ENSG00000105971.15 ENSG00000105971.15      CAV2  9138.902      -8.356510 1.852796e-37
## ENSG00000167644.12 ENSG00000167644.12  C19orf33  1720.753      -9.048687 4.554644e-30
## ENSG00000069188.17 ENSG00000069188.17      SDK2  1194.699       8.980931 1.987185e-29
## ENSG00000159164.10 ENSG00000159164.10      SV2A  3426.134       7.556652 4.332935e-26

tail(result.df)

##                               gene_id       gene_name   baseMean log2FoldChange padj
## ENSG00000237330.3   ENSG00000237330.3          RNF223  333.23756      -4.035052    1
## ENSG00000285294.1   ENSG00000285294.1       LINC00842   56.07101      -4.059909    1
## ENSG00000260537.2   ENSG00000260537.2 ENSG00000260537  141.57984      -4.460399    1
## ENSG00000236362.9   ENSG00000236362.9         GAGE12F 2533.10619      -4.492739    1
## ENSG00000088340.18 ENSG00000088340.18          FER1L4  386.19624      -4.982070    1
## ENSG00000291034.1   ENSG00000291034.1 ENSG00000291034  130.23387      -5.283913    1

# Save results table to a text file
output_folder <- file.path(base_folder,"results","s04_differential_gene_expression")
dir.create(output_folder)
results_file <- file.path(output_folder,"DESeq2_FDR_0.05_2x_change_genes.txt")
write.table(result.df, file=results_file, quote=FALSE, sep="\t", row.names = FALSE, col.names = TRUE)

# Clean-up
rm(excluded_genes, results_file, output_folder, gene_names.df)

7.2 Volcano plot

We will use EnhancedVolcano package to make the volcano plot:

https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html

library(EnhancedVolcano)
#?EnhancedVolcano

EnhancedVolcano(result.df,
    lab = result.df$gene_name,
    x = "log2FoldChange",
    y = "padj", 
    title="DESeq2 results",
    subtitle="Enhanced Volcano Plot")

7.3 Heatmap with top genes

This chunk of code illustrates how to use the pheatmap function for advanced heatmap plotting.

We will use log-transformed TPM values from tximeta. Normalized abundances calculated by VSD or Rlog could also be used.

# Library for advanced heatmap plotting
library(pheatmap)
#?pheatmap

# Library to make green-red palette
library(RColorBrewer)

# Get log-transformed TPM values from tximeta
# +1 in case if some expressions are 0
log_expr_tpm.mx <- log2(assay(dds,"abundance") + 1) 

# Select (arbitrary) cut-offs for top genes
log_fc_cutoff <- 10
adj_p_cutoff <- 1e-5

# Extract results for top genes
top_genes.df <- result.df %>% 
  filter(abs(log2FoldChange) > log_fc_cutoff, padj < adj_p_cutoff)
dim(top_genes.df)

## [1] 31  5

# Print top 10 genes
top_genes.df %>% 
  arrange(padj) %>% 
  head(n=10)

##                               gene_id       gene_name   baseMean log2FoldChange         padj
## ENSG00000130294.18 ENSG00000130294.18           KIF1A 42564.2922       11.14360 2.609015e-57
## ENSG00000136155.17 ENSG00000136155.17            SCEL  1775.8663      -10.13023 2.637792e-20
## ENSG00000185666.16 ENSG00000185666.16            SYN3  1161.9712       10.83813 4.818606e-17
## ENSG00000153292.16 ENSG00000153292.16          ADGRF1 10662.7983      -11.25918 2.508326e-16
## ENSG00000286190.2   ENSG00000286190.2 ENSG00000286190  1026.7065       12.56927 1.215160e-15
## ENSG00000158825.6   ENSG00000158825.6             CDA  1308.2019      -10.21058 3.274156e-14
## ENSG00000163873.10 ENSG00000163873.10           GRIK3  1001.4176       10.96557 5.768153e-11
## ENSG00000132932.19 ENSG00000132932.19          ATP8A2  4731.8992       10.52145 5.099518e-10
## ENSG00000196132.14 ENSG00000196132.14            MYT1  3187.1319       11.05015 9.609516e-10
## ENSG00000259495.3   ENSG00000259495.3        ARNT2-DT   986.7009       10.49658 1.676677e-09

# Get log-transformed expression values for the top genes
top_genes_exprs.mx <- log_expr_tpm.mx[top_genes.df$gene_id, ]

# Change gene IDs to gene names
rownames(top_genes_exprs.mx) <- top_genes.df$gene_name
rownames(top_genes.df) <- top_genes.df$gene_name

# Make palette of 99 colours for heatmap
gbr_99 <- colorRampPalette(c("green", "black", "red"))(n = 99)

# Data frame with condition bar (used by pheatmap)
cases.df <- as.data.frame(colData(dds))
cases.df <- cases.df %>% select(condition)

# Prepare heatmap title
num_genes <- nrow(top_genes.df)
main <- paste(num_genes, "genes with log2(FC) >", log_fc_cutoff, "and adj.P <", adj_p_cutoff, "\n")

# Plot heatmap
pheatmap(
  mat = top_genes_exprs.mx, # matrix with data
  color=gbr_99, # user-defined palette
  scale = "row", 
  annotation_col = cases.df, 
  border_color = NA, # don't draw borders in heatmap
  main = main)

# Clean-up
rm(log_expr_tpm.mx, log_fc_cutoff, adj_p_cutoff, main, cases.df, gbr_99, top_genes_exprs.mx, num_genes, result.df)

7.4 Plotting individual genes

Finally, DESeq2 provides plotCounts function for plotting expressions of individual genes.

gene_id <- "ENSG00000130294.18" 
gene_name <- top_genes.df[top_genes.df$gene_id==gene_id,"gene_name"]
plotCounts(dds, gene=gene_id, main=gene_name)

gene_id <- "ENSG00000136155.17" 
gene_name <- top_genes.df[top_genes.df$gene_id==gene_id,"gene_name"]
plotCounts(dds, gene=gene_id, main=gene_name)

rm(gene_id, gene_name, top_genes.df)

End section

# save image
#save.image(file.path(base_folder, "scripts","s04_differential_gene_expression","DESeq2_analysis.RData"))
save.image("DESeq2_analysis.RData")

# List objects
ls()

## [1] "base_folder"        "dds"                "FDR_0.05_2x_change" "FDR_0.1_any_FC"

# List libraries etc 
# (for reproducible coding/research) 
sessionInfo()

## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3          pheatmap_1.0.12             EnhancedVolcano_1.20.0      ggrepel_0.9.5               ggplot2_3.5.1               DESeq2_1.42.1               dplyr_1.1.4                 rtracklayer_1.62.0          org.Hs.eg.db_3.18.0         SummarizedExperiment_1.32.0 MatrixGenerics_1.14.0       matrixStats_1.3.0           GenomicFeatures_1.54.4      AnnotationDbi_1.64.1        Biobase_2.62.0              GenomicRanges_1.54.1        GenomeInfoDb_1.38.8         IRanges_2.36.0              S4Vectors_0.40.2            BiocGenerics_0.48.1         tximeta_1.20.3             
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2                     bitops_1.0-7                  biomaRt_2.58.2                rlang_1.1.1                   magrittr_2.0.3                compiler_4.3.1                RSQLite_2.3.6                 png_0.1-8                     vctrs_0.6.5                   stringr_1.5.1                 ProtGenerics_1.34.0           pkgconfig_2.0.3               crayon_1.5.2                  fastmap_1.1.1                 dbplyr_2.5.0                  XVector_0.42.0                labeling_0.4.3                utf8_1.2.4                    Rsamtools_2.18.0              promises_1.3.0                rmarkdown_2.26                tzdb_0.4.0                    purrr_1.0.2                   bit_4.0.5                     xfun_0.43                     zlibbioc_1.48.2               cachem_1.0.8                  jsonlite_1.8.8                progress_1.2.3                blob_1.2.4                    highr_0.10                    later_1.3.2                   DelayedArray_0.28.0          
##  [34] BiocParallel_1.36.0           interactiveDisplayBase_1.40.0 parallel_4.3.1                prettyunits_1.2.0             R6_2.5.1                      bslib_0.7.0                   stringi_1.8.3                 jquerylib_0.1.4               Rcpp_1.0.12                   knitr_1.46                    readr_2.1.5                   httpuv_1.6.15                 Matrix_1.6-0                  tidyselect_1.2.1              rstudioapi_0.16.0             abind_1.4-5                   yaml_2.3.8                    codetools_0.2-20              curl_5.2.1                    lattice_0.22-6                tibble_3.2.1                  withr_3.0.0                   shiny_1.8.1.1                 KEGGREST_1.42.0               evaluate_0.23                 BiocFileCache_2.10.2          xml2_1.3.6                    Biostrings_2.70.3             pillar_1.9.0                  BiocManager_1.30.22           filelock_1.0.3                generics_0.1.3                vroom_1.6.5                  
##  [67] RCurl_1.98-1.14               BiocVersion_3.18.1            ensembldb_2.26.1              hms_1.1.3                     munsell_0.5.1                 scales_1.3.0                  xtable_1.8-4                  glue_1.7.0                    lazyeval_0.2.2                tools_4.3.1                   AnnotationHub_3.10.1          BiocIO_1.12.0                 locfit_1.5-9.9                GenomicAlignments_1.38.2      XML_3.99-0.16.1               grid_4.3.1                    colorspace_2.1-0              GenomeInfoDbData_1.2.11       restfulr_0.0.15               cli_3.6.1                     rappdirs_0.3.3                fansi_1.0.6                   S4Arrays_1.2.1                AnnotationFilter_1.26.0       gtable_0.3.5                  sass_0.4.9                    digest_0.6.33                 SparseArray_1.2.4             tximport_1.30.0               farver_2.1.1                  rjson_0.2.21                  memoise_2.0.1                 htmltools_0.5.8              
## [100] lifecycle_1.0.4               httr_1.4.7                    mime_0.12                     bit64_4.0.5

Introduction to differential gene expression analysis with DESeq2

EBI Cancer Genomics & Transcriptomics 2024

Alexey Larionov

27 June 2024