4 Preparing gene information

TCGA used the gencode.v22.annotation.gtf annotation file to generate RNA-seq counts:

For convinience, this GTF file was downloaded to the resources folder for this tutorial. However, it can also be downloaded directly from gencode:

GTF files can be red into R data-frame using readGFF function from rtracklayer package.

4.0.1 Load required libraries

Warnings and messages printed during the library loading would clutter the tutorial. Thus they are supressed in this code. This should not be done during actual analysis because these Warnings and Messages are important. Thus, many functions in genomic packages may have the same names as dplyr functions. So, ignoring Warnings and Messages may lead to unexpected interferences.

#install.packages(dplyr)
suppressWarnings(suppressMessages(library(dplyr)))

#install.packages("BiocManager")
#BiocManager::install("rtracklayer")
suppressWarnings(suppressMessages(library(rtracklayer))) # for readGFF()

4.0.2 Read genes data

The genes’ Names and IDs must be read from the file that was used during RNA-seq counts preparation. In case of the TCGA RNA-seq data, this is gencode.v22.annotation.gtf.

# Read appropriate annotations file
gencode_22_file <- file.path(base_folder,"resources","gencode.v22.annotation.gtf")
gencode_22.df <- readGFF(gencode_22_file)

# Explore the annotations
dim(gencode_22.df)
## [1] 2563671      26
colnames(gencode_22.df)
##  [1] "seqid"                    "source"                   "type"                     "start"                    "end"                      "score"                    "strand"                   "phase"                    "gene_id"                  "gene_type"                "gene_status"              "gene_name"                "level"                    "havana_gene"              "transcript_id"            "transcript_type"          "transcript_status"        "transcript_name"          "tag"                      "transcript_support_level" "havana_transcript"        "exon_number"              "exon_id"                  "ont"                      "protein_id"               "ccdsid"
# Extract gene_id-s and gene_name-s
genes.df <- gencode_22.df %>% 
  select(gene_id, gene_name) %>% 
  arrange(gene_name) %>% 
  distinct()

# Copy gene IDs to rownames 
rownames(genes.df) <- genes.df$gene_id

# Check result
head(genes.df) # Interestingly, the gene names are not unique
##                             gene_id gene_name
## ENSG00000275877.1 ENSG00000275877.1 5_8S_rRNA
## ENSG00000276871.1 ENSG00000276871.1 5_8S_rRNA
## ENSG00000252830.2 ENSG00000252830.2   5S_rRNA
## ENSG00000276442.1 ENSG00000276442.1   5S_rRNA
## ENSG00000274408.1 ENSG00000274408.1   5S_rRNA
## ENSG00000274059.1 ENSG00000274059.1   5S_rRNA
tail(genes.df)
##                               gene_id gene_name
## ENSG00000203995.8   ENSG00000203995.8    ZYG11A
## ENSG00000232242.2   ENSG00000232242.2  ZYG11AP1
## ENSG00000162378.11 ENSG00000162378.11    ZYG11B
## ENSG00000159840.14 ENSG00000159840.14       ZYX
## ENSG00000074755.13 ENSG00000074755.13     ZZEF1
## ENSG00000036549.11 ENSG00000036549.11      ZZZ3
# Clean-up
rm(gencode_22_file, gencode_22.df)