02 plyranges

My Favorite R/Python Package

Author
Affiliation

Simon Holzinger

Published

2024-11-19

Overview

The plyranges (link) provides a tidy interface to Bioconductor’s GenomicRanges and IRanges packages, enabling the use of tidyverse verbs to manipulate GRanges objects in R. It is part of the tidyomics project (link) to create tidy analysis packages for omics data. For more information on the tidy philosophy in data science see tidyverse.

Lee S, Cook D, Lawrence M (2019) plyranges: a grammar of genomic data transformation. Genome Biol 20:4. https://doi.org/10.1186/s13059-018-1597-8

Hutchison WJ, Keyes TJ, tidyomics Consortium, et al (2024) The tidyomics ecosystem: enhancing omic data analyses. Nat Methods 21:1166–1170. https://doi.org/10.1038/s41592-024-02299-2

Installation

Install the plyranges package from Bioconductor:

BiocManager::install("plyranges")
library(plyranges)
library(tidyverse)

Usage

Constructing Ranges

Builds ranges objects directly from tibbles/data.frames. Requires only correctly named columns.

# For Ranges requires two of three out of "start", "end", "width"
df <- tibble(start = c(1, 3, 5, 6),
             end = c(2, 4, 8, 10))

# or 
df <- tibble(start = c(1, 3, 5, 6),
             width = c(2, 2, 4, 5))

# or 
df <- tibble(end = c(2, 4, 8, 10),
             width = c(2, 2, 4, 5))

# build ranges object
ranges <- df %>% as_iranges()
ranges
IRanges object with 4 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         2         2
  [2]         3         4         2
  [3]         5         8         4
  [4]         6        10         5

GRanges requires sequence name column and strand information. Additional columns are kept as metadata!

# GRanges requires additionaly "seqnames" and "strand" column.
gr <- df %>% 
  mutate(seqnames = c("chr1", "chr2", "chr1", "chr1"),
         strand = c("*", "-", "+", "+"),
         GC = c(0.6, 0.54, 0.4, 0.44),
         Group = c("A", "A", "B", "B")) %>% 
  as_granges()

gr
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1       1-2      * |      0.60           A
  [2]     chr2       3-4      - |      0.54           A
  [3]     chr1       5-8      + |      0.40           B
  [4]     chr1      6-10      + |      0.44           B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Tidy working with GRanges

Many tidy verbs such as filter() or mutate() can be used with GRanges objects. And if they are not available for GRanges you can easily convert back and forth between GRanges and tibbles/dataframes. This allows full integration into tidyverse workflows (ggplot, purrr, etc.).

gr %>% 
  group_by(Group) %>% 
  mutate(mean_GC = mean(GC))
GRanges object with 4 ranges and 3 metadata columns:
Groups: Group [2]
      seqnames    ranges strand |        GC       Group   mean_GC
         <Rle> <IRanges>  <Rle> | <numeric> <character> <numeric>
  [1]     chr1       1-2      * |      0.60           A      0.57
  [2]     chr2       3-4      - |      0.54           A      0.57
  [3]     chr1       5-8      + |      0.40           B      0.42
  [4]     chr1      6-10      + |      0.44           B      0.42
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>% 
  filter(Group == "A")
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1       1-2      * |      0.60           A
  [2]     chr2       3-4      - |      0.54           A
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
  filter(start > 4)
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1       5-8      + |      0.40           B
  [2]     chr1      6-10      + |      0.44           B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Example for plotting metadata 
gr %>%
  as_tibble() %>% 
  ggplot(aes(x = Group, y = GC)) +
  geom_boxplot() +
  theme_pubr() +
  labs_pubr()

Resizing of ranges is also easy using mutate()or stretch(). Plyranges provides handy anchoring functions to fix either start, end or center coordinates. They can also be shifted left/right/upstream/downstream.

Anchor options available in plyranges
gr %>%
  anchor_5p() %>%
  stretch(extend = 2)
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1       1-4      * |      0.60           A
  [2]     chr2       1-4      - |      0.54           A
  [3]     chr1      5-10      + |      0.40           B
  [4]     chr1      6-12      + |      0.44           B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
  anchor_centre() %>% 
  mutate(width = 1) # With even numbers it will use the lower position! 
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1         1      * |      0.60           A
  [2]     chr2         3      - |      0.54           A
  [3]     chr1         6      + |      0.40           B
  [4]     chr1         8      + |      0.44           B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>% 
  shift_upstream(shift = 100) # It will also shift ranges into negative values
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |        GC       Group
         <Rle> <IRanges>  <Rle> | <numeric> <character>
  [1]     chr1   -99--98      * |      0.60           A
  [2]     chr2   103-104      - |      0.54           A
  [3]     chr1   -95--92      + |      0.40           B
  [4]     chr1   -94--90      + |      0.44           B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Data import/output

Plyranges provides functions based on rtracklayer for reading/writing BAM, BED, GFF, BigWig etc files. Bed files are automatically appended if the GRanges includes correct columns for the bed format. Use gff3 for the export of more metadata.

annotation <- read_gff("Data/example.gff")

# write_bed() function automatically includes columns with the names specified in the bed format. 
annotation %>% 
  filter(type == "protein_coding_gene") %>% 
  mutate(name = Name, 
         score = 1) %>% 
  write_bed("Data/example.bed")

plyranges automaticaly converts 0-based bed files to 1-based GRanges - and converts them back to 0-based if written as bed.

Working with multiple GRanges

Multiple GRanges objects can be joined together. There are four options: overlap, nearest, follow and precede.

Three different overlap modes are available: inner, left and intersect. The intersect join will return the intersection as ranges. The inner join will return only overlapping ranges and the left join will return all ranges from the query.

These examples are taken directly from the vignette.

query <- data.frame(seqnames = "chr1",
               strand = c("+", "-"),
               start = c(1, 9),
               end =  c(7, 10),
               key.a = letters[1:2]) %>%
  as_granges()

subject <- data.frame(seqnames = "chr1",
               strand = c("-", "+"),
               start = c(2, 6),
               end = c(4, 8),
               key.b = LETTERS[1:2]) %>%
  as_granges()

intersect_rng <- join_overlap_intersect(query, subject)
intersect_rng
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |       key.a       key.b
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1       2-4      + |           a           A
  [2]     chr1       6-7      + |           a           B
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Visualisation of plyranges intersect join.
inner_rng <- join_overlap_inner(query, subject)

# or

inner_rng <- find_overlaps(query, subject)

inner_rng
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |       key.a       key.b
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1       1-7      + |           a           A
  [2]     chr1       1-7      + |           a           B
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Visualisation of plyranges inner join.
left_rng <- join_overlap_left(query, subject)
left_rng
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand |       key.a       key.b
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1       1-7      + |           a           A
  [2]     chr1       1-7      + |           a           B
  [3]     chr1      9-10      - |           b        <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Visualisation of plyranges left join

plyranges also provides functions to find and join nearest, preceding and following Ranges.

Scheme for join nearest operators

Usage examples

Example: Find atac peaks in 5’UTRs of genes

UTR <- read_gff("Data/example.gff") %>% 
  filter(type == "five_prime_UTR") %>% 
  mutate(name = as.character(Parent)) %>% 
  select(name)

atac <- read_bed("Data/atacdata_chr1.bed") %>%
  select(score) %>% 
  mutate(atacID = 1:length(.))

atac %>% 
  join_overlap_inner(UTR)
GRanges object with 6 ranges and 3 metadata columns:
         seqnames        ranges strand |     score    atacID            name
            <Rle>     <IRanges>  <Rle> | <numeric> <integer>     <character>
  [1] Pf3D7_01_v3 116969-117261      * |        89        10 PF3D7_0102500.1
  [2] Pf3D7_01_v3 128348-128527      * |       143        12 PF3D7_0102800.1
  [3] Pf3D7_01_v3 131134-131362      * |       128        13 PF3D7_0102900.1
  [4] Pf3D7_01_v3   97588-97742      * |        68        73 PF3D7_0102200.1
  [5] Pf3D7_01_v3   97829-98297      * |       170        84 PF3D7_0102200.1
  [6] Pf3D7_01_v3 103858-104272      * |       169        95 PF3D7_0102300.1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Example: How many nucleosomes are in the coding region of genes?

nucs <- read_gff("Data/nuc_chr1.gff") %>%
  select(occupancy, dyad, fuzziness, nucID)

exons <- read_gff("Data/example.gff") %>% 
  filter(type == "exon") %>% 
  mutate(name = as.character(Parent)) %>% 
  select(name)

exons %>% 
  join_overlap_inner(nucs) %>% 
  group_by(name) %>% 
  summarise(n = plyranges::n())
DataFrame with 21 rows and 2 columns
               name         n
        <character> <integer>
1   PF3D7_0100100.1         1
2   PF3D7_0100300.1         4
3   PF3D7_0100500.1         1
4   PF3D7_0100600.1         2
5   PF3D7_0100700.1         1
...             ...       ...
17  PF3D7_0102500.1        10
18  PF3D7_0102700.1         4
19  PF3D7_0102800.1         3
20  PF3D7_0102900.1         2
21  PF3D7_0103000.1         4

Example: Compare fuzziness of nearest nucleosomes to TSS with other nucleosomes

# get nucleosome data
nucs <- read_gff("Data/nuc_chr1.gff") %>%
  select(occupancy, dyad, fuzziness, nucID)

# get TSS
TSS <- read_gff("Data/example.gff") %>% 
  filter(type == "mRNA") %>% 
  anchor_5p() %>%
  mutate(width = 1) %>% 
  mutate(name = as.character(Parent)) %>% 
  select(name)

# find nucleosomes nearest to TSS
TSS_nucs <- TSS %>% 
  join_nearest(nucs)

# categorize nucleosomes and plot
nucs %>% 
  filter(end < max(start(TSS))) %>% 
  mutate(type = ifelse(nucID %in% TSS_nucs$nucID, "TSS", "nonTSS")) %>% 
  as_tibble() %>% 
  mutate(fuzziness = as.numeric(fuzziness)) %>%
  ggplot(aes(x = type, y = fuzziness)) +
  geom_boxplot() +
  theme_pubr() +
  labs_pubr()

Problems

Computing efficiency

Working with big GRanges objects can be very slow. It might make sense to work with normal tables instead.

Example: calculating the average accessibility (bigwig) for each nucleosome position (bed).

bw <- read_bigwig("Data/atac_chr1.bw")
nucs <- read_gff("Data/nuc_chr1.gff") %>%
  select(occupancy, dyad, fuzziness, nucID)

# Turn bigwig into a list of vectors
factBW <- bw %>% 
  coverage(., weight = .$score) %>% 
      as.list(.) %>% 
      map(~as.vector(.))

# use nucleosome coordinates to select values
mapped_mean_acc <- nucs %>% 
  as_tibble() %>% 
  group_by(nucID) %>% 
  nest() %>% 
      mutate(score = map(data, ~factBW[[as.character(.x$seqnames)]] %>% # first select chromosome vector
                           .[.x$start:.x$end] %>% # then select values
                           mean())) %>% # calculate means
      dplyr::select(-data, nucID) %>% 
      mutate(score = unlist(score))

nucs %>% 
  as_tibble() %>% 
  left_join(mapped_mean_acc) %>% 
  as_granges()
Joining with `by = join_by(nucID)`
GRanges object with 700 ranges and 5 metadata columns:
           seqnames        ranges strand |   occupancy        dyad
              <Rle>     <IRanges>  <Rle> | <character> <character>
    [1] Pf3D7_01_v3     7882-8021      * |      206.41        7951
    [2] Pf3D7_01_v3     8552-8691      * |      170.71        8621
    [3] Pf3D7_01_v3 589802-589941      * |       30.56      589871
    [4] Pf3D7_01_v3     7372-7511      * |      111.78        7441
    [5] Pf3D7_01_v3 591672-591811      * |       68.16      591741
    ...         ...           ...    ... .         ...         ...
  [696] Pf3D7_01_v3 213152-213291      * |       95.85      213221
  [697] Pf3D7_01_v3 634042-634181      * |      131.63      634111
  [698] Pf3D7_01_v3 126002-126141      * |      146.51      126071
  [699] Pf3D7_01_v3 476242-476381      * |      101.59      476311
  [700] Pf3D7_01_v3   21062-21201      * |       54.06       21131
            fuzziness       nucID     score
          <character> <character> <numeric>
    [1]  30.333783693           1  0.402595
    [2] 31.0467124486           2  0.253790
    [3] 31.5630071002           3  0.664259
    [4] 32.1329350846           4  0.271278
    [5] 32.4936245399           5  1.316213
    ...           ...         ...       ...
  [696] 50.6708714049         696  2.016190
  [697] 50.6913731901         697  0.209215
  [698]  50.695407322         698  1.063388
  [699] 50.7011194898         699  0.701590
  [700] 50.7085619858         700  0.142421
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

mixup with dplyr/tidyverse

Sometimes functions break because they mix with their tidyverse counterparts.

Summary

The plyranges package allows frictionless usage of GRanges in a tidy data background. It makes filtering, manipulation, import and export of genomic data super easy.

Learning More

There are several online resources and workshops to see what is possible and learn more:

References

Lee S, Cook D, Lawrence M (2019) plyranges: a grammar of genomic data transformation. Genome Biol 20:4. https://doi.org/10.1186/s13059-018-1597-8