BiocManager::install("plyranges")
library(plyranges)
library(tidyverse)Overview
The plyranges (link) provides a tidy interface to Bioconductor’s GenomicRanges and IRanges packages, enabling the use of tidyverse verbs to manipulate GRanges objects in R. It is part of the tidyomics project (link) to create tidy analysis packages for omics data. For more information on the tidy philosophy in data science see tidyverse.
Lee S, Cook D, Lawrence M (2019) plyranges: a grammar of genomic data transformation. Genome Biol 20:4. https://doi.org/10.1186/s13059-018-1597-8
Hutchison WJ, Keyes TJ, tidyomics Consortium, et al (2024) The tidyomics ecosystem: enhancing omic data analyses. Nat Methods 21:1166–1170. https://doi.org/10.1038/s41592-024-02299-2
Installation
Install the plyranges package from Bioconductor:
Usage
Constructing Ranges
Builds ranges objects directly from tibbles/data.frames. Requires only correctly named columns.
# For Ranges requires two of three out of "start", "end", "width"
df <- tibble(start = c(1, 3, 5, 6),
end = c(2, 4, 8, 10))
# or
df <- tibble(start = c(1, 3, 5, 6),
width = c(2, 2, 4, 5))
# or
df <- tibble(end = c(2, 4, 8, 10),
width = c(2, 2, 4, 5))
# build ranges object
ranges <- df %>% as_iranges()
rangesIRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 2 2
[2] 3 4 2
[3] 5 8 4
[4] 6 10 5
GRanges requires sequence name column and strand information. Additional columns are kept as metadata!
# GRanges requires additionaly "seqnames" and "strand" column.
gr <- df %>%
mutate(seqnames = c("chr1", "chr2", "chr1", "chr1"),
strand = c("*", "-", "+", "+"),
GC = c(0.6, 0.54, 0.4, 0.44),
Group = c("A", "A", "B", "B")) %>%
as_granges()
grGRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 1-2 * | 0.60 A
[2] chr2 3-4 - | 0.54 A
[3] chr1 5-8 + | 0.40 B
[4] chr1 6-10 + | 0.44 B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Tidy working with GRanges
Many tidy verbs such as filter() or mutate() can be used with GRanges objects. And if they are not available for GRanges you can easily convert back and forth between GRanges and tibbles/dataframes. This allows full integration into tidyverse workflows (ggplot, purrr, etc.).
gr %>%
group_by(Group) %>%
mutate(mean_GC = mean(GC))GRanges object with 4 ranges and 3 metadata columns:
Groups: Group [2]
seqnames ranges strand | GC Group mean_GC
<Rle> <IRanges> <Rle> | <numeric> <character> <numeric>
[1] chr1 1-2 * | 0.60 A 0.57
[2] chr2 3-4 - | 0.54 A 0.57
[3] chr1 5-8 + | 0.40 B 0.42
[4] chr1 6-10 + | 0.44 B 0.42
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
filter(Group == "A")GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 1-2 * | 0.60 A
[2] chr2 3-4 - | 0.54 A
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
filter(start > 4)GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 5-8 + | 0.40 B
[2] chr1 6-10 + | 0.44 B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Example for plotting metadata
gr %>%
as_tibble() %>%
ggplot(aes(x = Group, y = GC)) +
geom_boxplot() +
theme_pubr() +
labs_pubr()Resizing of ranges is also easy using mutate()or stretch(). Plyranges provides handy anchoring functions to fix either start, end or center coordinates. They can also be shifted left/right/upstream/downstream.
plyrangesgr %>%
anchor_5p() %>%
stretch(extend = 2)GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 1-4 * | 0.60 A
[2] chr2 1-4 - | 0.54 A
[3] chr1 5-10 + | 0.40 B
[4] chr1 6-12 + | 0.44 B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
anchor_centre() %>%
mutate(width = 1) # With even numbers it will use the lower position! GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 1 * | 0.60 A
[2] chr2 3 - | 0.54 A
[3] chr1 6 + | 0.40 B
[4] chr1 8 + | 0.44 B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr %>%
shift_upstream(shift = 100) # It will also shift ranges into negative valuesGRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | GC Group
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 -99--98 * | 0.60 A
[2] chr2 103-104 - | 0.54 A
[3] chr1 -95--92 + | 0.40 B
[4] chr1 -94--90 + | 0.44 B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Data import/output
Plyranges provides functions based on rtracklayer for reading/writing BAM, BED, GFF, BigWig etc files. Bed files are automatically appended if the GRanges includes correct columns for the bed format. Use gff3 for the export of more metadata.
annotation <- read_gff("Data/example.gff")
# write_bed() function automatically includes columns with the names specified in the bed format.
annotation %>%
filter(type == "protein_coding_gene") %>%
mutate(name = Name,
score = 1) %>%
write_bed("Data/example.bed")plyranges automaticaly converts 0-based bed files to 1-based GRanges - and converts them back to 0-based if written as bed.
Working with multiple GRanges
Multiple GRanges objects can be joined together. There are four options: overlap, nearest, follow and precede.
Three different overlap modes are available: inner, left and intersect. The intersect join will return the intersection as ranges. The inner join will return only overlapping ranges and the left join will return all ranges from the query.
These examples are taken directly from the vignette.
query <- data.frame(seqnames = "chr1",
strand = c("+", "-"),
start = c(1, 9),
end = c(7, 10),
key.a = letters[1:2]) %>%
as_granges()
subject <- data.frame(seqnames = "chr1",
strand = c("-", "+"),
start = c(2, 6),
end = c(4, 8),
key.b = LETTERS[1:2]) %>%
as_granges()
intersect_rng <- join_overlap_intersect(query, subject)
intersect_rngGRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | key.a key.b
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr1 2-4 + | a A
[2] chr1 6-7 + | a B
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
inner_rng <- join_overlap_inner(query, subject)
# or
inner_rng <- find_overlaps(query, subject)
inner_rngGRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | key.a key.b
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr1 1-7 + | a A
[2] chr1 1-7 + | a B
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
left_rng <- join_overlap_left(query, subject)
left_rngGRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | key.a key.b
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr1 1-7 + | a A
[2] chr1 1-7 + | a B
[3] chr1 9-10 - | b <NA>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
plyranges also provides functions to find and join nearest, preceding and following Ranges.
Usage examples
Example: Find atac peaks in 5’UTRs of genes
UTR <- read_gff("Data/example.gff") %>%
filter(type == "five_prime_UTR") %>%
mutate(name = as.character(Parent)) %>%
select(name)
atac <- read_bed("Data/atacdata_chr1.bed") %>%
select(score) %>%
mutate(atacID = 1:length(.))
atac %>%
join_overlap_inner(UTR)GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | score atacID name
<Rle> <IRanges> <Rle> | <numeric> <integer> <character>
[1] Pf3D7_01_v3 116969-117261 * | 89 10 PF3D7_0102500.1
[2] Pf3D7_01_v3 128348-128527 * | 143 12 PF3D7_0102800.1
[3] Pf3D7_01_v3 131134-131362 * | 128 13 PF3D7_0102900.1
[4] Pf3D7_01_v3 97588-97742 * | 68 73 PF3D7_0102200.1
[5] Pf3D7_01_v3 97829-98297 * | 170 84 PF3D7_0102200.1
[6] Pf3D7_01_v3 103858-104272 * | 169 95 PF3D7_0102300.1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Example: How many nucleosomes are in the coding region of genes?
nucs <- read_gff("Data/nuc_chr1.gff") %>%
select(occupancy, dyad, fuzziness, nucID)
exons <- read_gff("Data/example.gff") %>%
filter(type == "exon") %>%
mutate(name = as.character(Parent)) %>%
select(name)
exons %>%
join_overlap_inner(nucs) %>%
group_by(name) %>%
summarise(n = plyranges::n())DataFrame with 21 rows and 2 columns
name n
<character> <integer>
1 PF3D7_0100100.1 1
2 PF3D7_0100300.1 4
3 PF3D7_0100500.1 1
4 PF3D7_0100600.1 2
5 PF3D7_0100700.1 1
... ... ...
17 PF3D7_0102500.1 10
18 PF3D7_0102700.1 4
19 PF3D7_0102800.1 3
20 PF3D7_0102900.1 2
21 PF3D7_0103000.1 4
Example: Compare fuzziness of nearest nucleosomes to TSS with other nucleosomes
# get nucleosome data
nucs <- read_gff("Data/nuc_chr1.gff") %>%
select(occupancy, dyad, fuzziness, nucID)
# get TSS
TSS <- read_gff("Data/example.gff") %>%
filter(type == "mRNA") %>%
anchor_5p() %>%
mutate(width = 1) %>%
mutate(name = as.character(Parent)) %>%
select(name)
# find nucleosomes nearest to TSS
TSS_nucs <- TSS %>%
join_nearest(nucs)
# categorize nucleosomes and plot
nucs %>%
filter(end < max(start(TSS))) %>%
mutate(type = ifelse(nucID %in% TSS_nucs$nucID, "TSS", "nonTSS")) %>%
as_tibble() %>%
mutate(fuzziness = as.numeric(fuzziness)) %>%
ggplot(aes(x = type, y = fuzziness)) +
geom_boxplot() +
theme_pubr() +
labs_pubr()Problems
Computing efficiency
Working with big GRanges objects can be very slow. It might make sense to work with normal tables instead.
Example: calculating the average accessibility (bigwig) for each nucleosome position (bed).
bw <- read_bigwig("Data/atac_chr1.bw")
nucs <- read_gff("Data/nuc_chr1.gff") %>%
select(occupancy, dyad, fuzziness, nucID)
# Turn bigwig into a list of vectors
factBW <- bw %>%
coverage(., weight = .$score) %>%
as.list(.) %>%
map(~as.vector(.))
# use nucleosome coordinates to select values
mapped_mean_acc <- nucs %>%
as_tibble() %>%
group_by(nucID) %>%
nest() %>%
mutate(score = map(data, ~factBW[[as.character(.x$seqnames)]] %>% # first select chromosome vector
.[.x$start:.x$end] %>% # then select values
mean())) %>% # calculate means
dplyr::select(-data, nucID) %>%
mutate(score = unlist(score))
nucs %>%
as_tibble() %>%
left_join(mapped_mean_acc) %>%
as_granges()Joining with `by = join_by(nucID)`
GRanges object with 700 ranges and 5 metadata columns:
seqnames ranges strand | occupancy dyad
<Rle> <IRanges> <Rle> | <character> <character>
[1] Pf3D7_01_v3 7882-8021 * | 206.41 7951
[2] Pf3D7_01_v3 8552-8691 * | 170.71 8621
[3] Pf3D7_01_v3 589802-589941 * | 30.56 589871
[4] Pf3D7_01_v3 7372-7511 * | 111.78 7441
[5] Pf3D7_01_v3 591672-591811 * | 68.16 591741
... ... ... ... . ... ...
[696] Pf3D7_01_v3 213152-213291 * | 95.85 213221
[697] Pf3D7_01_v3 634042-634181 * | 131.63 634111
[698] Pf3D7_01_v3 126002-126141 * | 146.51 126071
[699] Pf3D7_01_v3 476242-476381 * | 101.59 476311
[700] Pf3D7_01_v3 21062-21201 * | 54.06 21131
fuzziness nucID score
<character> <character> <numeric>
[1] 30.333783693 1 0.402595
[2] 31.0467124486 2 0.253790
[3] 31.5630071002 3 0.664259
[4] 32.1329350846 4 0.271278
[5] 32.4936245399 5 1.316213
... ... ... ...
[696] 50.6708714049 696 2.016190
[697] 50.6913731901 697 0.209215
[698] 50.695407322 698 1.063388
[699] 50.7011194898 699 0.701590
[700] 50.7085619858 700 0.142421
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
mixup with dplyr/tidyverse
Sometimes functions break because they mix with their tidyverse counterparts.
Summary
The plyranges package allows frictionless usage of GRanges in a tidy data background. It makes filtering, manipulation, import and export of genomic data super easy.
Learning More
There are several online resources and workshops to see what is possible and learn more:
References
Lee S, Cook D, Lawrence M (2019) plyranges: a grammar of genomic data transformation. Genome Biol 20:4. https://doi.org/10.1186/s13059-018-1597-8