vrtility is an R package that aims to make the best use of GDAL’s VRT capabilities for efficient processing of large raster datasets, primarily with Earth Observation in mind. The package’s main focus is on GDAL VRT pixel functions written in Python. These NumPy-based pixel functions are used to apply cloud masks and to summarise pixel values (e.g. the median) across multiple images (i.e. to create a composite image). These features are made possible by the {gdalraster} and {reticulate} packages.
Caution
This package is under active development and is likely to change. Contributions and suggestions are still very welcome!
- No intermediate downloads: the use of nested VRTs enables the download and processing of only the required data in a single `gdalwarp` (or `gdal_translate`) call. This reduces disk read/write time.
- Modular design: we are essentially building remote sensing pipelines from nested VRTs. This allows for the easy addition of new pixel functions and masking functions, and could easily be adapted for deriving spectral indices or calculating complex time series functions.
- Extremely efficient parallel processing with {gdalraster} and {mirai} when using the “gdalraster” compute engine.
You can install the development version of vrtility from GitHub with:
# install.packages("pak")
pak::pkg_install("Permian-Global-Research/vrtility")
Here is a simple example where we:

- Define a bounding box and search a STAC catalog for Sentinel-2 data.
- Create a `vrt_collection` object - essentially a list of individual VRTs (each making up one image), which we refer to as `vrt_block`s in this package.
- Apply the mask using pixel functions. This simply modifies the XML of the VRT “blocks”.
- Because this set of images spans more than one spatial reference system (SRS), convert the `vrt_block`s in the `vrt_collection` to warped VRTs, giving us a `vrt_collection_warped` object.
- “Stack” these images (combine them into a single VRT with multiple layers in each VRTRasterBand), giving us a `vrt_stack` object.
- Add a median pixel function to the `vrt_stack`.
- Execute all of this at the end of the VRT pipeline using `vrt_compute`. Here we use the `gdalraster` engine to write the output which, in combination with the {mirai} package, downloads and processes the data in parallel across bands and within bands (as determined by the `nsplits` argument).
library(vrtility)
# Set up asynchronous workers to parallelise vrt_collect and vrt_set_maskfun
mirai::daemons(6)
#> [1] 6
bbox <- gdalraster::bbox_from_wkt(
  wkt = wk::wkt("POINT (144.3 -7.6)"),
  extend_x = 0.17,
  extend_y = 0.125
)
te <- bbox_to_projected(bbox)
trs <- attr(te, "wkt")
s2_stac <- sentinel2_stac_query(
  bbox = bbox,
  start_date = "2023-01-01",
  end_date = "2023-12-31",
  max_cloud_cover = 35,
  assets = c("B02", "B03", "B04", "SCL")
)
# number of items:
length(s2_stac$features)
#> [1] 12
system.time({
  median_composite <- vrt_collect(s2_stac) |>
    vrt_set_maskfun(
      mask_band = "SCL",
      mask_values = c(0, 1, 2, 3, 8, 9, 10, 11)
    ) |>
    vrt_warp(t_srs = trs, te = te, tr = c(10, 10)) |>
    vrt_stack() |>
    vrt_set_pixelfun() |>
    vrt_compute(
      outfile = fs::file_temp(ext = "tif"),
      engine = "gdalraster",
      nsplits = 3L
    )
})
#> user system elapsed
#> 2.182 0.127 38.621
plot_raster_src(
  median_composite,
  c(3, 2, 1)
)
{vrtility} uses {mirai}, alongside {purrr}, to manage asynchronous parallelisation. By setting `mirai::daemons(n)` before running the VRT pipeline, we can improve performance, depending on the speed of the server holding the data. In some cases this will make little difference; for example, the Microsoft Planetary Computer STAC API is already pretty fast. However, for NASA’s Earthdata STAC API, it can make a huge difference. Parallelism is available in three functions at present: `vrt_collect`, `vrt_set_maskfun` and `vrt_compute`. In order to use asynchronous processing in the `vrt_compute` function, we need to set `engine = "gdalraster"`. The “gdalraster” engine is always parallelised across bands by default; a further nested parallelisation step is possible within bands by setting `nsplits` to a value greater than 1. If you want to reduce the number of processes used, explicitly set `mirai::daemons(1)` or just use the “warp” engine.
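As a minimal sketch of these controls (the pipeline itself is elided here, and the worker and split counts are purely illustrative):

```r
library(vrtility)

# Six background workers: vrt_collect(), vrt_set_maskfun() and
# vrt_compute(engine = "gdalraster") can now run asynchronously.
mirai::daemons(6)

# ... build and run the vrt pipeline, e.g. ending with
# vrt_compute(..., engine = "gdalraster", nsplits = 2L)
# for nested within-band parallelism ...

# Reset the daemons when finished to return to serial processing.
mirai::daemons(0)
```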
We can also use on-disk raster files, as shown here with this example dataset. Note that these inputs have multiple spatial reference systems, so we need to warp them (as in the example above) before stacking. If your images are all in the same CRS, you might save a lot of time by warping only once, in `vrt_compute`. We can plot these `vrt_{x}` objects using `plot()`, but note that for very large rasters where we are computing pixel functions this can be slow, and we are better off using `vrt_compute` to write to disk and then plotting the output.
s2files <- fs::dir_ls(system.file("s2-data", package = "vrtility"))[1:4]
ex_collect <- vrt_collect(s2files)
par(mfrow = c(2, 2))
purrr::walk(
  seq_len(ex_collect$n_items),
  ~ plot(ex_collect, item = .x, bands = c(3, 2, 1))
)
ex_collect_mask <- vrt_set_maskfun(
  ex_collect,
  mask_band = "SCL",
  mask_values = c(0, 1, 2, 3, 8, 9, 10, 11)
)
purrr::walk(
  seq_len(ex_collect_mask$n_items),
  ~ plot(ex_collect_mask, item = .x, bands = c(3, 2, 1))
)
# extract a block to use as a template for warping
t_block <- ex_collect[[1]][[4]]
ex_composite <- vrt_warp(
  ex_collect_mask,
  t_srs = t_block$srs,
  te = t_block$bbox,
  tr = c(20, 20)
) |>
  vrt_stack() |>
  vrt_set_pixelfun(pixfun = median_numpy())
par(mfrow = c(1, 1))
plot(ex_composite, bands = c(3, 2, 1), quiet = TRUE)
# write to disk if we wanted to...
# vrt_compute(
#   ex_composite,
#   outfile = fs::file_temp(ext = "tif"),
#   engine = "warp"
# )
- Add additional pixel functions (the geometric median in particular).
- Add default C++ pixel functions.
- Add time series functions.
- Add custom C++ or expression-based pixel functions.