Handling of geoparquet when not loading geoarrow #28

Open
kylebutts opened this issue Dec 6, 2023 · 8 comments

@kylebutts

First of all, thanks for this awesome work. It's been great to see the progress on all this :-)

In the example in the README, you load a .parquet file that contains a geometry column. Since there is no separate naming convention (e.g. .geo.parquet or .geoparquet), I might not know that there is a geometry in there, so I would just load arrow and open the dataset as normal. The geometry column I get back would then be confusing, because the result depends on whether the geoarrow package is loaded.

library(tidyverse)
library(arrow)

open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <arrow_binary[1]>
#> [1] 01, 06, 00, 00, 00, 01, 00, 00, 00, 01, 03, 00, 00, 00, 01, 00, 00, 00, 1b, 00, 00, 00, 00, 00, 00, a0, 41, 5e, 54, c0, 00, 00, ...

library(geoarrow)
open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <geoarrow_wkb[1]>
#> [1] MULTIPOLYGON (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27359, -81.63306 36.34069, -81.74107 36.39178, -81.69828 36.47178...

This issue might belong in the R arrow package instead, but I'm wondering whether arrow should detect that a geometry column is present and adjust its behavior (the metadata is in the file, so this information is known). For example, when calling collect(), should there be a warning that a geometry column is being collected and that geoarrow::st_collect() might be the better option (as in #21)? Or a warning when opening a GeoParquet file without geoarrow loaded?

library(tidyverse)
library(arrow)

nc = open_dataset("~/Desktop/nc.parquet") 
# We know there is a geometry from the metadata
nc$metadata[[1]]
#> [1] "{\"version\":\"0.3.0\",\"primary_column\":\"geometry\",\"columns\":{\"geometry\":{\"encoding\":\"WKB\",\"crs\":\"GEOGCS[\\\"NAD27\\\",DATUM[\\\"North_American_Datum_1927\\\",SPHEROID[\\\"Clarke 1866\\\",6378206.4,294.978698213898]],PRIMEM[\\\"Greenwich\\\",0],UNIT[\\\"degree\\\",0.0174532925199433,AUTHORITY[\\\"EPSG\\\",\\\"9122\\\"]],AXIS[\\\"Latitude\\\",NORTH],AXIS[\\\"Longitude\\\",EAST],AUTHORITY[\\\"EPSG\\\",\\\"4267\\\"]]\",\"bbox\":[-84.3239,33.882,-75.457,36.5896],\"geometry_type\":\"MultiPolygon\"}}}"
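
To make the idea concrete, here is a minimal sketch of how the geometry columns could be read out of that metadata string. It is not something arrow or geoarrow does today: `geo_columns()` is a hypothetical helper, the jsonlite dependency is my own choice, and the example string is a shortened copy of the output above. It also assumes the metadata is stored under the "geo" key, as the GeoParquet spec describes.

```r
library(jsonlite)

# Hypothetical helper (not part of arrow or geoarrow): return the names of the
# geometry columns declared in a GeoParquet "geo" metadata string, or
# character(0) if there are none.
geo_columns <- function(geo_json) {
  if (is.null(geo_json) || !nzchar(geo_json)) {
    return(character(0))
  }
  meta <- fromJSON(geo_json, simplifyVector = FALSE)
  cols <- names(meta$columns)
  if (is.null(cols)) character(0) else cols
}

# Shortened version of the metadata shown above (crs and bbox omitted)
example_geo <- '{"version":"0.3.0","primary_column":"geometry","columns":{"geometry":{"encoding":"WKB","geometry_type":"MultiPolygon"}}}'

geo_cols <- geo_columns(example_geo)

# Sketch of the kind of hint arrow could give when geoarrow is not loaded
if (length(geo_cols) > 0 && !isNamespaceLoaded("geoarrow")) {
  message(
    "Geometry column(s) found: ", paste(geo_cols, collapse = ", "),
    "; consider loading geoarrow before collecting."
  )
}
```

With a real dataset this would presumably be called as `geo_columns(nc$metadata$geo)`, or `geo_columns(nc$metadata[[1]])` as above if the key name is in doubt.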
@paleolimbot
Collaborator

First, just a note that a rewrite is in progress and should be available in January! The new package currently lives here: https://github.com/geoarrow/geoarrow-c/tree/main/r/geoarrow but may get moved to a less confusing location (like geoarrow/geoarrow-r). I'm just in the process of working with the extension type registration ( geoarrow/geoarrow-c#85 ) so this is well-timed!

The issue of automatic loading is a tricky one...the arrow package probably shouldn't load arbitrary packages (as in, if we somehow encoded "r_pkgs" in the metadata or something), and while it could special-case the geoarrow package once it is on CRAN, special-casing things can become unwieldy.

In any case, the first step is geoarrow on CRAN 🙂 ...I'm working on it!

@mrworthington

The new package currently lives here: https://github.com/geoarrow/geoarrow-c/tree/main/r/geoarrow but may get moved to a less confusing location

Hi Dewey + Team! Trying to play with geoarrow for a project, but am not finding the new package you referenced. I clicked on the link above, but it just shows a "404 page not found" error. Hoping to use it in combination with open_dataset() on a shiny app I'm spinning up! For context, I've installed this current version of {geoarrow-r}, but am assuming this is not the one that you want people to be using.

@kylebutts
Author

@mrworthington This pull request suggests they were moved back to this repo 2 weeks ago: geoarrow/geoarrow-c#89

@paleolimbot
Collaborator

This is indeed the version that I'd like people to be using; however, it is missing the read_geoparquet_sf() function ( #30 ). If you need the previous version, I tagged it as 0.1.0.

Development did start out in geoarrow-c, but ultimately I found that it made more sense to keep it on its own (hence, geoarrow-r!).

@jaredlander

jaredlander commented Feb 5, 2024

This is indeed the version that I'd like people to be using; however, it is missing the read_geoparquet_sf() function ( #30 ). If you need the previous version, I tagged it as 0.1.0.

Going forward, am I correct that we won't need read_geoparquet_sf() and can just use read_parquet() instead? And if so, will the result automatically become an sf object? Currently, with version 0.1.0.9000, I have to run read_parquet('file.parquet') |> geoarrow:::st_as_sf.Dataset() because if I don't use geoarrow:::st_as_sf.Dataset() I get the following error:

Error in st_geometry.sf(x) : 
  attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?

@paleolimbot
Collaborator

If you are only reading/writing Parquet files in R (with geoarrow loaded) and/or Python (after import geoarrow.pyarrow), you can just use write_parquet() and read_parquet(). This is not GeoParquet...it's just regular Parquet with extension types. This means that something like GDAL won't be able to understand it (yet) and uploading it to a cloud data warehouse won't work. The upside of not using GeoParquet is that more arrow tools work out-of-the-box (e.g., multi-file datasets via write_dataset()/open_dataset() in R or Python).

If you need to read with GDAL or some other tool, I would recommend using geoarrow::read_geoparquet_sf() (or geoarrow::read_geoparquet()) and geoarrow::write_geoparquet() going forward; however, I still have to finish the implementation (#30).

if I don't use geoarrow:::st_as_sf.Dataset() I get the following error:

I think you might want read_parquet(f, as_data_frame = FALSE) + st_as_sf(). I think the problem is that sf doesn't know that a lazy geoarrow column is "geometry". Eventually it probably will but the details of that are complicated and for now you'll have to help it.
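
Spelled out, that suggestion might look like the sketch below. The wrapper name `read_parquet_as_sf()` and the file path are placeholders rather than part of any package, and it assumes arrow, geoarrow and sf are all installed, with geoarrow loaded so that its sf methods are registered.

```r
library(arrow)
library(geoarrow)  # assumed to register the geoarrow types and sf methods
library(sf)

# Hypothetical wrapper around the two steps suggested above
read_parquet_as_sf <- function(f) {
  # Keep the result as an Arrow Table instead of converting to a data frame
  tab <- read_parquet(f, as_data_frame = FALSE)
  # Convert the Table, geometry column included, to an sf object
  st_as_sf(tab)
}

# Usage with a placeholder path:
# nc <- read_parquet_as_sf("file.parquet")
```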

@jaredlander

Thanks for the info!

If you need to read with GDAL or some other tool, I would recommend using geoarrow::read_geoparquet_sf() (or geoarrow::read_geoparquet()) and geoarrow::write_geoparquet() going forward; however, I still have to finish the implementation (#30).

So it sounds like geoarrow::write_geoparquet() and friends are coming back? So I can install with renv::install('geoarrow/[email protected]') which gets me 0.1.0 instead of renv::install('geoarrow/geoarrow-r') which gets me 0.1.0.9000?

I'm using this to write parquet files to map with geoarrow/deck.gl layers (as opposed to pmtiles). The README says

Pass -lco GEOMETRY_ENCODING=GEOARROW when converting to Arrow or Parquet files in order to store geometries in a GeoArrow-native geometry column.

Likewise, this post says

Notice the GEOMETRY_ENCODING=GEOARROW? This file isn't quite valid GeoParquet, at least as of version 1.0, because it stores geometries in the efficient Arrow-native encoding instead of as WKB geometries.

This is needed for now because parquet-wasm doesn't have a way to parse WKB geometries into Arrow-native geometries. (A @geoarrow/geoparquet-wasm library is likely to be published by the end of 2023 that will parse any GeoParquet file and load it to GeoArrow.)

So I'm guessing I need to use geoarrow::write_geoparquet()? Which I get using 0.1.0, correct?

I think you might want read_parquet(f, as_data_frame = FALSE) + st_as_sf(). I think the problem is that sf doesn't know that a lazy geoarrow column is "geometry". Eventually it probably will but the details of that are complicated and for now you'll have to help it.

Yep, that fixed it!

@paleolimbot
Collaborator

So it sounds like geoarrow::write_geoparquet() and friends are coming back?

Yes! With proper conformance to the 1.0.0 spec. The 1.0.0 spec doesn't include GeoArrow as an encoding option - it's WKB only - and there's some debate over whether it should be there in the first place.

So I'm guessing I need to use geoarrow::write_geoparquet()? Which I get using 0.1.0, correct?

I actually have no idea. I think maybe write_parquet() will work, but you might have to explicitly tell it to use interleaved coordinates. Off the top of my head I forget exactly how to do that but I'll try to circle back with an example.
