Mesh Representation #155
In the long run, I think it would make most sense to switch to NetCDF, but provide a smooth transition, both on the actual model side and on the analysis side. Such a fundamental change may require us to signal it to everyone with a version number change (I guess a minor one, both for FESOM itself and for pyfesom2). We should also include in this discussion anyone who is actively generating new meshes (e.g. the paleo guys, who want exotic things like the Cretaceous), since this will affect many workflows.
One more plus of the plain-text format is that one can begin to analyse the meshes before the first run is done. Still not clear to me whether we should use pyfesom2.function(data, mesh) or pyfesom2.function(data), with the mesh glued to the data in the second case. If we go for the hybrid model, I think the first variant is still the way to go? But using a modified mesh when a region is cut out, as implemented by @suvarchal, is also very tempting :) Metadata with the locations of meshes is a good idea. I would keep them on the AWI GitLab and have a link to a concrete hash. Not sure if this should be a namelist option or something set automatically inside FESOM2 (I currently lean towards the namelist).
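The two calling conventions above can be sketched side by side. This is a minimal illustration, not the actual pyfesom2 interface: the function names, the dict stand-ins for the data/mesh objects, and the node count are all hypothetical.

```python
# Sketch of the two candidate APIs: mesh passed explicitly vs. mesh "glued"
# to the data object. Plain dicts stand in for xarray/pyfesom2 objects.

def some_analysis(data, mesh):
    """Variant 1: pyfesom2.function(data, mesh) -- mesh passed explicitly."""
    return {"n_points": len(data), "n_nodes": mesh["n2d"]}

def some_analysis_attached(data):
    """Variant 2: pyfesom2.function(data) -- mesh carried on the data."""
    return {"n_points": len(data["values"]), "n_nodes": data["mesh"]["n2d"]}

mesh = {"n2d": 126858}  # CORE2-like node count, purely for illustration
explicit = some_analysis([0.5, 1.2, 0.7], mesh)
attached = some_analysis_attached({"values": [0.5, 1.2, 0.7], "mesh": mesh})
assert explicit == attached  # both variants see the same information
```

Either way the same information reaches the function; the question is only where the mesh travels.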
@pgierz, can you ping some paleo people so that they can express their concerns/wishes here?
Done via email, since I am unsure whether everyone in the paleodyn group uses GitHub. I will copy/paste answers here if need be.
I would suggest Paul's option C: implement the ability to read in mesh information via NetCDF, but also keep the traditional text-based format usable. NetCDF is a bit easier to handle for the average modeller today. Maybe NetCDF will also make it easier to do tricks like activating or deactivating mesh nodes with relatively simple script-based methods? Just a guess; maybe the text-based gurus know more about that.
then, afterward, wonder why your file is suddenly messed up, because you swapped a double quote for a single quote and this induces chaos. So, one vote plus for NetCDF. (Or two: I vote that way as well.)
Yes, there are sed/grep/awk. But, traditionally, I prefer the ability to do things like cdo setcindexbox, maskindexbox, etc. as a more convenient method ;o) If one goes for NetCDF as an additional pathway for the mesh description anyway, maybe it would then be easy (or at least easier) to create the traditional "fx" files that help users identify general metrics of their model setup, like CMIP6.CMIP.AWI.AWI-ESM-1-1-LR.piControl.r1i1p1f1.fx.areacella.gn, but for the ocean.
Christian, could you share an example?
Here it is for three different fx-variables that are available from CMIP6 for the AWI-CM-1-1-MR piControl simulation:
For completeness, here are the grid description files that are used to work with CDO (generated by https://github.com/FESOM/spheRlab from @helgegoessling): core2_griddes_nodes.nc and core2_griddes_elements.nc.
Totally agree with @pgierz to go with the hybrid option. One other reason: text files are less effective for lazy loading of the mesh, much needed for HR simulations, at least in the form currently used. After slight modifications to the data format, mostly adding the sizes/shapes of the data to the header while keeping the file structure the same (aux3d.out, for instance), dask could be used (introducing a strict dependency) to read the data in a slightly better way, but this will still be less effective than using NetCDF. In fact, by changing the text format we would be doing a cheap emulation of NetCDF by adding metadata at the top. The hybrid option, as I understand it, is to use text files for previous versions and mesh.diag.nc for current versions. In the case of text files, the library can make the second use faster by keeping a NetCDF version in a cache, preferably per user, say a pyfesom2 cache. I have some working code along these lines using zarr, but it was not stable across versions of the underlying, heavily used fsspec; it should be straightforward to adapt this to NetCDF, which is expected to be stable. The size of mesh.diag.nc should have little influence on its analysis; nevertheless, it can be stripped on first use and also stored in the above cache.
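The per-user cache idea above can be sketched as follows. This is a dependency-free illustration: pickle stands in for the NetCDF/zarr cache format, the cache layout and function names are hypothetical, and only the nod2d.out-style text layout (a count line, then "index lon lat flag" per node) follows the actual FESOM convention.

```python
# Sketch: parse a plain-text mesh once, store a binary copy in a per-user
# cache keyed by the mesh path, and serve later sessions from the cache.
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached_mesh(mesh_dir: Path, cache_root: Path):
    key = hashlib.sha1(str(mesh_dir.resolve()).encode()).hexdigest()[:12]
    cache_file = cache_root / f"mesh-{key}.pkl"
    if cache_file.exists():                      # second use: fast path
        return pickle.loads(cache_file.read_bytes())
    # first use: parse the slow text representation
    lines = (mesh_dir / "nod2d.out").read_text().splitlines()
    n2d = int(lines[0])
    lonlat = [tuple(map(float, ln.split()[1:3])) for ln in lines[1 : n2d + 1]]
    mesh = {"n2d": n2d, "lonlat": lonlat}
    cache_root.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(pickle.dumps(mesh))   # persist for next time
    return mesh

# tiny self-contained demo with a two-node mesh
tmp = Path(tempfile.mkdtemp())
(tmp / "nod2d.out").write_text("2\n1 0.0 50.0 0\n2 1.5 51.0 0\n")
first = cached_mesh(tmp, tmp / "cache")
second = cached_mesh(tmp, tmp / "cache")        # served from the cache
assert first == second and first["n2d"] == 2
```

In a real implementation the cache file would be NetCDF (or zarr) and live under a per-user directory such as `~/.cache/pyfesom2`, and the key would probably also hash file timestamps so a changed mesh invalidates the cache.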
I think in our case we are better off keeping the mesh info external to the data files, to keep those files lean. The idea to publish to a data repo and include a remote path to the mesh is cool, and I think it enables a few applications. This should be done anyway for standard configurations like CMIP etc. Nevertheless, I guess this can't be assumed for every FESOM2 data file, because publishing to DKRZ or wherever needs an additional post-processing step; that is easy, but it is still an extra step. As for the remote data format, NetCDF files are not the best choice for laziness, because behind the scenes the library will inevitably have to download and load the data (Zenodo, DKRZ, etc.). If the data server supports range requests/streaming access we are better off with a custom binary file or even a text file, but this puts requirements on the hosting server. If there is an OPeNDAP server backing the data then that is good and we can use NetCDF, but I don't see one without hosting it ourselves. But say we use DKRZ or Zenodo: the best strategy is zarr; the only drawback is that we have to predetermine the optimum chunks, which also affects subsequent analysis using the data. This is not necessarily bad, just something to be aware of: we are suggesting a strategy for using the data. An example using zarr at DKRZ is https://nbviewer.jupyter.org/github/FESOM/pyfesom2/blob/master/notebooks/remote_datasets.ipynb; note how quickly the library is ready even with large meshes.
Just a note: unfortunately this info is not enough for meaningful visualization of the data (or even to regrid it), for which we need info on the faces/triangles. @helgegoessling's griddes looks more complete to me. Another version of it is here: https://swiftbrowser.dkrz.de/public/dkrz_035d8f6ff058403bb42f8302e6badfbc/pyfesom2/cmip6-grids/netcdf/ (zarr versions of these are part of the remote data paths included in pyfesom2). Nonetheless, I think this should be addressed in ESGF by including at least the faces/triangle information at some point, so that people outside AWI can use AWI-CM's CMIP6 data (@tsemmler05, thoughts?).
To me there may be less conflict between the approaches than it seems, and maybe it is partly a matter of elegance: since mesh and data come from different file sources in any case, we can argue both ways. The accessor part you mentioned was just the convenience of having mesh and data in one place and giving a cleaner implementation -- after all, they belong together. I would still go for keeping both arguments in the functions, mainly because I don't know the extent to which the mesh object, its methods, and its attributes are used in analysis functions, and I don't know whether those attributes can become part of an xarray dataset or be cast as data arrays; moreover, I think it may be better to keep the existing structure in the interest of incremental updates. With #153 I am also curious to find out the other uses of the existing mesh object. If we keep the both-arguments strategy, an option to support both use cases could be to add a simple pre-condition to existing functions like:
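One way such a pre-condition could look (the original snippet is a sketch; the function names and the dict stand-ins for the xarray dataset and the pyfesom2 mesh object are illustrative):

```python
# Sketch: analysis functions keep their (data, mesh) signature but fall back
# to mesh information carried on the data object when no mesh is passed.

def _resolve_mesh(data, mesh=None):
    """Pre-condition: use the explicit mesh if given, else look on the data."""
    if mesh is not None:
        return mesh
    try:
        return data["mesh"]          # mesh glued to the data (accessor-style)
    except (KeyError, TypeError):
        raise ValueError("pass a mesh explicitly or use a merged dataset")

def mean_over_nodes(data, mesh=None):
    mesh = _resolve_mesh(data, mesh)
    return sum(data["values"]) / mesh["n2d"]

mesh = {"n2d": 4}
merged = {"values": [1.0, 2.0, 3.0, 4.0], "mesh": mesh}
# both call styles give the same result
assert mean_over_nodes(merged) == mean_over_nodes({"values": merged["values"]}, mesh)
```

This keeps the existing two-argument signature intact while quietly supporting the merged-dataset style.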
Simple glue code to make a fake mesh-like object from a merged xarray dataset could be:
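For instance, something along these lines (a sketch: a dict stands in for the xarray dataset, and while the attribute names x2/y2/n2d follow the pyfesom2 mesh convention, the wrapper itself is hypothetical):

```python
# Sketch: wrap the variables of a merged dataset in an object exposing the
# attribute names that existing pyfesom2 functions expect on a mesh object.

class MeshProxy:
    def __init__(self, dataset):
        self._ds = dataset

    def __getattr__(self, name):
        # forward attribute access (mesh.x2, mesh.n2d, ...) to dataset keys
        try:
            return self._ds[name]
        except KeyError:
            raise AttributeError(name)

merged = {"x2": [0.0, 1.5], "y2": [50.0, 51.0], "n2d": 2, "temp": [3.1, 2.9]}
mesh = MeshProxy(merged)
assert mesh.n2d == 2 and mesh.x2 == [0.0, 1.5]
```

Existing functions that only read mesh attributes would then accept the proxy unchanged; methods of the real mesh class would still be missing, which is exactly the unknown mentioned above.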
I also think the namelist option makes more sense, but we need to ensure the mesh was actually uploaded, and handle the GitLab credentials used for uploading. It occurs to me that this can be handled elegantly when esm_tools is used to run FESOM2.
In #153, we are discussing how to prepare the pyfesom2 analysis tools for large data. One particular need arises here: how to handle (in the future) the mesh representation on the Python side.
Several options exist, as far as I can see (please feel free to edit the main issue and add more info if anyone has something to add):
A. We continue with plain text files.
Pro: It's already there, many users are used to it, and we get to feel retro and old-school, as if we are still in the 80s when computing was still the wild west.
Con: ...it's plain text, feels rather old-school, and is as if we are still in the wild west. There are smarter ways of doing it by now.
B. Switch to a NetCDF Format
Pro: We can self-document the whole mesh in the file itself: we know which part of the file (now a NetCDF file rather than plain text) holds the nodes, lon/lat, ocean/coastline, etc.
Con: We need user migration. This would also imply a deeper switch inside FESOM itself.
C. Hybrid Model
(Paul's preference right now)
We allow the user to read in a "plain text" mesh and check whether a NetCDF version is already there. If not, we make one on the fly for the next time around. This would speed up user migration to the new format.
Pro: We could, for a time, support both plain text and new netcdf meshes.
Con: It might take a bit of programming flexibility, but if we decide on a strategy, I am happy to volunteer for the dirty part of that work.
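Option C's read path can be sketched like this. To keep the sketch dependency-free, JSON stands in for NetCDF; the file names, the converter, and the nod2d.out-style layout shown are illustrative, not an existing FESOM2/pyfesom2 convention.

```python
# Sketch of the hybrid reader: prefer a converted mesh file if one already
# sits next to the text files, otherwise parse the text mesh and write the
# converted version "on the fly" for the next time around.
import json
import tempfile
from pathlib import Path

def load_mesh(mesh_dir: Path):
    converted = mesh_dir / "mesh.converted.json"     # stand-in for mesh.nc
    if converted.exists():
        return json.loads(converted.read_text()), "converted"
    lines = (mesh_dir / "nod2d.out").read_text().splitlines()
    n2d = int(lines[0])
    mesh = {
        "n2d": n2d,
        "lon": [float(ln.split()[1]) for ln in lines[1 : n2d + 1]],
        "lat": [float(ln.split()[2]) for ln in lines[1 : n2d + 1]],
    }
    converted.write_text(json.dumps(mesh))           # one-off conversion
    return mesh, "text"

# demo: first read parses text and converts, second read hits the new file
tmp = Path(tempfile.mkdtemp())
(tmp / "nod2d.out").write_text("2\n1 0.0 50.0 0\n2 1.5 51.0 0\n")
_, first = load_mesh(tmp)
_, second = load_mesh(tmp)
assert (first, second) == ("text", "converted")
```

One design point worth settling early: whether the converted file lives next to the text files (simple, but the mesh directory may be read-only on shared systems) or in a per-user cache.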
One more thought:
I would, however, argue against including all the mesh information directly in the model output (this was something I had wished for earlier, but it would explode the file size). Rather, I suggest we find some way to publish all of our meshes publicly (if that is feasible) and provide metadata in the output files saying where the mesh can be accessed (via FTP, Git LFS, DKRZ Swift, whatever). We already have the mesh path in one of the namelists; it should not be too hard to dump an extra line of metadata into the files when writing output. Then, when a user loads a particular dataset, whatever load function we have can check for this and also fetch the accompanying information needed for other operations.
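The load-side half of that idea can be sketched as follows. Everything here is hypothetical: the global-attribute name ("mesh_path"), the resolver, and the example URL are illustrative, not an existing FESOM2/pyfesom2 convention.

```python
# Sketch: the model writes the mesh location into the output file's global
# attributes; the load function checks for it and fetches the mesh alongside
# the data. Dicts stand in for the NetCDF attributes and the fetched mesh.

def load_dataset(output_attrs, fetch):
    """output_attrs: global attributes of an output file;
    fetch: callable that retrieves a mesh given its published location."""
    mesh_ref = output_attrs.get("mesh_path")
    mesh = fetch(mesh_ref) if mesh_ref else None
    return {"attrs": output_attrs, "mesh": mesh}

# pretend registry of published meshes (GitLab/DKRZ Swift/FTP in practice);
# pinning a hash in the reference identifies the exact mesh revision
registry = {"https://example.invalid/meshes/core2@abc123": {"n2d": 126858}}
ds = load_dataset(
    {"mesh_path": "https://example.invalid/meshes/core2@abc123"},
    registry.get,
)
assert ds["mesh"] == {"n2d": 126858}
```

Output files without the attribute would simply fall back to today's behaviour of passing a mesh path by hand, so the change stays backwards compatible.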