Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update history, refactor sdba processing functions, add unit handling in sdba #801

Merged
merged 23 commits into from
Aug 24, 2021

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented Aug 19, 2021

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
  • Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
  • bumpversion (major / minor / patch) has been called on this branch
  • Tags have been pushed (git push --tags)
  • The relevant author information has been added to .zenodo.json

What kind of change does this PR introduce?

  • Moves unit handling sutff from [sdba] Optimizations and fixes #791 to here. Inputs of train and adjust methods are converted to common units. ref is used as the reference. Add unit handling in processing : adapt_freq, jitter_under/over_thresh, etc. Add unit handling to the Detrend objects.
  • Splits user-facing and metadata-handling function from their computational counterparts when map_blocks is used. This applies to processing.py. This also changed the signature of the concerned functions, so that they get an explicit one.
  • Generalize the setting of a xclim_history attribute through a new @update_xclim_history decorator. Decorated functions will have a correctly formatted xclim_history attribute added to their first output.

Does this PR introduce a breaking change?

Yes. Signature of adapt_freq, jitter_under_thresh and others has changed. All variables are given as DataArrays (no more dataset as input) and threshold are given as quantities, so strings, not numbers.

Other information:

We have a naming problem with the xclim_history attribute. Currently, the update_history function will take the inputs and merge their history attributes. Then we take it and put it in xclim_history. If a variable has been modified before by xclim, the xclim_history attribute is overwritten, not appended. Should we go back to history to make it simpler? @Zeitsperre @huard This applies to the indicators too!

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@aulemahal aulemahal requested a review from huard August 19, 2021 22:41
@aulemahal aulemahal added this to the v0.29 milestone Aug 19, 2021
@aulemahal aulemahal changed the title [sdba] Refactor processing function + add unit handling [sdba] Refactor processing functions + add unit handling Aug 19, 2021
@huard
Copy link
Collaborator

huard commented Aug 20, 2021

Remind me why we decided to have xclim_history in the first place.

Copy link
Collaborator

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are you that our test coverage would have picked potential errors introduced in this PR?
From experience, it's easy to break unit handling, and tests were not initially covered to check this. You added unit related checks, but how tight are we overall ?

@aulemahal
Copy link
Collaborator Author

We decided to use xclim_history because CF conventions say that history is a dataset attribute, but we are always dealing with DataArrays.

As for the tests, I did add more, but you are right that it's always a tricky question. Failures and bugs in the attributes will not always trigger test failures... I plan to add a bit more.

@aulemahal
Copy link
Collaborator Author

@Zeitsperre I'd like to hear your opinion on a change from xclim_history to history!

I think this PR as revealed a faille in our logic of being CF-orthodox and not using history : If xclim performs successive actions on a variable, only the latest is recorded in the xclim_history attribute. Moreover, it might append the input's history, as if that latest action was the only one.
I would suggest going back to history as it is the one that makes the most sense and is understood by the largest crowd. And it would solve the bug. It would be only slightly non-CF, as the history attribute is not flagged as a variable attribute by the conventions.

@Zeitsperre
Copy link
Collaborator

@Zeitsperre I'd like to hear your opinion on a change from xclim_history to history!

I think this PR as revealed a faille in our logic of being CF-orthodox and not using history : If xclim performs successive actions on a variable, only the latest is recorded in the xclim_history attribute. Moreover, it might append the input's history, as if that latest action was the only one.

This is true. We should be at the very least appending some general summary of actions taken to the history attribute. If we want to revisit this, my feeling is that the xclim_history should be used for debugging purposes, IE: add the call signatures used, xclim version, whether the data was Indicator-compliant. This should have some internal provenance-like information that the general user doesn't need (and doesn't know to look for).

I would suggest going back to history as it is the one that makes the most sense and is understood by the largest crowd. And it would solve the bug. It would be only slightly non-CF, as the history attribute is not flagged as a variable attribute by the conventions.

From my understanding, because it isn't flagged as a variable attribute means that we are free to do it. Try adding a history and xclim_history to a variable in a NetCDF and run it through a CF-Checker. there shouldn't be any errors.

@Zeitsperre
Copy link
Collaborator

Very small point, but the Notes docstring for NpdfTransform could use a small fix while you're fixing things up: https://xclim.readthedocs.io/en/stable/sdba_api.html#xclim.sdba.adjustment.NpdfTransform

@aulemahal
Copy link
Collaborator Author

So I changed "xclim_history" to "history" throughout xclim!
@tlogan2000 (so you're in the loop too).

This is not super-breaking, but kinda.
The output of an indicator computation before:

xclim_history : [2021-07-07 18:46:28] growing_degree_days: GROWING_DEGREE_DAYS(tas=<array>, thresh='10.0 degC', freq='YS') - xclim version: 0.28.0.

now:

history :  [2021-08-23 14:05:45] growing_degree_days: GROWING_DEGREE_DAYS(tas=tas, thresh='10.0 degC', freq='YS') - xclim version: 0.28.2-beta.

If the indicator was from a virtual module, the module name would be appended in the call string (ex: icclim.CD(...)).

Applies also to ensembles percentiles and stats, percentile_doy, and a few others.

@aulemahal aulemahal changed the title [sdba] Refactor processing functions + add unit handling Update history, refactor sdba processing functions Aug 23, 2021
@aulemahal
Copy link
Collaborator Author

@Zeitsperre
The CF-checker returns the following on the file produced by usage.ipynb:

CHECKING NetCDF FILE: /tmp/27024.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 77 (2021-01-19T13:38:50Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)

WARN: (2.6.1): No 'Conventions' attribute present

------------------
Checking variable: time
------------------
WARN: (3): No standard_name or long_name attribute specified

------------------
Checking variable: lat
------------------

------------------
Checking variable: lon
------------------

------------------
Checking variable: growing_degree_days
------------------
INFO: attribute history is being used in a non-standard way
WARN: (7.3): Coordinate variable time should have bounds or climatology attribute
WARN: (7.3): Coordinate variable time should have bounds or climatology attribute

ERRORS detected: 0
WARNINGS given: 4
INFORMATION messages: 1

@Zeitsperre
Copy link
Collaborator

That's exactly what we should expect. LGTM.

@aulemahal aulemahal changed the title Update history, refactor sdba processing functions Update history, refactor sdba processing functions, add unit handling in sdba Aug 23, 2021
@aulemahal
Copy link
Collaborator Author

I'm turning crazy soon. The sdba notebook compiles with a failure on the first try, but passes on the second try...

@Zeitsperre
Copy link
Collaborator

I'm turning crazy soon. The sdba notebook compiles with a failure on the first try, but passes on the second try...

This actually sounds insane, by Einstein's definition anyway. I'll give it a try as well.

@Zeitsperre
Copy link
Collaborator

Works for me on the first try. It may have to do with the compute() call. Jupyter has a habit of running through to the next cell before asynchronous processes are finished. Could it be that?

@aulemahal
Copy link
Collaborator Author

aulemahal commented Aug 23, 2021

🤯 So I copied only this example in a new script and ran it 7 times. At the start, I deleted the xclim testing data and the script's output. Between trial 3 and 4, I modified the code to have some more info in the traceback (the algo was untouched).

  1. Fail
  2. Fail
  3. Pass
  4. Fail
  5. Pass
  6. Pass
  7. Fail

Funny, eh?

EDIT: The compute call was the last operation of the script, in order to test @Zeitsperre's hypothesis.

@Zeitsperre
Copy link
Collaborator

What is going on?! This is really strange behaviour. Any information from the trace back (is it different from the doctests build)?

@aulemahal
Copy link
Collaborator Author

aulemahal commented Aug 23, 2021

I could get the following info:

  • It always fails at iteration 0 of npdf_transform.
  • It doesn't fail for all chunks. There are three chunks in the data, so npdf_transform is executed thrice.
  • It always fails at line scensp = ADJ.adjust(simp, **kwargs["adj_kws"]) and shows that simp has no attributes at all, while histp and refp had some. And even though sim has some and xr.set_options(keep_attrs=True) was used.

Thus, it seems that, at random moments, the last matrix multiplication of npdf_transform doesn't respect the context under which it is executed and attrs are not copied.

EDIT: It may be obvious, but writing it to be sure : this only happens when dask is used,

@aulemahal
Copy link
Collaborator Author

Added an explicit setting of units to all ADJ inputs. And a comment to note that it is weird.
Before I had a 6/10 failure rate. Now I have a 0/3 failure rate. THUS, it works.

@aulemahal
Copy link
Collaborator Author

Now, that's more interesting, a real error! My bad.

@aulemahal aulemahal merged commit d888c0f into master Aug 24, 2021
@aulemahal aulemahal deleted the sdba-refac-processing branch August 24, 2021 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants