Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid hangs in async pony_check properties when using actions #4405

Merged
merged 2 commits into from
Aug 29, 2023

Conversation

mfelsche
Copy link
Contributor

@mfelsche mfelsche commented Aug 20, 2023

This should fix issue #4391 for very slow execution environments.

[update]

After first assuming it was a timeout issue, i.e. that a slow execution environment did not finish within the default 60 secs timeout.
It turned out to be a logical flaw in how the _PropertyRunner handles expecting and completing actions. It was possible to have stray actors from old property runs call into the _PropertyRunner instance that was already dealing with another run, thus disturbing the current run. The async Property handling with complete and fail already had such a safeguard, the expect_action and complete_action/fail_action calls did not yet have this.

[/update]

After investigating #4391 i was not able to find a logical flaw in the PropertyRunner or in the way the UnitTest timeout and the TestHelper.long_test() feature is being used by pony_check. After looking at the logs, the best explanation for the spurious failures is that the timeout of 1 minute per property by default, was just too low for the specific qemu execution environment. The log lines are all expected and there should be 4 of them per property execution for a default of 100 samples. But the logs are short some of those, so the test simply didn't finish in the given 60 seconds.

Hence this PR is changing the timeout for the tests to 5 minutes. There is no test actually checking for this timeout explicitly, so there will be no increase in testing time, coming from this change.

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Aug 20, 2023
@SeanTAllen
Copy link
Member

In _Async...

there is

    h.long_test(params.timeout)

If I read that correctly it sets the long test to X which starts before the properties test so the properties tests think they have X amount of time, but were started after the long test value for X. I think that is a logical flaw, and the long_test would do "X + some value" where X is params.timeout.

Am I incorrect in my understanding? That is the logical flaw I was speaking of in the issue.

@SeanTAllen
Copy link
Member

Logical error aside, I think this is a fine PR and if the logic error does exist, then it can be done in another PR.

@SeanTAllen SeanTAllen added the do not merge This PR should not be merged at this time label Aug 20, 2023
@SeanTAllen
Copy link
Member

I'm going to rerun the tests a few times to make sure we don't see the timeout we were seeing before.

@mfelsche
Copy link
Contributor Author

If I read that correctly it sets the long test to X which starts before the properties test so the properties tests think they have X amount of time, but were started after the long test value for X. I think that is a logical flaw, and the long_test would do "X + some value" where X is params.timeout.

Am I incorrect in my understanding? That is the logical flaw I was speaking of in the issue.

So, the call to h.long_test(params.timeout) immediately starts the timer, before the _PropertyRunner starts generating samples and actually executing the property. This delta is considerably small and inherent to how the runtime schedules the calls, and not controllable from the pony program itself. We might not have such a big delta if we'd handle the timeout in the _PropertyRunner itself, maybe checking it after each sample run and during shrinking? But honestly a timer seems like the best/most convenient option we got (as it is also able to interrupt the machinery if a generator or property takes very long), and there is already one in the pony_test harness, which is currently the way of running pony_check. I would consider it an improvement if the pony_check _PropertyRunner would handle the timeout itself, making it more independent from pony_test.
Summarizing, I would not call it a logical flaw, but a rather unprecise timeout. I can have a stab at this, but i think the added value is considerably small.

@mfelsche
Copy link
Contributor Author

Interestingly while I wrote all that, the error returned on a musl build, rendering some of my assumptions invalid it seems.

@SeanTAllen
Copy link
Member

ive never seen it timeout on "a real machine" (that I remember). That's genuinely surprising and has me rethinking assumptions that I had made.

@mfelsche
Copy link
Contributor Author

mfelsche commented Aug 23, 2023

I was able to reproduce the hang on a freebsd machine with

./build/debug/ponyc --debug packages/pony_check && cpuset -l3 ./pony_check --sequential

It does not happen often, but sometimes.

The log was:

**** FAILED: property_runner/async/complete_multi_fail_action
[PROPERTY] Action expected: succeed-once
[PROPERTY] Action completed: succeed-once
[PROPERTY] Action expected: succeed-once
[PROPERTY] Action failed: succeed-once
[PROPERTY] Action completed: succeed-once
[PROPERTY] action 'succeed-once' finished unexpectedly. ignoring.
[PROPERTY] Action failed: succeed-once
[PROPERTY] action 'succeed-once' finished unexpectedly. ignoring.
Test timed out without completing

Investigating...

@mfelsche mfelsche changed the title Increase ponycheck test timeout. Fix hang pony_check when using asnyc properties and actions Aug 27, 2023
@SeanTAllen
Copy link
Member

I saw you pushed an update to this @mfelsche, any progress? Got fun debugging stories?

@mfelsche
Copy link
Contributor Author

@SeanTAllen Well, as it turns out, the narrator is always right. It was a logical flaw in the pony_check implementation. See the update above in the PR description.

@mfelsche mfelsche added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Aug 27, 2023
@ponylang-main
Copy link
Contributor

Hi @mfelsche,

The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 4405.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@mfelsche mfelsche changed the title Fix hang pony_check when using asnyc properties and actions Fix hang in pony_check when using asnyc properties and actions Aug 27, 2023
@SeanTAllen
Copy link
Member

@mfelsche do we still need the timeout changes?

@SeanTAllen
Copy link
Member

@mfelsche can you draft a file commit message and post here so we can use it when we squash and merge this?

@SeanTAllen
Copy link
Member

Discussed during sync and @jemc thinks that if we don't need the timeout changes, they should be removed.

and with that hopefully terminate the occasional hangs.
@mfelsche mfelsche force-pushed the pony_check/timeout branch from 56837f3 to 30ebd5b Compare August 29, 2023 20:34
@mfelsche
Copy link
Contributor Author

mfelsche commented Aug 29, 2023

The timeout changes have been removed.

Here is a commit message for the squash:

Avoid hangs in async pony_check properties when using actions

It was possible for calls from stray actors from previous executions of the 
same property to call into the current execution of the property 
(in `_PropertyRunner`) and thus confuse  the current run.

This is fixed now. Each property run, and also the property executions during 
shrinking, now is uniquely identified, and all calls using actions now need to
provide those unique ids. Inside the `_PropertyRunner` we now check if the call
we receive is from the current expected run or from a previous one, and discard
the call in the latter case.

There is no user-facing API change for the `pony_check` package coming from
this.

@SeanTAllen SeanTAllen changed the title Fix hang in pony_check when using asnyc properties and actions Fix hang in pony_check when using async properties and actions Aug 29, 2023
@SeanTAllen SeanTAllen removed do not merge This PR should not be merged at this time discuss during sync Should be discussed during an upcoming sync labels Aug 29, 2023
@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Aug 29, 2023
@SeanTAllen SeanTAllen changed the title Fix hang in pony_check when using async properties and actions Avoid hangs in async pony_check properties when using actions Aug 29, 2023
@SeanTAllen SeanTAllen merged commit c97337a into main Aug 29, 2023
@SeanTAllen SeanTAllen deleted the pony_check/timeout branch August 29, 2023 22:54
@ponylang-main ponylang-main removed the discuss during sync Should be discussed during an upcoming sync label Aug 29, 2023
github-actions bot pushed a commit that referenced this pull request Aug 29, 2023
github-actions bot pushed a commit that referenced this pull request Aug 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants