Veritech testing ground #5566

Draft pull request: 12 commits to merge into base branch main

nickgerace (Contributor) commented Feb 27, 2025

Description

This PR has become a testing ground for preventing veritech from failing to process existing work and to take on new work.

@nickgerace

/try

github-actions bot commented Feb 27, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

github-actions bot commented Feb 27, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.


@nickgerace

/try

github-actions bot commented Mar 6, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

Signed-off-by: Nick Gerace <[email protected]>
@nickgerace

/try

github-actions bot commented Mar 7, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@nickgerace

/try

github-actions bot commented Mar 7, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@nickgerace

/try

github-actions bot commented Mar 7, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@nickgerace

/try

github-actions bot commented Mar 8, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

nickgerace added a commit that referenced this pull request Mar 11, 2025
This commit adds a heartbeat app with force reconnection abilities to
veritech, adds naxum metrics and tunes metrics and telemetry everywhere.

Many of the changes in this commit come from the testing grounds PR
(#5566), but not all changes were included. The "pause-resume" stream
wrapper and client hot-swapper changes are either not ready for
production use or are not tactics we should employ for one reason or
another.

For veritech, the new heartbeat app is disabled in tests (by default)
and enabled in all other scenarios (by default). The heartbeat app
publishes messages via core NATS on a cadence and works with subject
prefixes. It uses the instance ID to determine the destination subject.

Metrics have been added to track pool exhaustion.

For naxum, metrics have been added for the core service and in
middleware. Tracing has been tuned to be as noiseless as possible while
making use of metrics where possible. These changes include tracking
failures for "double ack failure" and standardizing maintain progress
task logging.

The "circuitbreaker" with 4 max messages has been removed as a result of
these changes. Not only is it an anti-pattern for handling NACK'd
messages, but it also masks potential poison pill messages for
consumers who allow for more than 4 re-deliveries.

Finally, the graceful shutdown time is now tunable for veritech and its
default has moved from 6 hours to 20 minutes.

Signed-off-by: Nick Gerace <[email protected]>
nickgerace added a commit that referenced this pull request Mar 11, 2025
This commit adds a heartbeat app with force reconnection abilities to
veritech, adds naxum metrics and tunes metrics and telemetry everywhere.

Many of the changes in this commit come from the testing grounds PR
(#5566), but not all changes were included. The "pause-resume" stream
wrapper and client hot-swapper changes are either not ready for
production use or are not tactics we should employ for one reason or
another.

For veritech, the new heartbeat app is disabled in tests (by default)
and enabled in all other scenarios (by default). The heartbeat app
publishes messages via core NATS on a cadence and works with subject
prefixes. It uses the instance ID to determine the destination subject.

Metrics have been added to track pool exhaustion, publisher stats,
client stats, handlers doing work and more.

For naxum, metrics have been added for the core service and in
middleware. Tracing has been tuned to be as noiseless as possible while
making use of metrics where possible. These changes include tracking
failures for "double ack failure" and standardizing maintain progress
task logging.

The "circuitbreaker" with 4 max messages has been removed as a result of
these changes. Not only is it an anti-pattern for handling NACK'd
messages, but it also masks potential poison pill messages for
consumers who allow for more than 4 re-deliveries.

Finally, the graceful shutdown time is now tunable for veritech and its
default has moved from 6 hours to 20 minutes.

Signed-off-by: Nick Gerace <[email protected]>
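
The commit message above says the heartbeat app works with subject prefixes and uses the instance ID to determine the destination subject. A minimal sketch of that idea follows. The function name `heartbeat_subject`, the `veritech.heartbeat` subject root, and the token order are assumptions for illustration only, not the subject scheme veritech actually uses.

```rust
/// Builds the destination subject for a heartbeat message.
/// Hypothetical layout: `[<prefix>.]veritech.heartbeat.<instance_id>`.
fn heartbeat_subject(prefix: Option<&str>, instance_id: &str) -> String {
    match prefix {
        // NATS subjects are dot-separated tokens, so a non-empty prefix
        // simply becomes the leading token(s) of the subject.
        Some(p) if !p.is_empty() => format!("{p}.veritech.heartbeat.{instance_id}"),
        // No prefix (or an empty one): use the bare subject.
        _ => format!("veritech.heartbeat.{instance_id}"),
    }
}
```

Keeping the instance ID as the final token would let an observer subscribe to a single instance or, with a wildcard, to every instance under a prefix.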
nickgerace added a commit that referenced this pull request Mar 11, 2025
This commit adds a heartbeat app with force reconnection abilities to
veritech, adds naxum metrics and tunes metrics and telemetry everywhere.

Many of the changes in this commit come from the testing grounds PR
(#5566), but not all changes were included. The "pause-resume" stream
wrapper and client hot-swapper changes are either not ready for
production use or are not tactics we should employ for one reason or
another.

For veritech, the new heartbeat app is disabled in tests (by default)
and enabled in all other scenarios (by default). The heartbeat app
publishes messages via core NATS on a cadence and works with subject
prefixes. It uses the instance ID to determine the destination subject.

Metrics have been added to track pool exhaustion, publisher stats,
client stats, handlers doing work and more.

For naxum, metrics have been added for the core service and in
middleware. Tracing has been tuned to be as noiseless as possible while
making use of metrics where possible. These changes include tracking
failures for "double ack failure" and standardizing maintain progress
task logging.

The "circuitbreaker" with 4 max messages has been removed as a result of
these changes. Not only is it an anti-pattern for handling NACK'd
messages, but it also masks potential poison pill messages for
consumers who allow for more than 4 re-deliveries.

For NATS event callbacks, we will now add the "sid" in the label for
slow consumers and have removed the trace log.

Finally, the graceful shutdown time is now tunable for veritech and its
default has moved from 6 hours to 20 minutes.

Signed-off-by: Nick Gerace <[email protected]>
nickgerace added a commit that referenced this pull request Mar 11, 2025
This commit adds a heartbeat app with force reconnection abilities to
veritech, adds naxum metrics and tunes metrics and telemetry everywhere.

Many of the changes in this commit come from the testing grounds PR
(#5566), but not all changes were included. The "pause-resume" stream
wrapper and client hot-swapper changes are either not ready for
production use or are not tactics we should employ for one reason or
another.

For veritech, the new heartbeat app is disabled in tests (by default)
and enabled in all other scenarios (by default). The heartbeat app
publishes messages via core NATS on a cadence and works with subject
prefixes. It uses the instance ID to determine the destination subject.

Metrics have been added to track pool exhaustion, publisher stats,
client stats, handlers doing work and more.

For naxum, metrics have been added for the core service and in
middleware. Tracing has been tuned to be as noiseless as possible while
making use of metrics where possible. These changes include tracking
failures for "double ack failure" and standardizing maintain progress
task logging.

The "circuitbreaker" with 4 max messages has been removed as a result of
these changes. Not only is it an anti-pattern for handling NACK'd
messages, but it also masks potential poison pill messages for
consumers who allow for more than 4 re-deliveries.

For NATS event callbacks, all trace logging is gone and now everything
flows through metrics.

Finally, the graceful shutdown time is now tunable for veritech and its
default has moved from 6 hours to 20 minutes.

Signed-off-by: Nick Gerace <[email protected]>
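
The commit message above argues that a fixed cap of 4 messages hides poison pills for consumers that allow more than 4 re-deliveries. One way to picture the alternative is to decide from the consumer's own delivery limit, as in the sketch below. The `FailureAction` enum and `on_failure` function are hypothetical names, and this is not how naxum implements failure handling.

```rust
/// What a handler does with a message that failed processing.
#[derive(Debug, PartialEq)]
enum FailureAction {
    /// Negatively acknowledge so the server re-delivers the message.
    Nack,
    /// Treat the message as a poison pill: stop re-delivery and report it.
    Terminate,
}

/// Decides from the consumer's own re-delivery limit rather than a fixed cap.
/// `delivered` is the 1-based delivery attempt for this message;
/// `max_deliver` is the limit configured on the consumer (assumed known here).
fn on_failure(delivered: u64, max_deliver: u64) -> FailureAction {
    if delivered >= max_deliver {
        FailureAction::Terminate
    } else {
        FailureAction::Nack
    }
}
```

With a limit of 10, the fourth failed delivery is still NACK'd and retried. The message is only flagged once the consumer's own limit is reached, so the poison pill stays visible instead of being cut off at 4.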
nickgerace added a commit that referenced this pull request Mar 11, 2025
This commit adds a heartbeat app with force reconnection abilities to
veritech, adds naxum metrics and tunes metrics and telemetry everywhere.

Many of the changes in this commit come from the testing grounds PR
(#5566), but not all changes were included. The "pause-resume" stream
wrapper and client hot-swapper changes are either not ready for
production use or are not tactics we should employ for one reason or
another.

For veritech, the new heartbeat app is disabled in tests (by default)
and enabled in all other scenarios (by default). The heartbeat app
publishes messages via core NATS on a cadence and works with subject
prefixes. It uses the instance ID to determine the destination subject.

Metrics have been added to track pool exhaustion, publisher stats,
client stats, handlers doing work and more.

Upon shutdown, we will now explicitly cancel the shutdown token if the
stream has closed for the core app so that the kill app also closes.

For naxum, metrics have been added for the core service and in
middleware. Tracing has been tuned to be as noiseless as possible while
making use of metrics where possible. These changes include tracking
failures for "double ack failure" and standardizing maintain progress
task logging.

The "circuitbreaker" with 4 max messages has been removed as a result of
these changes. Not only is it an anti-pattern for handling NACK'd
messages, but it also masks potential poison pill messages for
consumers who allow for more than 4 re-deliveries.

For NATS event callbacks, all trace logging is gone and now everything
flows through metrics.

Finally, the graceful shutdown time is now tunable for veritech and its
default has moved from 6 hours to 20 minutes.

Signed-off-by: Nick Gerace <[email protected]>
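
The commit message above makes the graceful shutdown time tunable, with a default of 20 minutes instead of 6 hours. A minimal sketch of resolving such a setting is shown below. The constant name, the function name, and the idea of an override given in seconds are assumptions, not veritech's actual configuration surface.

```rust
use std::time::Duration;

/// Default graceful shutdown window: 20 minutes (previously 6 hours).
const DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT: Duration = Duration::from_secs(20 * 60);

/// Resolves the timeout from an optional override in seconds (hypothetical
/// setting), falling back to the default when no override is given.
fn graceful_shutdown_timeout(override_secs: Option<u64>) -> Duration {
    override_secs
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT)
}
```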
3 participants