Improve revised integration tests for local use by devs #17838

Open · wants to merge 3 commits into base: master
Conversation

@capistrant (Contributor) commented Mar 26, 2025

Description

The main goal of this work is to make running the revised ITs locally easier for devs, while not impacting the robustness of the ITs already running successfully in the GitHub Actions CI pipeline.

Reduced flakiness of cluster startup caused by a race condition in container startup (Druid vs. MySQL)

MySQL has to be up and accepting connections before the coordinator/overlord containers start; otherwise there is a risk of them trying to create tables and indexes before MySQL is healthy. When this happened on my local machine, it led to undesired behavior: services started up, failed to create metadata tables and indexes, and then complained about those tables not existing for the remainder of cluster uptime.

To accomplish this, I made two main changes:

  1. I employed a docker-compose healthcheck for MySQL that waits until the Druid database accepts connections before reporting healthy. I've been tinkering with the intervals, timeouts, etc., trying to find a balance between waiting too long in a hopeless scenario and quitting just before success.
  2. I modified the generated docker-compose files to be more specific in defining service dependencies. Dependencies now carry a condition that must be met before the dependent container starts; services that rely on MySQL are forced to wait until it is healthy (see the sketch after this list).
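
A minimal sketch of what the generated compose entries could look like, expressed as the Python dicts that template.py emits; the exact command, credentials, and timings below are illustrative assumptions, not the committed values:

```python
# Hypothetical healthcheck for the metadata (MySQL) service: report healthy
# only once the Druid database accepts a real connection. The credentials
# and timings here are placeholders.
metadata_healthcheck = {
    'test': ['CMD-SHELL', 'mysql -u druid -pdiurd -e "SELECT 1" druid'],
    'interval': '5s',
    'timeout': '5s',
    'retries': 10,
}

# Conditional dependencies for a Druid "master" service: ZooKeeper only
# needs to have started, but MySQL must be healthy before launch.
master_depends_on = {
    'zookeeper': {'condition': 'service_started'},
    'metadata': {'condition': 'service_healthy'},
}
```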

Reduced flakiness of the Security IT suite

Locally I was still seeing flakiness that seemed to be caused by authz policies not syncing in time for each test. I saw that one test had its "policy sync sleep" increased to 4 seconds while the others were left at 1. I centralized the sleep time in a static variable so we don't run into drift like this again. I also upped the sleep to 10 seconds, but am not wedded to that number if we want to try to squeeze it down some. After this change, I saw much more consistent success running the Security suite locally.

Improved UX for folks using machines with Apple Silicon

In the past, the docker-compose template had a commented-out platform key that told Apple Silicon users to uncomment it. This inevitably leads to folks on Apple chips getting fatigued with remembering that bit of information. Instead, I modified the template.py script that generates the docker-compose file to key off the host platform and set the proper override when running on ARM (see the sketch below).
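
A minimal sketch of the detection idea, assuming template.py can inspect the host at generation time; the function name and the exact platform string are illustrative, not the committed code:

```python
import platform

def arm_platform_override() -> dict:
    # On arm64 hosts (e.g., Apple Silicon), pin the service to an amd64
    # platform so x86 images run under emulation, instead of the user
    # having to remember to uncomment a key in the template.
    if platform.machine().lower() in ('arm64', 'aarch64'):
        return {'platform': 'linux/amd64'}
    return {}
```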

Migrated Query and MultiStageQueryWithMM

Query and MultiStageQueryWithMM were using their own static docker-compose files. This goes against the pattern of dynamically generating the compose file used by the other IT categories. I suspect it is a case of the tests never fully migrating from the "old" IT pattern to the "revised" pattern. After migrating them to the generated compose file pattern, I had no issues locally. Interested to see them run in CI.

These categories rely on Kafka, which pushed me to add Kafka as a core dependency generated in the compose file by template.py. I updated the categories that don't need Kafka to exclude it from their compose files going forward (see the sketch below).
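
A minimal sketch of how a per-category exclusion could work in template.py; the function, parameter, and image names here are hypothetical:

```python
KAFKA = 'kafka'

def add_core_services(compose: dict, exclude: list) -> None:
    # Kafka is generated as a core dependency unless the IT category
    # explicitly excludes it (e.g., suites that never touch streaming).
    if KAFKA not in exclude:
        compose.setdefault('services', {})[KAFKA] = {
            'image': 'apache/kafka',  # illustrative image reference
            'depends_on': {'zookeeper': {'condition': 'service_started'}},
        }
```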

Release note

N/A as this is test-facing only.

For Druid Devs: Improve the user experience running integration-tests-ex on your local machine.


Key changed/added classes in this PR
  • integration-tests-ex/cases/cluster/template.py
  • integration-tests-ex/cases/src/test/java/org/apache/druid/testsEx/auth/ITSecurityBasicQuery.java
  • integration-tests-ex/cases/cluster/Common/dependencies.yaml

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster in a local dev environment.

The goal of this work is to make running the revised ITs locally easier for devs while not impacting the robustness of ITs running in the GitHub Actions CI pipeline
```diff
@@ -74,6 +74,10 @@ public class ITSecurityBasicQuery
   public static final String USER_1_PASSWORD = "password1";
   private static final String EXPORT_TASK = "/indexer/export_task.json";

+  // Time in ms to sleep after updating role permissions in each test. This intends to give the
+  // underlying test cluster enough time to sync permissions and be ready when test execution starts.
+  private static final int SYNC_SLEEP = 10000;
```
A reviewer (Member) commented:
this will total 50s - but I guess if someone is bothered that these tests are slow, someone will take a look at that...

I personally like slow tests better than flaky ones :)


```diff
 def define_master_service(self, name, base) -> dict:
     '''
     Defines a "master" service: one which depends on the metadata service.
     '''
     service = self.define_druid_service(name, base)
-    self.add_depends(service, [ZOO_KEEPER, METADATA])
+    dep_dict = { ZOO_KEEPER: {'condition': 'service_started'}, METADATA: {'condition': 'service_healthy'}}
```
A reviewer (Member) commented:
I feel like the original call sites were communicating the intention better - it also feels like service_healthy and service_started are abstraction leakage.

it would be nice to get the full condition differently - I feel like that has more to do with what the service wants to depend on - so the condition parts like service_started could be in a map or something; and add_depends could look that up - that would also leave all the call sites as is.

Having the dependency-service startup conditions use a centralized lookup with a sane default simplifies individual IT cases and will drive consistency across all existing and future IT cases.
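
A minimal sketch of that centralized lookup, with hypothetical names (the committed code may differ): add_depends consults a module-level condition map with a service_started default, so call sites keep passing plain service names:

```python
ZOO_KEEPER = 'zookeeper'
METADATA = 'metadata'

# Default startup condition per dependency; anything not listed falls
# back to plain 'service_started'.
DEPENDENCY_CONDITIONS = {
    METADATA: 'service_healthy',
}

def add_depends(service: dict, dependencies: list) -> None:
    depends = service.setdefault('depends_on', {})
    for dep in dependencies:
        condition = DEPENDENCY_CONDITIONS.get(dep, 'service_started')
        depends[dep] = {'condition': condition}
```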
@kgyrtkirk (Member) left a comment:

looks great! thank you for the updates!

@capistrant marked this pull request as draft March 29, 2025 01:58
@capistrant (Contributor, Author) commented:

Marking as draft so this doesn't get merged over the weekend. I think I found an issue that could impact CI stability.

In my local env I've been hammering this docker-compose setup with ups and downs to test, and noticed intermittent bad behavior where the metadata container gets marked Error and docker-compose then fails on the next health check. At first I thought my health check or config was somehow bad, but after a bunch of banging on it I realized that the metadata container was restarting every time it started up. In the logs I found it complaining about the init.sql file it tries to run at startup: https://github.com/apache/druid/blob/master/integration-tests-ex/cases/cluster/Common/dependencies.yaml#L90 ... It turns out this mount is actually a directory, because the file doesn't exist locally and Docker's default is to create a directory when mounting a path that doesn't exist (which I guess makes sense). MySQL's startup does not like this being a directory and it crashes the server; somehow it works on the next startup though (probably because the seed step gets skipped?).

In https://github.com/apache/druid/blob/master/integration-tests-ex/cases/cluster.sh#L132 we are removing the db dir, with the comment above making it seem like this is the MySQL data dir, even though we don't configure that anywhere and MySQL is definitely using its own auto-managed volume for the data dir.

From the official MySQL Docker image docs:

> When a container is started for the first time, a new database with the specified name will be created and initialized with the provided configuration variables. Furthermore, it will execute files with extensions .sh, .sql and .sql.gz that are found in /docker-entrypoint-initdb.d. Files will be executed in alphabetical order. You can easily populate your mysql services by mounting a SQL dump into that directory and provide custom images with contributed data. SQL files will be imported by default to the database specified by the MYSQL_DATABASE variable.

Long story short, we can either provide a proper seed file even though it isn't doing anything, or we can just forget about it and nuke the extra MySQL volume altogether.

@capistrant (Contributor, Author) commented:

As for why this causes an issue now even though it has been around for a long time: we were never waiting on container health before, so docker-compose never cared that the container restarted. Now it sometimes notices and blows up. It is inconsistent on my local machine, maybe 10% of the time with the current code, but enough to cause CI nightmares if it also happens in GitHub Actions. The changes I have tested seem to completely eradicate the problem. I'll push one up shortly, after I decide which approach I like more.

…ql docker entrypoint script

The mysql docker container allows you to provide startup scripts for the database in this entrypoint directory. We were mounting a directory named like a file, which confuses MySQL's Docker startup and forces a restart on boot. This restart is now problematic since we are telling docker-compose to wait for MySQL health before starting services that depend on it. Since we aren't using the startup script mechanism at all, simply removing it seems like the cleanest strategy.
@capistrant marked this pull request as ready for review March 30, 2025 21:58