Improve revised integration tests for local use by devs #17838

Open · wants to merge 3 commits into base: master
Conversation

@capistrant (Contributor) commented Mar 26, 2025

Description

The main goal of this work is to make running the revised ITs locally easier for devs, while not impacting the robustness of the ITs already running successfully in the GitHub Actions CI pipeline.

Reduced flakiness of cluster startup caused by a race condition in container startup (Druid vs. MySQL)

MySQL has to be up and accepting connections before the coordinator/overlord containers start; otherwise there is a risk of them trying to create tables and indexes before MySQL is healthy. When this happened on my local machine, it led to undesired behavior: services started up, failed to create metadata tables and indexes, and then complained about those tables not existing for the remainder of cluster uptime.

To accomplish this, I made two main changes:

  1. I employed a docker-compose healthcheck for MySQL that waits until the Druid database accepts connections before reporting healthy. I've been tinkering with the intervals, timeouts, etc., trying to find a balance between waiting too long in a hopeless scenario and quitting just before success.
  2. I modified the generated docker-compose files to be more specific in defining service dependencies. Dependencies now carry a condition that must be met before the dependent container starts; services that rely on MySQL are forced to wait until it is healthy (see the sketch after this list).
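
A minimal sketch of what the generated compose entries could look like, expressed as the Python dicts that template.py emits; the exact command, credentials, and timings below are illustrative assumptions, not the committed values:

```python
# Hypothetical healthcheck for the metadata (MySQL) service: report healthy
# only once the Druid database accepts a real connection. The credentials
# and timings here are placeholders.
metadata_healthcheck = {
    'test': ['CMD-SHELL', 'mysql -u druid -pdiurd -e "SELECT 1" druid'],
    'interval': '5s',
    'timeout': '5s',
    'retries': 10,
}

# Conditional dependencies for a Druid "master" service: ZooKeeper only
# needs to have started, but MySQL must be healthy before launch.
master_depends_on = {
    'zookeeper': {'condition': 'service_started'},
    'metadata': {'condition': 'service_healthy'},
}
```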

Reduced flakiness of the Security IT suite

Locally I was still seeing flakiness that seemed to be caused by authz policies not syncing in time for each test. I saw that one test had its "policy sync sleep" increased to 4 seconds while the others were left at 1. I centralized the sleep time in a static variable so we don't run into drift like this again. I also upped the sleep to 10 seconds, but am not wedded to that number if we want to try to squeeze it down some. After this change, I saw much more consistent success running the Security suite locally.

Improved UX for folks using machines with Apple Silicon

In the past, the docker-compose template had a commented-out platform key that told Apple Silicon users to uncomment it. This inevitably leads to folks on Apple chips getting fatigued with remembering that bit of information. Instead, I modified the template.py script that generates the docker-compose file to key off the host platform and set the proper override when running on ARM (see the sketch below).
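
A minimal sketch of the detection idea, assuming template.py can inspect the host at generation time; the function name and the exact platform string are illustrative, not the committed code:

```python
import platform

def arm_platform_override() -> dict:
    # On arm64 hosts (e.g., Apple Silicon), pin the service to an amd64
    # platform so x86 images run under emulation, instead of the user
    # having to remember to uncomment a key in the template.
    if platform.machine().lower() in ('arm64', 'aarch64'):
        return {'platform': 'linux/amd64'}
    return {}
```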

Migrated Query and MultiStageQueryWithMM

Query and MultiStageQueryWithMM were using their own static docker-compose files. This goes against the pattern of dynamically generating the compose file used by the other IT categories. I suspect it is a case of the tests never fully migrating from the "old" IT pattern to the "revised" pattern. After migrating them to the generated compose file pattern, I had no issues locally. Interested to see them run in CI.

These categories rely on Kafka, which pushed me to add Kafka as a core dependency generated in the compose file by template.py. I updated the categories that don't need Kafka to exclude it from their compose files going forward (see the sketch below).
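
A minimal sketch of how a per-category exclusion could work in template.py; the function, parameter, and image names here are hypothetical:

```python
KAFKA = 'kafka'

def add_core_services(compose: dict, exclude: list) -> None:
    # Kafka is generated as a core dependency unless the IT category
    # explicitly excludes it (e.g., suites that never touch streaming).
    if KAFKA not in exclude:
        compose.setdefault('services', {})[KAFKA] = {
            'image': 'apache/kafka',  # illustrative image reference
            'depends_on': {'zookeeper': {'condition': 'service_started'}},
        }
```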

Release note

N/A as this is test-facing only.

For Druid Devs: Improve the user experience running integration-tests-ex on your local machine.


Key changed/added classes in this PR
  • integration-tests-ex/cases/cluster/template.py
  • integration-tests-ex/cases/src/test/java/org/apache/druid/testsEx/auth/ITSecurityBasicQuery.java
  • integration-tests-ex/cases/cluster/Common/dependencies.yaml

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster in a local dev environment.

The goal of this work is to make running the revised ITs locally easier for devs while not impacting the robustness of ITs running in the GitHub Actions CI pipeline
```diff
@@ -74,6 +74,10 @@ public class ITSecurityBasicQuery
   public static final String USER_1_PASSWORD = "password1";
   private static final String EXPORT_TASK = "/indexer/export_task.json";

+  // Time in ms to sleep after updating role permissions in each test. This intends to give the
+  // underlying test cluster enough time to sync permissions and be ready when test execution starts.
+  private static final int SYNC_SLEEP = 10000;
```
A reviewer (Member) commented:
this will total 50s - but I guess if someone is bothered that these tests are slow, someone will take a look at that...

I personally like slow tests better than flaky ones :)


```diff
 def define_master_service(self, name, base) -> dict:
     '''
     Defines a "master" service: one which depends on the metadata service.
     '''
     service = self.define_druid_service(name, base)
-    self.add_depends(service, [ZOO_KEEPER, METADATA])
+    dep_dict = { ZOO_KEEPER: {'condition': 'service_started'}, METADATA: {'condition': 'service_healthy'}}
```
A reviewer (Member) commented:
I feel like the original call sites were communicating the intention better - it also feels like service_healthy and service_started are abstraction leakage.

it would be nice to get the full condition differently - I feel like that has more to do with what the service wants to depend on - so the condition parts like service_started could be in a map or something; and add_depends could look that up - that would also leave all the call sites as is.

Having the dependency-service startup conditions use a centralized lookup with a sane default simplifies individual IT cases and will drive consistency across all existing and future IT cases.
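
A minimal sketch of that centralized lookup, with hypothetical names (the committed code may differ): add_depends consults a module-level condition map with a service_started default, so call sites keep passing plain service names:

```python
ZOO_KEEPER = 'zookeeper'
METADATA = 'metadata'

# Default startup condition per dependency; anything not listed falls
# back to plain 'service_started'.
DEPENDENCY_CONDITIONS = {
    METADATA: 'service_healthy',
}

def add_depends(service: dict, dependencies: list) -> None:
    depends = service.setdefault('depends_on', {})
    for dep in dependencies:
        condition = DEPENDENCY_CONDITIONS.get(dep, 'service_started')
        depends[dep] = {'condition': condition}
```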
@kgyrtkirk (Member) left a comment:

looks great! thank you for the updates!

@capistrant marked this pull request as draft March 29, 2025 01:58
@capistrant (Contributor, Author) commented:

Marking as draft so this doesn't get merged over the weekend. I think I found an issue that could impact CI stability.

In my local env I've been hammering this docker-compose setup with ups and downs to test, and noticed intermittent bad behavior where the metadata container gets marked Error and docker-compose then fails on the next health check. At first I thought my health check or config was somehow bad, but after a bunch of banging on it I realized that the metadata container was restarting every time it started up. In the logs I found it complaining about the init.sql file it tries to run at startup: https://github.com/apache/druid/blob/master/integration-tests-ex/cases/cluster/Common/dependencies.yaml#L90 ... It turns out this mount is actually a directory, because the file doesn't exist locally and Docker's default is to create a directory when mounting a path that doesn't exist (which I guess makes sense). MySQL's startup does not like this being a directory and it crashes the server; somehow it works on the next startup though (probably because the seed step gets skipped?).

In https://github.com/apache/druid/blob/master/integration-tests-ex/cases/cluster.sh#L132 we are removing the db dir, with the comment above making it seem like this is the MySQL data dir, even though we don't configure that anywhere and MySQL is definitely using its own auto-managed volume for the data dir.

From the official MySQL Docker image docs:

> When a container is started for the first time, a new database with the specified name will be created and initialized with the provided configuration variables. Furthermore, it will execute files with extensions .sh, .sql and .sql.gz that are found in /docker-entrypoint-initdb.d. Files will be executed in alphabetical order. You can easily populate your mysql services by mounting a SQL dump into that directory and provide custom images with contributed data. SQL files will be imported by default to the database specified by the MYSQL_DATABASE variable.

Long story short, we can either provide a proper seed file even though it isn't doing anything, or we can just forget about it and nuke the extra MySQL volume altogether.

@capistrant (Contributor, Author) commented:

As for why this causes an issue now even though it has been around for a long time: we were never waiting on container health before, so docker-compose never cared that the container restarted. Now it sometimes notices and blows up. It is inconsistent on my local machine, maybe 10% of the time with the current code, but enough to cause CI nightmares if it also happens in GitHub Actions. The changes I have tested seem to completely eradicate the problem. I'll push one up shortly, after I decide which approach I like more.

…ql docker entrypoint script

The mysql docker container allows you to provide startup scripts for the database in this entrypoint directory. We were mounting a directory named like a file, which confuses MySQL's Docker startup and forces a restart on boot. This restart is now problematic since we are telling docker-compose to wait for MySQL health before starting services that depend on it. Since we aren't using the startup script mechanism at all, simply removing it seems like the cleanest strategy.
@capistrant marked this pull request as ready for review March 30, 2025 21:58