# Garbage Collection for Assets

* **Original Author(s)**: @eladb, @kaizencc
* **Tracking Issue**: #64
* **API Bar Raiser**: @rix0rrr

The asset garbage collection CLI will identify and/or delete unused CDK assets,
resulting in smaller bucket/repo size and lower cost for customers.

## Working Backwards

**CHANGELOG**:

- feat(cli): garbage collect s3 assets (under --unstable flag)
- feat(cli): garbage collect ecr assets (under --unstable flag)

**Help**:

```shell
➜ cdk gc --help
cdk gc [ENVIRONMENT...]

Finds and deletes all unused S3 and ECR assets in the ENVIRONMENT

Options:
  --type=[s3|ecr|all]                      filters for type of asset
  --action=[print|tag|delete-tagged|full]  type of action to perform on unused assets
  --rollback-buffer-days=number            number of days an asset should be isolated before deletion
  --created-buffer-days=number             number of days old an asset must be before it's eligible for deletion
  --bootstrap-stack-name=string            name of a custom bootstrap stack if not CDK Toolkit
  --confirm=boolean                        confirm with user before deleting assets

Examples:
  cdk gc
  cdk gc aws://ACCOUNT/REGION
  cdk gc --type=s3 --action=delete-tagged
```
|
**README:**

> [!CAUTION]
> CDK Garbage Collection is under development and therefore must be opted in via the `--unstable` flag: `cdk gc --unstable=gc`.

`cdk gc` garbage collects unused assets from your bootstrap bucket via the following mechanism:

- for each object in the bootstrap S3 Bucket, check whether it is referenced in any existing CloudFormation template
- if not, it is treated as unused and gc will either tag it or delete it, depending on your configuration

The high-level mechanism works identically for unused assets in bootstrapped ECR Repositories.
|
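As an illustration of that mechanism, here is a minimal TypeScript sketch of the S3 side. It assumes a hypothetical `templates` array holding the bodies of all deployed CloudFormation templates; it is a sketch of the idea, not the CLI's actual implementation:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Walk every object in the bootstrap bucket and collect the keys whose asset
// hash does not appear in any deployed template. `templates` is a hypothetical
// in-memory array of template bodies gathered ahead of time.
async function findUnusedAssets(s3: S3Client, bucket: string, templates: string[]): Promise<string[]> {
  const unused: string[] = [];
  let continuationToken: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      ContinuationToken: continuationToken,
    }));
    for (const obj of page.Contents ?? []) {
      // Asset keys embed the source hash (e.g. '<hash>.zip'), so an asset is
      // in use if any template body mentions that hash.
      const hash = obj.Key!.split('.')[0];
      if (!templates.some((body) => body.includes(hash))) {
        unused.push(obj.Key!);
      }
    }
    continuationToken = page.NextContinuationToken;
  } while (continuationToken);
  return unused;
}
```
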
The most basic usage looks like this:

```console
cdk gc --unstable=gc
```

This will garbage collect all unused assets in all environments of the existing CDK App.

To specify one type of asset, use the `type` option (options are `all`, `s3`, `ecr`):

```console
cdk gc --unstable=gc --type=s3
```

Otherwise, `cdk gc` defaults to collecting assets in both the bootstrapped S3 Bucket and ECR Repository.

`cdk gc` will garbage collect S3 and ECR assets from the current bootstrapped environment(s) and immediately
delete them. Note that, since the default bootstrap S3 Bucket is versioned, object deletion will be handled by
the lifecycle policy on the bucket.
|
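As a hedged illustration of such a lifecycle policy, the sketch below sets a noncurrent-version expiration rule with the AWS SDK; the bootstrap stack defines its own policy, so treat the rule name and the 30-day window here as assumptions:

```ts
import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

// Expire noncurrent object versions 30 days after they are replaced or
// deleted, so versions deleted by `cdk gc` are eventually removed for good.
// Illustrative only; the bootstrap template manages the real policy.
async function putExpirationRule(s3: S3Client, bucket: string): Promise<void> {
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: bucket,
    LifecycleConfiguration: {
      Rules: [{
        ID: 'expire-noncurrent-versions', // hypothetical rule name
        Status: 'Enabled',
        Filter: { Prefix: '' }, // apply to every object in the bucket
        NoncurrentVersionExpiration: { NoncurrentDays: 30 },
      }],
    },
  }));
}
```
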
Before we begin to delete your assets, you will be prompted:

```console
cdk gc --unstable=gc

Found X objects to delete based on the following criteria:
- objects have been isolated for > 0 days
- objects were created > 1 day ago

Delete this batch (yes/no/delete-all)?
```

Since it's quite possible that the bootstrap bucket has many objects, we work in batches of 1000 objects or 100 images.
To skip the prompt, either reply with `delete-all` or use the `--confirm=false` option:

```console
cdk gc --unstable=gc --confirm=false
```
|
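A sketch of the batched deletion, assuming a hypothetical `promptUser` callback in place of the interactive prompt. S3's `DeleteObjects` accepts at most 1,000 keys per call, which is where the batch size comes from (ECR's `BatchDeleteImage` is similarly capped at 100 image IDs):

```ts
import { S3Client, DeleteObjectsCommand } from '@aws-sdk/client-s3';

// Delete unused keys in batches of 1000, the maximum a single DeleteObjects
// call accepts. With --confirm=false, `confirm` would be false and the
// prompt is skipped entirely. `promptUser` is a hypothetical helper.
async function deleteInBatches(
  s3: S3Client,
  bucket: string,
  keys: string[],
  confirm: boolean,
  promptUser: (count: number) => Promise<boolean>,
): Promise<void> {
  for (let i = 0; i < keys.length; i += 1000) {
    const batch = keys.slice(i, i + 1000);
    if (confirm && !(await promptUser(batch.length))) {
      continue; // user declined this batch
    }
    await s3.send(new DeleteObjectsCommand({
      Bucket: bucket,
      Delete: { Objects: batch.map((Key) => ({ Key })) },
    }));
  }
}
```
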
If you are concerned about deleting assets too aggressively, there are multiple levers you can configure:

- rollback-buffer-days: the number of days an asset has to be marked as isolated before it is eligible for deletion.
- created-buffer-days: the number of days an asset must live before it is eligible for deletion.

When using `rollback-buffer-days`, instead of deleting unused objects, `cdk gc` will tag them with
today's date. It will also check if any objects have been tagged by previous runs of `cdk gc`
and delete them if they have been tagged for longer than the buffer days (a sketch of this bookkeeping
follows the example below).

When using `created-buffer-days`, we simply filter out any assets that have not persisted for that number
of days.

```console
cdk gc --unstable=gc --rollback-buffer-days=30 --created-buffer-days=1
```
|
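The bookkeeping behind `rollback-buffer-days` can be sketched with S3 object tags. The tag key below is hypothetical; the CLI's real tag name may differ:

```ts
import {
  S3Client,
  GetObjectTaggingCommand,
  PutObjectTaggingCommand,
} from '@aws-sdk/client-s3';

const ISOLATED_TAG = 'isolated-on'; // hypothetical tag key for this sketch

// Tag a newly unused object with today's date. PutObjectTagging replaces the
// whole tag set, so a real implementation would merge with existing tags.
async function tagAsIsolated(s3: S3Client, bucket: string, key: string): Promise<void> {
  await s3.send(new PutObjectTaggingCommand({
    Bucket: bucket,
    Key: key,
    Tagging: { TagSet: [{ Key: ISOLATED_TAG, Value: new Date().toISOString() }] },
  }));
}

// True if a previous run isolated the object more than `bufferDays` ago,
// meaning it is now eligible for deletion.
async function isolatedPastBuffer(s3: S3Client, bucket: string, key: string, bufferDays: number): Promise<boolean> {
  const { TagSet } = await s3.send(new GetObjectTaggingCommand({ Bucket: bucket, Key: key }));
  const tag = TagSet?.find((t) => t.Key === ISOLATED_TAG);
  if (!tag?.Value) {
    return false; // never isolated
  }
  const ageDays = (Date.now() - new Date(tag.Value).getTime()) / (24 * 60 * 60 * 1000);
  return ageDays > bufferDays;
}
```
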
You can also configure the scope of what `cdk gc` does via the `--action` option. By default, all actions
are performed, but you can specify `print`, `tag`, or `delete-tagged`.

- `print` makes no changes to your AWS account, but finds and prints the number of unused assets.
- `tag` tags any newly unused assets, but does not delete any unused assets.
- `delete-tagged` deletes assets that have been tagged for longer than the buffer days, but does not tag newly unused assets.

```console
cdk gc --unstable=gc --action=delete-tagged --rollback-buffer-days=30
```

This will delete assets that have been tagged as isolated for more than 30 days, but will not tag additional assets.
|
### Theoretical Race Condition with `REVIEW_IN_PROGRESS` stacks

When gathering stack templates, we currently ignore `REVIEW_IN_PROGRESS` stacks, as no template
is available while the stack is in that state. However, stacks in `REVIEW_IN_PROGRESS` have already
passed through the asset uploading step, where new assets are uploaded or existing ones are confirmed to exist.
It is therefore possible for the assets such a stack references to be marked as isolated and garbage
collected before the stack template is available.

Our recommendation is to not deploy stacks and run garbage collection at the same time. If that is unavoidable,
setting `--created-buffer-days` will help, as garbage collection will avoid deleting recently created
assets. Finally, if you do end up with a failed deployment, the mitigation is to redeploy, as the asset upload step
will reupload the missing asset.

In practice, this race condition applies only to a specific edge case and is unlikely to happen, but please open an
issue if you think it has happened to your stack.
|
---

Ticking the box below indicates that the public API of this RFC has been
signed-off by the API bar raiser (the `api-approved` label was applied to the
RFC pull request):

```
[x] Signed-off by API Bar Raiser @rix0rrr
```
|
## Public FAQ

### What are we launching today?

The `cdk gc` command, with support for garbage collection of unused S3 and ECR
assets.

### Why should I use this feature?

Currently, unused assets are left in the S3 bucket or ECR repository and add
cost for customers. This feature provides a swift way to identify and delete
unused assets.

### How does the command identify unused assets?

`cdk gc` will look at all the deployed stacks in the environment and store the
assets that are being referenced by these stacks. All assets that are not reached via
this tracing are determined to be unused.
|
#### A note on pipeline rollbacks and the `--rollback-buffer-days` option

In some pipeline rollback scenarios, the default `cdk gc` options may be overzealous in
deleting assets. A CI/CD system that offers indeterminate rollbacks without redeploying
expects that previously deployed assets still exist. If `cdk gc` is run between
the failed deployment and the rollback, the asset will be garbage collected. To mitigate
this, we recommend the following setting: `--rollback-buffer-days=30`. This will ensure
that all assets spend 30 days tagged as "unused" _before_ they are deleted, and should
guard against even the most pessimistic of rollback scenarios.
|
## Internal FAQ

> The goal of this section is to help decide if this RFC should be implemented.
> It should include answers to questions that the team is likely to ask. Contrary
> to the rest of the RFC, answers should be written "from the present" and
> likely discuss design approach, implementation plans, alternatives considered
> and other considerations that will help decide if this RFC should be
> implemented.
|
### Why are we doing this?

As customers continue to adopt the CDK and grow their CDK applications over time, their
asset buckets/repositories grow as well. At least one customer has
[reported](<https://github.com/aws/aws-cdk-rfcs/issues/64#issuecomment-897548306>) 0.5TB of
assets in their staging bucket. Most of these assets are unused and can be safely removed.
|
### Why should we _not_ do this?

There is a risk of removing assets that are in use, causing additional pain for the
customer. See [this](<https://github.com/aws/aws-cdk-rfcs/issues/64#issuecomment-833758638>)
GitHub comment.
|
### What is the technical solution (design) of this feature?

At a high level, garbage collection consists of two parallel processes: refreshing CFN stack templates
in the background and garbage collecting objects/images in the foreground. CFN stack templates are queried
every ~5 minutes and stored in memory. Then we go through the bootstrapped bucket/repository and check if
the hash in each object's key exists in _any_ template.
|
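A sketch of the background refresher under those assumptions, with an in-memory `Map` as the template store (skipping `REVIEW_IN_PROGRESS` stacks, which have no template yet; see the race condition discussion below):

```ts
import {
  CloudFormationClient,
  GetTemplateCommand,
  paginateListStacks,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});
const templates = new Map<string, string>(); // stack name -> template body

// Re-fetch every stack template and cache it in memory. Deleted stacks and
// REVIEW_IN_PROGRESS stacks (which have no template yet) are skipped.
async function refreshTemplates(): Promise<void> {
  for await (const page of paginateListStacks({ client: cfn }, {})) {
    for (const summary of page.StackSummaries ?? []) {
      if (summary.StackStatus === 'DELETE_COMPLETE' || summary.StackStatus === 'REVIEW_IN_PROGRESS') {
        continue;
      }
      const result = await cfn.send(new GetTemplateCommand({ StackName: summary.StackName }));
      templates.set(summary.StackName!, result.TemplateBody ?? '');
    }
  }
}

// Refresh in the background roughly every 5 minutes while collection runs.
setInterval(() => refreshTemplates().catch(console.error), 5 * 60 * 1000);
```
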
If `--rollback-buffer-days` is set, we tag the object as isolated; otherwise we delete it immediately.
When the option is set, we also check whether any objects have previously been marked as isolated
and are ready to be deleted, and whether any in-use assets are erroneously marked as isolated
and should be unmarked.
|
> Why are we storing the entire template in memory and not just the asset hashes?

We don't expect the bottleneck for `cdk gc` to be memory but rather
the (potentially) large number of AWS API calls. Storing hashes alone opens up the possibility
of missing an asset hash and inadvertently deleting something in use.
|
> What happens if we run `cdk deploy` (or `cdk destroy`) while `cdk gc` is in progress?

We mitigate this issue with the following redundancies:

- we refresh the in-memory state of CloudFormation stacks periodically to catch any new or updated stacks
- as a baseline, we do not delete any assets that are created after `cdk gc` is started (and this window can
  be widened via the `--created-buffer-days` option; see the sketch below)
|
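A sketch of that baseline age filter, treating the object's `LastModified` timestamp from `ListObjectsV2` as its creation time (a reasonable assumption for write-once assets):

```ts
// An object is eligible for deletion only if it predates both the moment
// `cdk gc` started and the --created-buffer-days window.
function eligibleByAge(lastModified: Date, gcStartTime: Date, createdBufferDays: number): boolean {
  const bufferCutoff = Date.now() - createdBufferDays * 24 * 60 * 60 * 1000;
  const cutoff = Math.min(gcStartTime.getTime(), bufferCutoff);
  return lastModified.getTime() < cutoff;
}
```
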
> Are there any race conditions between the refresher and garbage collection?

Yes, a small one. Stacks in `REVIEW_IN_PROGRESS` do not yet have a template to query, but these stacks
have already gone through asset uploading. There is a theoretical situation where a previously isolated
asset is referenced by a `REVIEW_IN_PROGRESS` stack; since we are unaware that this is happening,
we may delete the asset in the meantime. In practice, though, I am not expecting this to be a frequent
scenario.
|
### Is this a breaking change?

No.
|
### What alternative solutions did you consider?

Eventually, a zero-touch solution where garbage collection makes scheduled runs in the
background is what users would want. However, `cdk gc` would be the building block for
automated garbage collection, so it makes sense to start with a CLI experience and iterate
from there. After `cdk gc` stabilizes, we can vend a construct that runs periodically and
at some point add this to the bootstrapping stack.
|
### What are the drawbacks of this solution?

The main drawback is that we will own a CLI command capable of deleting assets in customer
accounts. Customers will rely on the correctness of the command to ensure that we are not deleting
in-use assets and crashing live applications.
|
### What is the high-level project plan?

`cdk gc` will trace all assets referenced by deployed stacks in the environment and delete
the assets that were not reached. As for how to implement this trace, I have not yet
settled on a plan. The command will focus on garbage collecting v2 assets, where there is a
separate S3 bucket and ECR repository in each bootstrapped account. Preliminary thoughts are
that we can either search for a string pattern that represents an asset location or utilize
stack metadata that indicates which assets are being used.
|
### Are there any open issues that need to be addressed later?

No.