Commit dc8b49a

kaizencc and rix0rrr authored Oct 30, 2024
RFC 64: Asset garbage collection (#379)
This is a request for comments about Asset Garbage Collection. See #64 for additional details. APIs are signed off by @rix0rrr . --- _By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license_ --------- Co-authored-by: Rico Hermans <[email protected]>
1 parent f19164f commit dc8b49a

3 files changed

+264
-0
lines changed


‎images/PipelineRollback.png

342 KB

‎images/garbagecollection.png

318 KB

‎text/0064-asset-garbage-collection.md

+264
@@ -0,0 +1,264 @@
1+
# Garbage Collection for Assets
2+
3+
* **Original Author(s)**: @eladb, @kaizencc
4+
* **Tracking Issue**: #64
5+
* **API Bar Raiser**: @rix0rrr
6+
7+
The asset garbage collection CLI will identify and/or delete unused CDK assets,
8+
resulting in smaller bucket/repo size and less cost for customers.
9+
10+
## Working Backwards
11+
12+
**CHANGELOG**:
13+
14+
- feat(cli): garbage collect s3 assets (under --unstable flag)
15+
- feat(cli): garbage collect ecr assets (under --unstable flag)
16+
17+
**Help**:
18+
19+
```shell
20+
➜ cdk gc --help
21+
cdk gc [ENVIRONMENT...]
22+
23+
Finds and deletes all unused S3 and ECR assets in the ENVIRONMENT
24+
25+
Options:
26+
--type=[s3|ecr|all] filters for type of asset
27+
--action=[print|tag|delete-tagged|full] type of action to perform on unused assets
28+
--rollback-buffer-days=number number of days an asset should be isolated before deletion
29+
--created-buffer-days=number number of days old an asset must be before it's eligible for deletion
30+
--bootstrap-stack-name=string name of a custom bootstrap stack if not CDKToolkit
31+
--confirm=boolean confirm with user before deleting assets
32+
33+
Examples:
34+
cdk gc
35+
cdk gc aws://ACCOUNT/REGION
36+
cdk gc --type=s3 --action=delete-tagged
37+
```
38+
39+
**README:**
40+
41+
> [!CAUTION]
42+
> CDK Garbage Collection is under development and therefore must be opted in via the `--unstable` flag: `cdk gc --unstable=gc`.
43+
44+
`cdk gc` garbage collects unused assets from your bootstrap bucket via the following mechanism:
45+
46+
- for each object in the bootstrap S3 Bucket, check to see if it is referenced in any existing CloudFormation templates
47+
- if not, it is treated as unused and gc will either tag it or delete it, depending on your configuration.
48+
49+
The high-level mechanism works identically for unused assets in bootstrapped ECR Repositories.
50+
51+
The most basic usage looks like this:
52+
53+
```console
54+
cdk gc --unstable=gc
55+
```
56+
57+
This will garbage collect all unused assets in all environments of the existing CDK App.
58+
59+
To specify one type of asset, use the `type` option (options are `all`, `s3`, `ecr`):
60+
61+
```console
62+
cdk gc --unstable=gc --type=s3
63+
```
64+
65+
Otherwise `cdk gc` defaults to collecting assets in both the bootstrapped S3 Bucket and ECR Repository.
66+
67+
`cdk gc` will garbage collect S3 and ECR assets from the current bootstrapped environment(s) and immediately
68+
delete them. Note that, since the default bootstrap S3 Bucket is versioned, object deletion will be handled by
69+
the lifecycle policy on the bucket.
70+
71+
Before we begin to delete your assets, you will be prompted:
72+
73+
```console
74+
cdk gc --unstable=gc
75+
76+
Found X objects to delete based off of the following criteria:
77+
- objects have been isolated for > 0 days
78+
- objects were created > 1 days ago
79+
80+
Delete this batch (yes/no/delete-all)?
81+
```
82+
83+
Since it's quite possible that the bootstrap bucket has many objects, we work in batches of 1000 objects or 100 images.
84+
To skip the prompt, either reply with `delete-all` or use the `--confirm=false` option.
85+
86+
```console
87+
cdk gc --unstable=gc --confirm=false
88+
```
89+
90+
If you are concerned about deleting assets too aggressively, there are multiple levers you can configure:
91+
92+
- rollback-buffer-days: the number of days an asset must be marked as isolated before it is eligible for deletion.
93+
- created-buffer-days: the number of days an asset must exist before it is eligible for deletion.
94+
95+
When using `rollback-buffer-days`, `cdk gc` will tag unused objects with today's date
96+
instead of deleting them. It will also check if any objects have been tagged by previous runs of `cdk gc`
97+
and delete them if they have been tagged for longer than the buffer days.
98+
99+
When using `created-buffer-days`, we simply filter out any assets that have not persisted that number
100+
of days.
101+
102+
```console
103+
cdk gc --unstable=gc --rollback-buffer-days=30 --created-buffer-days=1
104+
```
105+
106+
You can also configure which actions `cdk gc` performs via the `--action` option. By default, all actions
107+
are performed (`full`), but you can specify `print`, `tag`, or `delete-tagged`.
108+
109+
- `print` performs no changes to your AWS account, but finds and prints the number of unused assets.
110+
- `tag` tags any newly unused assets, but does not delete any unused assets.
111+
- `delete-tagged` deletes assets that have been tagged for longer than the buffer days, but does not tag newly unused assets.
112+
113+
```console
114+
cdk gc --unstable=gc --action=delete-tagged --rollback-buffer-days=30
115+
```
116+
117+
This will delete assets that have been unused for >30 days, but will not tag additional assets.
118+
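One way to read the `--action` values is as a pair of switches: whether newly unused assets get tagged, and whether anything gets deleted. The Python sketch below encodes that reading. The names `Operations` and `operations_for` are hypothetical and do not come from the CLI.

```python
from typing import NamedTuple


class Operations(NamedTuple):
    tag_new_unused: bool  # tag assets that are newly found to be unused
    delete: bool          # delete assets that are eligible for deletion


# Mapping derived from the descriptions above; `full` performs every action.
ACTIONS = {
    "print": Operations(tag_new_unused=False, delete=False),
    "tag": Operations(tag_new_unused=True, delete=False),
    "delete-tagged": Operations(tag_new_unused=False, delete=True),
    "full": Operations(tag_new_unused=True, delete=True),
}


def operations_for(action: str = "full") -> Operations:
    """Return which operations a given --action value permits."""
    try:
        return ACTIONS[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None


if __name__ == "__main__":
    for name in ACTIONS:
        print(name, operations_for(name))
```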
119+
### Theoretical Race Condition with `REVIEW_IN_PROGRESS` stacks
120+
121+
When gathering stack templates, we are currently ignoring `REVIEW_IN_PROGRESS` stacks as no template
122+
is available during the time the stack is in that state. However, stacks in `REVIEW_IN_PROGRESS` have already
123+
passed through the asset uploading step, where the deployment either uploads new assets or verifies that they already exist.
124+
Therefore it is possible the assets it references are marked as isolated and garbage collected before the stack
125+
template is available.
126+
127+
Our recommendation is to not deploy stacks and run garbage collection at the same time. If that is unavoidable,
128+
setting `--created-buffer-days` will help as garbage collection will avoid deleting assets that are recently
129+
created. Finally, if you do end up with a failed deployment, the mitigation is to redeploy, as the asset upload step
130+
will be able to reupload the missing asset.
131+
132+
In practice, this race condition is only for a specific edge case and unlikely to happen but please open an
133+
issue if you think that this has happened to your stack.
134+
135+
---
136+
137+
#
138+
139+
Ticking the box below indicates that the public API of this RFC has been
140+
signed-off by the API bar raiser (the `api-approved` label was applied to the
141+
RFC pull request):
142+
143+
```
144+
[x] Signed-off by API Bar Raiser @rix0rrr
145+
```
146+
147+
## Public FAQ
148+
149+
### What are we launching today?
150+
151+
The `cdk gc` command, with support for garbage collection of unused S3 and ECR
152+
assets.
153+
154+
### Why should I use this feature?
155+
156+
Currently, unused assets are left in the S3 bucket or ECR repository and contribute
157+
additional cost for customers. This feature provides a swift way to identify and delete
158+
unutilized assets.
159+
160+
### How does the command identify unused assets?
161+
162+
`cdk gc` will look at all the deployed stacks in the environment and store the
163+
assets that are being referenced by these stacks. All assets that are not reached via
164+
tracing are determined to be unused.
165+
166+
#### A note on pipeline rollbacks and the `--rollback-buffer-days` option
167+
168+
In some pipeline rollback scenarios, the default `cdk gc` options may be overzealous in
169+
deleting assets. A CI/CD system that offers indeterminate rollbacks without redeploying
170+
expects that previously deployed assets still exist. If `cdk gc` is run between
171+
the failed deployment and the rollback, the asset will be garbage collected. To mitigate
172+
this, we recommend the following setting: `--rollback-buffer-days=30`. This will ensure
173+
that all assets spend 30 days tagged as "unused" _before_ they are deleted, and should
174+
guard against even the most pessimistic of rollback scenarios.
175+
176+
![Illustration of Pipeline Rollback](../images/PipelineRollback.png)
177+
178+
## Internal FAQ
179+
180+
> The goal of this section is to help decide if this RFC should be implemented.
181+
> It should include answers to questions that the team is likely to ask. Contrary
182+
> to the rest of the RFC, answers should be written "from the present" and
183+
> likely discuss design approach, implementation plans, alternatives considered
184+
> and other considerations that will help decide if this RFC should be
185+
> implemented.
186+
187+
### Why are we doing this?
188+
189+
As customers continue to adopt the CDK and grow their CDK applications over time, their
190+
asset buckets/repositories grow as well. At least one customer has
191+
[reported](<https://github.com/aws/aws-cdk-rfcs/issues/64#issuecomment-897548306>) 0.5TB of
192+
assets in their staging bucket. Most of these assets are unused and can be safely removed.
193+
194+
### Why should we _not_ do this?
195+
196+
There is risk of removing assets that are in use, providing additional pain to the
197+
customer. See [this](<https://github.com/aws/aws-cdk-rfcs/issues/64#issuecomment-833758638>)
198+
GitHub comment.
199+
200+
### What is the technical solution (design) of this feature?
201+
202+
![Garbage Collection Design](../images/garbagecollection.png)
203+
204+
At a high level, garbage collection consists of two parallel processes - refreshing CFN stack templates
205+
in the background and garbage collecting objects/images in the foreground. CFN stack templates are queried
206+
every ~5 minutes and stored in memory. Then we go through the bootstrapped bucket/repository and check if
207+
the hash in the object's key exists in _any_ template.
208+
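The lookup described above can be sketched in a few lines of Python. This is only an illustration: it assumes object keys have the form `<hash>.<extension>`, and the function names are hypothetical. Template bodies are passed in as plain strings, standing in for the in-memory templates.

```python
import os
from typing import Iterable


def asset_hash_from_key(key: str) -> str:
    """Strip directory and extension from a key such as '<hash>.zip' (assumed layout)."""
    return os.path.splitext(os.path.basename(key))[0]


def is_referenced(key: str, template_bodies: Iterable[str]) -> bool:
    """True if the hash in `key` appears anywhere in any template body."""
    asset_hash = asset_hash_from_key(key)
    if not asset_hash:
        # Be conservative: a key we cannot parse is never reported as unused.
        return True
    return any(asset_hash in body for body in template_bodies)


if __name__ == "__main__":
    templates = ['{"S3Key": "abc123.zip"}', '{"Image": "repo:def456"}']
    print(is_referenced("abc123.zip", templates))
    print(is_referenced("ffff00.zip", templates))
```

Searching the whole template text rather than a pre-extracted list of hashes means a reference is found wherever it appears in the template.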
209+
If `--rollback-buffer-days` is set, we tag the object as isolated, otherwise we delete it immediately.
210+
When `--rollback-buffer-days` is set, we also check whether any previously isolated objects have been
210+
isolated long enough to be deleted, and whether any in-use assets are erroneously marked as isolated
211+
and should be unmarked.
213+
214+
> Why are we storing the entire template in memory and not just the asset hashes?
215+
216+
We don't expect that the bottleneck for `cdk gc` is going to be memory storage but rather
217+
the (potentially) large number of AWS API calls. Storing hashes alone opens up the possibility
218+
of missing an asset hash and inadvertently deleting something in-use.
219+
220+
> What happens if we run `cdk deploy` (or `cdk destroy`) while `cdk gc` is in progress?
221+
222+
We mitigate this issue with the following redundancies:
223+
224+
- we refresh the in-memory state of CloudFormation Stacks periodically to catch any new or updated stacks
225+
- as a baseline, we do not delete any assets that are created after `cdk gc` is started (and this can
226+
be increased via the `--created-buffer-days` option)
227+
228+
> Are there any race conditions between the refresher and garbage collection?
229+
Yes, a small one. Stacks in `REVIEW_IN_PROGRESS` do not yet have a template to query, but these stacks
230+
have already gone through asset uploading. There is a theoretical situation where a previously isolated
231+
asset is referenced by a `REVIEW_IN_PROGRESS` stack, and since we are unaware that that is happening,
232+
we may delete the asset in the meantime. In practice though, I am not expecting this to be a frequent
233+
scenario.
234+
235+
### Is this a breaking change?
236+
237+
No.
238+
239+
### What alternative solutions did you consider?
240+
241+
Eventually, a zero-touch solution where garbage collection makes scheduled runs in the
242+
background is what users would want. However, `cdk gc` would be the building block for the
243+
automated garbage collection, so it makes sense to start with a CLI experience and iterate
244+
from there. After `cdk gc` stabilizes, we can vend a construct that runs periodically and
245+
at some point add this to the bootstrapping stack.
246+
247+
### What are the drawbacks of this solution?
248+
249+
The main drawback is that we will own a CLI command capable of deleting assets in customer
250+
accounts. They will rely on the correctness of the command to ensure that we are not deleting
251+
in-use assets and crashing live applications.
252+
253+
### What is the high-level project plan?
254+
255+
`cdk gc` will trace all assets referenced by deployed stacks in the environment and delete
256+
the assets that were not reached. As for how to implement this trace, I have not yet
257+
settled on a plan. The command will focus on garbage collecting v2 assets, where there is a
258+
separate S3 bucket and ECR repository in each bootstrapped account. Preliminary thoughts are
259+
that we can either search for a string pattern that represents an asset location or utilize
260+
stack metadata that indicates which assets are being used.
261+
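To make the first of these preliminary options concrete, here is a hedged Python sketch of a string-pattern search. It assumes asset hashes appear in templates as 64-character lowercase hexadecimal strings. That format, the pattern, and the function name are assumptions for illustration, not something this RFC specifies.

```python
import re
from typing import Set

# Assumed format: a 64-character lowercase hex digest that is not part of a
# longer hex run. This pattern is an illustration, not the CLI's own.
ASSET_HASH_PATTERN = re.compile(r"(?<![0-9a-f])[0-9a-f]{64}(?![0-9a-f])")


def referenced_hashes(template_body: str) -> Set[str]:
    """Collect every substring of a template that looks like an asset hash."""
    return set(ASSET_HASH_PATTERN.findall(template_body))


if __name__ == "__main__":
    digest = "a" * 64
    body = '{"Code": {"S3Key": "' + digest + '.zip"}}'
    print(referenced_hashes(body))
```

A pattern that is too narrow would miss references, which is the risk the design section raises about keeping hashes alone instead of whole templates.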
262+
### Are there any open issues that need to be addressed later?
263+
264+
No
