PyPI serving stale content from JSON/Simple API #12214
Comments
Hmmm, that does certainly appear to be a cache invalidation issue. I'm not able to determine that we have had any systemic failures of the purge task (no reported failures in the last 48 hours). I'm wondering if the open incident impacting Fastly's API layer has anything to do with it, but it makes no mention of purges. |
Issuing a purge by the project key. The only failure mechanism I can see inside our codebase that would lead to this result is the purge task not being properly enqueued, though I don't see any errors in our telemetry indicating such a failure either. |
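For reference, issuing a purge by surrogate key against the Fastly purge API looks roughly like the sketch below. This is a minimal illustration using `requests`; the service ID, token, and key naming scheme are placeholders, not Warehouse's actual configuration.

```python
import requests

FASTLY_API = "https://api.fastly.com"
SERVICE_ID = "<service-id>"      # placeholder
FASTLY_TOKEN = "<api-token>"     # placeholder

def purge_key(key: str, soft: bool = True) -> None:
    """Purge every cached object tagged with the given surrogate key."""
    headers = {"Fastly-Key": FASTLY_TOKEN}
    if soft:
        # A soft purge marks matching objects stale instead of evicting them,
        # so stale-while-revalidate can still serve them while refreshing.
        headers["Fastly-Soft-Purge"] = "1"
    resp = requests.post(
        f"{FASTLY_API}/service/{SERVICE_ID}/purge/{key}",
        headers=headers,
        timeout=10,
    )
    resp.raise_for_status()

# e.g. purge the pages tagged with a project-level key (key name is illustrative)
purge_key("project/tencentcloud-sdk-python-bmlb")
```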
The manual purge resolved the problem, but another one has cropped up this morning. This seems to happen somewhat frequently when a new release adds more than one file very close together. It seems like there's a race in when the cache invalidations occur that perhaps prevents a subsequent, overlapping invalidation from executing:
['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201857, 'new release', 15098168]
['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201857, 'add py2.py3 file tencentcloud_sdk_python_bmlb-3.0.732-py2.py3-none-any.whl', 15098169]
['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201862, 'add source file tencentcloud-sdk-python-bmlb-3.0.732.tar.gz', 15098170]
|
Yeah, I wonder if our combination of soft purges and stale-while-revalidate is introducing a race. @dstufft any ideas here? |
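For context on the race being discussed: a soft purge only marks the cached object stale, and stale-while-revalidate lets Fastly keep serving that stale object while it revalidates in the background. Hypothetically, the relevant cache headers look something like the following (illustrative values only, not Warehouse's actual settings):

```python
# Illustrative cache headers only -- not Warehouse's actual values.
SURROGATE_HEADERS = {
    # Fastly-only directives: keep the object for 24h at the edge; after a
    # soft purge (or expiry) keep serving the stale copy for up to an hour
    # while a background fetch revalidates it against the origin.
    "Surrogate-Control": "max-age=86400, stale-while-revalidate=3600",
    # What downstream clients are told to cache.
    "Cache-Control": "max-age=600, public",
}
```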
This section seems relevant: https://developer.fastly.com/learning/concepts/stale/#shielding-considerations |
Seems we already try to protect against that edge case with this VCL: https://github.com/python/pypi-infra/blob/985133617a45bdf1ad24c217627707445a7435ff/terraform/warehouse/vcl/main.vcl#L5-L11 |
We had some discussion here; one theory is that this may be due to us issuing multiple purge requests in quick succession. We're going to leave this open for now, as we might need to reach out to Fastly support to help debug. |
In #12272 it was noted that this is happening with the Simple API as well, as would be expected. |
This also seems to be affecting the latest release of mypy (0.981), which is not currently installable via Poetry. |
Values from the backends are matching what I'm seeing for both compressed and uncompressed responses currently (from the CHI point of presence). @sciyoshi can you share the output from the following commands:
|
@ewdurbin sure - here's the output:
FWIW the uncompressed version does include 0.981 for me but the --compressed one does not. |
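For anyone reproducing this comparison, a rough equivalent in Python (rather than curl) is sketched below; the URL is the mypy JSON endpoint discussed above, and the rest is illustrative:

```python
import requests

URL = "https://pypi.org/pypi/mypy/json"

def release_versions(accept_encoding: str) -> set[str]:
    # Varying Accept-Encoding selects the cached compressed or uncompressed
    # variant of the same URL.
    resp = requests.get(URL, headers={"Accept-Encoding": accept_encoding}, timeout=10)
    resp.raise_for_status()
    return set(resp.json()["releases"])

compressed = release_versions("gzip")
uncompressed = release_versions("identity")

# If the cached compressed variant is stale, a new version (e.g. 0.981)
# shows up only in the uncompressed response.
print(sorted(uncompressed - compressed))
```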
Another instance here: piwheels/packages#326 |
OK, I'll go ahead and get a support request opened with Fastly. |
Ticket filed with Fastly, ID 535394 |
Another report here: #12290 |
Thank you guys for looking into this. In the meantime, is there anything we can do as a workaround on the user side to force a cache refresh of our project? |
Just realized that the Origin Cache is no longer configured to purge the right service for test.pypi.org after pypi/infra#95 🤦🏼 We broke up the legacy hand-curated Fastly service into terraformed services that are in line with production, but did not update the configuration for which service to purge. Since the old service still exists... no failure was evident. I'm updating the backend config for test.pypi.org now and will issue a purge-all for test.pypi.org. |
The test.pypi.org configuration is repaired and a purge-all for that service has been issued. |
Thank you! My project is now refreshed and working fine... ready for production, yay... |
Hi @ewdurbin, glad to hear that! I had a look at the log; the last long-lasting error was on … Some stale content still appears, but it quickly gets purged after a few retries without causing any problems. I wonder if that was because I kept the pace too tight, checking right between the update and the purge. I've disabled my "check uncompressed" workaround and will keep monitoring. I'll let you know if the mirror breaks. Thanks all! |
Just hours after switching off the "check uncompressed" workaround, the |
Package … I've pasted the result of … (please note the log is in UTC+8). Sadly, it seems like we're still far from actually fixing this problem. |
Thanks @TechCiel, we're continuing to follow up with Fastly support. @chriseldredge do you have any updates? |
Hi @ewdurbin, unfortunately it looks like the mitigations have not addressed the stale-cache bug in a meaningful way. It may be coincidence, but anecdotally things actually look worse on some recent days than they did in January. Here's a breakdown of the count of times my replicator gave up after waiting 10 minutes for the serial on the package info page to agree with what the XML-RPC API said was the highest serial at the time:
Valentine's Day was rough!
Oof, that's not looking great at all. @chriseldredge can you help us understand what those numbers mean? Are these all from a single "replicator" and do they include repeats/retries? We're gonna keep working with Fastly to see if there's something else we can do 😓 |
Those are counts of the number of times my replicator service gave up after 10 minutes waiting for the package info page to be up to date with XML-RPC. There are many duplicates, since sometimes the cache purges appear to fail around the globe. If it is useful, please let me know more specifically what data and format would help you continue to investigate the issue. I'll reiterate that the problem appears to affect packages with a high number of releases (e.g. checkov), which makes me wonder if the origin is not actually completing whatever asynchronous workflows it runs before the cache purge is requested. In other words, if the purge happens before the next call to the package page would return up-to-date information, the race condition is in the origin. |
There are no async workflows in uploading a package and having the origin serve it currently. There's a single HTTP handler that makes changes to the database, commits them, then fires off a purge task. There's another HTTP handler that reads committed data from the database to serve the simple pages. It shouldn't be possible for the purge task to fire until after the database transaction has committed, at which point the origin should already be serving the updated content. |
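As a sketch of that sequencing (simplified pseudo-Warehouse code, not the actual implementation; the helper and task names are hypothetical):

```python
def file_upload(db, upload, purge_queue):
    """Sketch of the ordering described above: commit first, then purge."""
    db.add_release_and_files(upload)   # write the new release/file rows
    db.commit()                        # the simple/JSON views now see them

    # Only after the commit is the project's surrogate key handed to the
    # purge task, so by the time Fastly processes the purge the origin is
    # already serving the updated content.
    purge_queue.enqueue("purge", key=f"project/{upload.project_name}")
```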
Just to document this all here: I've been doing some thinking on this problem to try to figure out if we can do something to fix or mitigate it. I've yet to come up with a reasonable way for us to fix this on our end unless there's some combination of Fastly features we're unaware of that would make it work, and so far Fastly hasn't been able to get us a resolution either. That pretty much leaves us with mitigation, and I think I may have come up with a workable strategy, but it requires some subtle changes to how PyPI functions, so I figured I'd lay it out here and see what people think.

Essentially this problem boils down to the fact that Fastly + Warehouse form a large, complex distributed system that is always going to be eventually consistent to some degree. The hope was that by using purging we could force the CDN to become consistent "right now", which meant we could tune our parameters so that "eventually consistent" could be quite a long time. Unfortunately, for whatever reason, purging has not been reliable enough for this use case, which means we're regularly falling back to "quite a long time", which breaks mirroring until the service becomes consistent again (eventually!).

So the basic thrust of the mitigation is to retune our TTL, dropping it from 24 hours to something more reasonable, maybe 10 or 15 minutes, which means the system should self-heal and become consistent on its own, even without purging, in 10-15 minutes (or maybe 20-30 minutes with shielding? I'm not sure offhand).

On its own, that would mean the number of requests that actually hit the origin servers goes up by quite a lot: each cached object would go from hitting the origin every 24h to every 10-30 minutes, or roughly a ~150x worst-case increase in requests to the origin for some of our heaviest views (if I recall correctly).

Fastly does have a feature to mitigate the impact of this (which we already employ): rather than blindly fetching a whole new response after the existing object expires, it will attempt to re-validate its existing object using a conditional HTTP request, effectively saying "Hey, I want X, but only if its ETag no longer matches the one I have."

Unfortunately, the way Warehouse works, processing a conditional HTTP request first runs the view code and does all of the work to generate a response, and only at the last minute does it check whether the request and response have a matching ETag (and return a 304 if so).

Fundamentally there's no reason why you have to generate the response prior to processing the conditional request, except that the way Warehouse generates an ETag requires having the full response body. However, if we had a lightweight way to predetermine what the ETag would be, we could answer the conditional request without doing the expensive rendering work.

That would be great, except there's another wrinkle. We have middleware in Warehouse that handles compression so that we can safely and intelligently have compression enabled without falling victim to CRIME, which also reduces our egress to Fastly (at the cost of the CPU overhead to compress). Obviously, to compress the response body we have to have the response body, so that middleware doesn't function without the full response. That middleware also has to implement its own ETag handling, since compressing changes the body.

The simplest thing to do would be to remove compression from Warehouse and shift it to the Fastly layer (pypi/infra#120); unfortunately, that means our egress costs would go up.
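To make the short-circuiting idea concrete, here is a rough sketch of answering the conditional request before rendering, assuming we had some cheap validator (say, a per-project serial) to derive the ETag from. None of this is existing Warehouse code; `render_simple_page` and the attribute names are hypothetical.

```python
from pyramid.httpexceptions import HTTPNotModified

def simple_detail_view(project, request):
    # Derive the validator from something cheap to look up (hypothetically a
    # per-project serial) instead of hashing the fully rendered body.
    etag = f"{project.normalized_name}-{project.last_serial}"

    # If the CDN's revalidation request already carries a matching ETag,
    # answer 304 without ever rendering the (expensive) response body.
    if etag in request.if_none_match:
        not_modified = HTTPNotModified()
        not_modified.etag = etag
        return not_modified

    response = render_simple_page(project, request)  # hypothetical helper
    response.etag = etag
    return response
```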
If we did that (moved compression out of Warehouse), then we could have Warehouse short-circuit on conditional requests, which would hopefully mean we could reduce our TTL and make this whole problem a lot more self-healing. I'm experimenting with options to keep the compression in Warehouse and still allow this short-circuiting behavior on conditional requests. One option is to turn our responses into streaming responses, but Pyramid can sometimes implicitly consume a streaming response and make it non-streaming if anything attempts to access the response body. Since that happens implicitly, we wouldn't short-circuit if anything caused it, and it would be non-obvious that it was happening, so it would end up being fairly fragile I think. I'll see if I can come up with any other options though.
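On the streaming-response fragility mentioned above: in WebOb/Pyramid, merely touching `response.body` consumes the iterator and buffers it, and it does so implicitly. A small illustration (assuming stock Pyramid/WebOb behavior):

```python
from pyramid.response import Response

def chunks():
    yield b"part one, "
    yield b"part two"

response = Response(app_iter=chunks())

# Anything that touches .body (a tween, middleware, logging, ...) consumes
# the generator and buffers it, implicitly turning the streaming response
# back into a fully materialized one.
print(response.body)      # b'part one, part two'
print(response.app_iter)  # now a plain list containing the joined bytes
```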
|
It seems to me that your Fastly cache invalidation does work most of the time, and the cases where my replicator runs into trouble are generally identical to a case I observed earlier today. My replicator running in us-east-1 got stuck processing changes to …
Looking at the relevant entries in the Warehouse XML-RPC changelog: …
This is similar to all cases I've investigated. It looks like the cache invalidation manages to nudge Fastly to refresh the package pages the first and maybe second time (I don't know offhand if "new release" triggers invalidation). But two seconds after the py3 file is added, the source file gets added, and this is the change that appears to fail to propagate through Fastly some of the time. Supposing that the origin Warehouse system is consistent in the absence of a CDN, i.e. that the information that appears in …
The precise order may not matter; the key concept is that the final cache invalidation instruction gets processed before an in-flight request/response completes. I'm not familiar with the timing of cache invalidation requests sent to Fastly, but if this hypothesis is correct, then it seems like debouncing invalidation requests could resolve the issue. In the above case, the two events came within 2 seconds of each other; a debounce threshold of 5 or 10 seconds could be adequate. This can also happen when a maintainer publishes more than one version of their package in quick succession, so the debouncing needs to be aggregated at the package level, not at the release-version level. |
Thanks @chriseldredge, that aligns with what I suspect is happening as well. @ewdurbin and I are roughing out an approach to batch purge multiple unique keys with a short delay (essentially, debouncing) to see if that mitigates the issue. |
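For illustration, a debounced/batched purge along those lines might look like the sketch below. The delay value, scheduling mechanism, and helper names are assumptions, not the actual approach being implemented.

```python
import threading

PURGE_DELAY_SECONDS = 10           # debounce window (assumed value)
_pending_keys: set[str] = set()    # surrogate keys accumulated in the window
_lock = threading.Lock()

def purge_key(key: str) -> None:
    """Placeholder for the actual purge-by-key call (see the earlier sketch)."""

def request_purge(key: str) -> None:
    """Record a surrogate key; the actual purge is deferred and batched."""
    with _lock:
        window_was_empty = not _pending_keys
        _pending_keys.add(key)
    if window_was_empty:
        # Flush once, PURGE_DELAY_SECONDS from now, picking up any keys
        # that arrive in the meantime (e.g. several files of one release).
        threading.Timer(PURGE_DELAY_SECONDS, flush_purges).start()

def flush_purges() -> None:
    """Issue one purge per unique key accumulated during the window."""
    with _lock:
        keys = set(_pending_keys)
        _pending_keys.clear()
    for key in keys:
        purge_key(key)
```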
Another report of a stale cache here: #13222 (comment) |
Hi. I'm going to be taking a closer look at this issue again tomorrow with Fastly. I haven't seen reports unrelated to #12933 here since February. If you have reports of stale caches (not stuck serials), please let me know! |
As far as I am currently aware, we are only seeing stuck/stale stuff due to #12933, which #13936 may resolve. If we are not seeing Fastly cache issues any longer, it may be that https://github.com/pypi/warehouse/blame/74d3a4335acc3840f3f718554c7868d704a322eb/warehouse/cache/origin/fastly.py#L131 resolved the issue? |
Hi, I'm a developer, and some of my workflows involve releasing a new version of my PyPI package and then subsequently using that version. Lately I've been making the following observations:
I came here to report this issue and quickly found this thread. I'm not knowledgeable about some of the terms in this issue, such as "Fastly cache", but I get the feeling that my issue is likely the same as this one. Given that, I thought I should comment here before making a new issue. So I suspect you all are already aware and working on this. But in the meantime, is there any recommended workaround to ensure that the metadata I get is always valid? Ideally a workaround that works both for … |
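One possible workaround for this kind of workflow (a hedged sketch, not official guidance from the PyPI maintainers): after publishing, poll the JSON API until the new version actually appears before running anything that installs it. The project name, version, and timing values below are placeholders.

```python
import time
import requests

def wait_for_release(project: str, version: str, timeout: float = 900) -> None:
    """Poll PyPI's JSON API until a freshly published version is visible."""
    url = f"https://pypi.org/pypi/{project}/json"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = requests.get(url, timeout=10).json()
        if version in data.get("releases", {}):
            return
        time.sleep(15)  # give the CDN time to catch up / revalidate
    raise TimeoutError(f"{project} {version} still not visible after {timeout}s")

wait_for_release("my-package", "1.2.3")  # placeholder project and version
```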
In the last week, I've seen
|
Likely related to #11936, I am seeing stale content served for the URL https://pypi.org/pypi/dbnd-postgres/json.
The following events show in changes_since_serial today:
However, this curl command shows the serial ID did not pick up the most recent event:
Output:
When I repeat this command without the --compressed flag I see the expected output:
Output:
This indicates to me that cache invalidation did not work.
I see from recent GitHub issues that this has been a recurring issue recently. It is painful since, without manual intervention, it won't self-resolve until the TTL of 24 hours expires.
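For anyone wanting to reproduce the check described in this report, a rough Python equivalent of the compressed-vs-uncompressed comparison is sketched below (the two variants are typically cached as separate objects, so one can go stale while the other does not):

```python
import requests

URL = "https://pypi.org/pypi/dbnd-postgres/json"

def last_serial(accept_encoding: str) -> int:
    resp = requests.get(URL, headers={"Accept-Encoding": accept_encoding}, timeout=10)
    resp.raise_for_status()
    return resp.json()["last_serial"]

# A compressed serial that lags behind the uncompressed one after a new
# upload suggests the cached compressed variant was never invalidated.
print("compressed:  ", last_serial("gzip"))
print("uncompressed:", last_serial("identity"))
```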