Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyPI serving stale content from JSON/Simple API #12214

Open
chriseldredge opened this issue Sep 14, 2022 · 77 comments
Open

PyPI serving stale content from JSON/Simple API #12214

chriseldredge opened this issue Sep 14, 2022 · 77 comments
Labels
bug 🐛 CDN/network Issues related to our CDN, users having problems connecting to PyPI needs discussion a product management/policy issue maintainers and users should discuss

Comments

@chriseldredge
Copy link

Likely related to #11936, I am seeing stale content served for the URL https://pypi.org/pypi/dbnd-postgres/json.

The following events show in changes_since_serial today:

['dbnd-postgres', '1.0.2.2', 1663173894, 'new release', 15093808]
['dbnd-postgres', '1.0.2.2', 1663173894, 'add py2.py3 file dbnd_postgres-1.0.2.2-py2.py3-none-any.whl', 15093809]
['dbnd-postgres', '1.0.2.2', 1663173939, 'add source file dbnd-postgres-1.0.2.2.tar.gz', 15093840]

However this curl command shows the serial ID did not pick up the most recent event:

curl -sq -I https://pypi.org/pypi/dbnd-postgres/json -H 'Fastly-Debug: 1' --compressed

Output:

HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-encoding: gzip
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/[email protected]/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/[email protected]/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "OQffMU7lDwd/T0iw2gQQVQ"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/dbnd-postgres
x-pypi-last-serial: 15093809
accept-ranges: bytes
date: Wed, 14 Sep 2022 20:20:20 GMT
fastly-debug-path: (D cache-iad-kiad7000060-IAD 1663186820) (F cache-iad-kcgs7200149-IAD 1663173907)
fastly-debug-ttl: (H cache-iad-kiad7000060-IAD 73486.697 86400.000 12913)
fastly-debug-digest: d7cf4a7d85536d23624f76df11d1059e7f9287eea031e6c73846429273a5acfd
x-served-by: cache-iad-kiad7000060-IAD
x-cache: HIT
x-cache-hits: 147
x-timer: S1663186820.267731,VS0,VE0
vary: Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 118223

When I repeat this command without the --compressed flag I see the expected output:

curl -sq -I https://pypi.org/pypi/dbnd-postgres/json -H 'Fastly-Debug: 1'

Output:

HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/[email protected]/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/[email protected]/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "XcI/eU0vTs6ZFLleB2JpKQ"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/dbnd-postgres
x-pypi-last-serial: 15093840
accept-ranges: bytes
date: Wed, 14 Sep 2022 20:27:40 GMT
fastly-debug-path: (D cache-iad-kiad7000106-IAD 1663187261) (F cache-iad-kiad7000060-IAD 1663173992)
fastly-debug-ttl: (H cache-iad-kiad7000106-IAD - - 13269)
fastly-debug-digest: d7cf4a7d85536d23624f76df11d1059e7f9287eea031e6c73846429273a5acfd
x-served-by: cache-iad-kiad7000106-IAD
x-cache: HIT
x-cache-hits: 1
x-timer: S1663187261.581965,VS0,VE3
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 553040

This indicates to me that cache invalidation did not work.

I see from recent GitHub issues this has been a recurring issue recently. It is painful since without manual intervention it won't self-resolve until the TTL of 24 hours expires.

@chriseldredge chriseldredge added bug 🐛 requires triaging maintainers need to do initial inspection of issue labels Sep 14, 2022
@ewdurbin
Copy link
Member

Hmmm, that does certainly appear to be a cache invalidation issue. I'm not able to determine that we have had any systemic failures of the purge task (no reported failures in the last 48 hours).

I'm wondering if the open incident impacting Fastly API layer has anything to do with it, but it makes no mention of purges.

@ewdurbin
Copy link
Member

Issuing a purge by the project key project/dbnd-postgres does seem to have resolved this specific instance (bringing the compressed response in line with the uncompressed one). So purges are working as expected by key at this moment.

The only failure mechanism I can see inside of our codebase that would lead to this result is if the purge task is not being properly enqueued, though I don't see any errors in our telemetry indicating such a failure either.

@chriseldredge
Copy link
Author

chriseldredge commented Sep 15, 2022

The manual purge resolved the problem, but another one has cropped up this morning. This seems to happen somewhat frequently when a new release adds more than one file very close together. It seems like there's a race between when the cache invalidation occurs that perhaps is preventing a subsequent overlapping invalidation from executing.

['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201857, 'new release', 15098168]
['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201857, 'add py2.py3 file tencentcloud_sdk_python_bmlb-3.0.732-py2.py3-none-any.whl', 15098169]
['tencentcloud-sdk-python-bmlb', '3.0.732', 1663201862, 'add source file tencentcloud-sdk-python-bmlb-3.0.732.tar.gz', 15098170]
curl -sq -I https://pypi.org/pypi/tencentcloud-sdk-python-bmlb/json -H 'Fastly-Debug: 1' --compressed
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-encoding: gzip
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/[email protected]/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/[email protected]/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "XnNNuWLsw0Qy0v0Sa/p3iw"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/tencentcloud-sdk-python-bmlb
x-pypi-last-serial: 15098169
accept-ranges: bytes
date: Thu, 15 Sep 2022 14:39:45 GMT
fastly-debug-path: (D cache-iad-kjyo7100176-IAD 1663252785) (F cache-iad-kiad7000106-IAD 1663201864)
fastly-debug-ttl: (H cache-iad-kjyo7100176-IAD - - 50921)
fastly-debug-digest: 6e4df74f912d01b6d565ab286c5aa8b0ce76887517b9db6f1e2b112994b00c27
x-served-by: cache-iad-kjyo7100176-IAD
x-cache: HIT
x-cache-hits: 1
x-timer: S1663252785.243295,VS0,VE2
vary: Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 129393
curl -sq -I https://pypi.org/pypi/tencentcloud-sdk-python-bmlb/json -H 'Fastly-Debug: 1'
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/[email protected]/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/[email protected]/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "bkVDS2B7X2PsUt8Lq+m3VQ"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/tencentcloud-sdk-python-bmlb
x-pypi-last-serial: 15098170
accept-ranges: bytes
date: Thu, 15 Sep 2022 14:39:50 GMT
fastly-debug-path: (D cache-iad-kjyo7100162-IAD 1663252791) (F cache-iad-kiad7000106-IAD 1663201996)
fastly-debug-ttl: (H cache-iad-kjyo7100162-IAD - - 50795)
fastly-debug-digest: 6e4df74f912d01b6d565ab286c5aa8b0ce76887517b9db6f1e2b112994b00c27
x-served-by: cache-iad-kjyo7100162-IAD
x-cache: HIT
x-cache-hits: 1
x-timer: S1663252791.880861,VS0,VE4
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 640713

@ewdurbin
Copy link
Member

Yeah, I wonder if our combination of soft purges and stale-while-revalidate is introducing a race. @dstufft any ideas here?

@ewdurbin
Copy link
Member

@ewdurbin
Copy link
Member

@di di added CDN/network Issues related to our CDN, users having problems connecting to PyPI needs discussion a product management/policy issue maintainers and users should discuss and removed requires triaging maintainers need to do initial inspection of issue labels Sep 23, 2022
@di
Copy link
Member

di commented Sep 23, 2022

We had some discussion here, one theory is that this may be due to us issuing multiple purge requests in quick succession. We're going to leave this open for now, as we might need to reach out to Fastly support to help debug.

@di
Copy link
Member

di commented Sep 26, 2022

In #12272 it was noted that this is happening with the Simple API as well, as would be expected.

@sciyoshi
Copy link

This also seems to be affecting the latest release of mypy (0.981) which is not currently installable via poetry.

@ewdurbin
Copy link
Member

Values from the backends are matching what I'm seeing for both compressed and uncompressed responses currently (from CHI point of presence).

@sciyoshi can you share output from the following commands:

curl -sq -H 'Fastly-Debug: 1' -I https://pypi.org/pypi/mypy/json
curl -sq -H 'Fastly-Debug: 1' -I --compressed https://pypi.org/pypi/mypy/json

@sciyoshi
Copy link

@ewdurbin sure - here's the output:

> curl -sq -H 'Fastly-Debug: 1' -I https://pypi.org/pypi/mypy/json
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/[email protected]/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic
.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIa
xBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/[email protected]/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' '
sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "aAQN6E+jltnU36zCGEdw6Q"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/mypy
x-pypi-last-serial: 15218304
accept-ranges: bytes
date: Tue, 27 Sep 2022 15:55:03 GMT
fastly-debug-path: (D cache-yul12832-YUL 1664294103) (F cache-yul12830-YUL 1664270144) (D cache-iad-kiad7000173-IAD 1664270144) (F cache-iad-kcgs7200155-IAD 1664217287)
fastly-debug-ttl: (H cache-yul12832-YUL - - 23959) (H cache-iad-kiad7000173-IAD 33542.955 86400.000 52857)
fastly-debug-digest: 4bbbfa91b65930e47ecdf33fd12d0f32abb49ea94de23346281ae11d8d7a7fb0
x-served-by: cache-iad-kiad7000173-IAD, cache-yul12832-YUL
x-cache: HIT, HIT
x-cache-hits: 493, 1
x-timer: S1664294103.018667,VS0,VE1
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 475053
> curl -sq -H 'Fastly-Debug: 1' -I --compressed https://pypi.org/pypi/mypy/json
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-encoding: gzip
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src '
none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com *.ethicalads.io; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0='; style-src 'self' fonts.googleapis.com *.ethicalads.io
 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU='; worker-src *.fastly-insights.com
content-type: application/json
etag: "wVyF0KeDmFzsUN0xxzrIfg"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key:
x-pypi-last-serial: 14483175
accept-ranges: bytes
date: Tue, 27 Sep 2022 15:55:08 GMT
fastly-debug-path: (D cache-yul12830-YUL 1664294109) (F cache-yul12830-YUL 1664271983) (D cache-iad-kiad7000138-IAD 1664271983) (F cache-iad-kcgs7200155-IAD 1659108366) (D cache-lcy19263-LCY 1659108366) (F cache-lcy19263-LCY 1659108339) (D cache-iad-kiad7000173-IAD 1659108339) (F cache-iad-kcgs7200120-IAD 1658588617)
fastly-debug-ttl: (H cache-yul12830-YUL 64273.801 86400.000 22126) (H cache-iad-kiad7000138-IAD 20723.341 86400.000 65677) (H cache-lcy19263-LCY 86372.920 86400.000 27) (H cache-iad-kiad7000173-IAD 85107.298 86400.000 1293)
fastly-debug-digest: 4bbbfa91b65930e47ecdf33fd12d0f32abb49ea94de23346281ae11d8d7a7fb0
x-served-by: cache-iad-kiad7000173-IAD, cache-lcy19263-LCY, cache-iad-kiad7000138-IAD, cache-yul12830-YUL
x-cache: HIT, HIT, HIT, HIT
x-cache-hits: 2, 1, 62, 36
x-timer: S1664294109.907644,VS0,VE0
vary: Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 93198

FWIW the uncompressed version does include 0.981 for me but the --compressed one does not.

@di
Copy link
Member

di commented Sep 28, 2022

Another instance here: piwheels/packages#326

@ewdurbin
Copy link
Member

OK, I'll go ahead and get a support request opened with Fastly.

@ewdurbin
Copy link
Member

ewdurbin commented Sep 28, 2022

Ticket filed with Fastly, ID 535394

@di
Copy link
Member

di commented Sep 29, 2022

Another report here: #12290

@Lumute
Copy link

Lumute commented Sep 29, 2022

Thank you guys for looking into this, in the meantime, is there anything we can do as a workaround on the user side to force a cache refresh of our project?

@ewdurbin
Copy link
Member

Just realizing that the Origin Cache is no longer configured to purge the right service for test.pypi.org after pypi/infra#95 🤦🏼

We broke up the legacy hand curated Fastly service into terraformed services that are in line with production, but did not update the configuration for the service to purge. Since the old service still exists... no failure was evident.

I'm updating the backend config for test.pypi.org now and will issue a purge all for test.pypi.org

@ewdurbin
Copy link
Member

test.pypi.org configuration is repaired and a purge all for that service has been issued.

@Lumute
Copy link

Lumute commented Sep 30, 2022

Thanks you! my project is now refreshed and working fine... ready for production, yay...

@TechCiel
Copy link

Hi, @ewdurbin , glad to hear that! Had a look on the log, the last long-lasting error was on serotiny, resolved after Feb 7 23:21 UTC.

Some stale content still appears, but quickly got purged after a few retries without causing any problem. I wonder if that was because I kept the pace too close, just between the update & purge.

I've disabled my "check uncompressed" workaround and will keep monitoring the status. I'll let you know if the mirror is broken. Thanks all!

@TechCiel
Copy link

Just hours after switching off the "check uncompressed" workaround, the mianxiang-wangluo-zhuanjia-de-linux-jifan (16801627) gone brrrrrr.... Re-enabling the workaround fixed the problem, maybe just this time, though.

@TechCiel
Copy link

Package maturin gone stale about 3 hours ago. As I check the recent logs, I've found some outburst of stale content.

I've paste the result of grep 'Giving up' pypi-*.log here: https://pastebin.com/BhzZGxd7

Please note the log is in UTC+8. Sadly, this seems like we're still far from actually fixing that problem.

@ewdurbin
Copy link
Member

Thanks @TechCiel, we're continuing to follow up with Fastly support.

@chriseldredge do you have any updates?

@chriseldredge
Copy link
Author

Hi @ewdurbin, unfortunately it looks like the mitigations have not addressed the stale cache bug in a meaningful way. It may be coincidence but anecdotally things actually look worse on some recent days than they did in January.

Here's a breakdown of the count of times my replicator gave up after waiting 10 minutes for the serial to agree on the package info page with what the XML-RPC api said was highest serial at the time:

   1 2023-01-28
   1 2023-01-29
   3 2023-02-06
  13 2023-01-24
  13 2023-01-26
  14 2023-02-01
  26 2023-01-30
  53 2023-02-02
  81 2023-02-09
  82 2023-02-08
  97 2023-02-07
 101 2023-02-12
 101 2023-02-16
 109 2023-02-11
 160 2023-02-10
 210 2023-02-13
 318 2023-02-15
 909 2023-02-14

Valentines day was rough!

@ewdurbin
Copy link
Member

Oof, that's not looking great at all. @chriseldredge can you help us understand what those numbers mean? Are these all from a single "replicator" and do they include repeats/retries?

We're gonna keep working with Fastly to see if there's something else we can do 😓

@chriseldredge
Copy link
Author

Those are counts of the number of times my replicator service gave up after 10 minutes waiting for the package info page to be up to date with xmlrpc. There are many duplicates since sometimes the cache purges appear to fail around the globe. If it is useful, please let me know more specifically what data and format would help you continue to investigate the issue.

I'll reiterate that the problem appears to affect packages with a high number of releases (e.g. checkov), which makes me wonder if the origin is not actually completing whatever asynchronous workflows prior to the cache purge being requested. In other words, if the purge happens before the next call to the package page would return up to date information, the race condition is in the origin.

@dstufft
Copy link
Member

dstufft commented Feb 21, 2023

There's no async workflows in uploading a package and having the origin serve that currently.

There's a single http handler that makes changes to the database, commits it, then fires off a purge task.

There's another http handler that reads committed things from the database to serve the simple pages.

It shouldn't be possible for the purge task to fire until after the database has committed it, at which point the origin should already be serving the updated content.

@dstufft
Copy link
Member

dstufft commented Feb 23, 2023

Just to document this all here, I've been doing some thinking on this problem to try and figure out if we can do something to fix or mitigate it.

I've yet to come up with a reasonable way for us to fix this on our end unless there's some combination of Fastly features that we're unaware of that would make it work, and so far Fastly hasn't been able to get us a resolution either. That pretty much leaves us with what we can do to mitigate the issue, which I think maybe I've come up with a workable strategy, but it requires some subtle changes to how PyPI functions so I figured I'd lay it out here and see what people think.

Essentially this problem boils down to the fact that Fastly + Warehouse form a complex and large distributed system where it is always going to be some levels of eventually consistent. The hope was that by using purging we could force the CDN to become consistent "right now", which meant that we could tune our parameters so that eventually consistent could be for quite a long time. Unfortunately for whatever reason, purging has not been reliable enough for this use case, which means that we're falling back to "quite a long time" regularly which breaks mirroring until the service becomes consistent again (eventually!).

So the basic thrust of our mitigation here is that we retune our TTL to drop it down from 24 hours to a value that is more reasonable, maybe 10 minutes or 15 minutes, which means that we should expect the system to self heal and become consistent on it's own even without purging in 10-15 minutes (or maybe 20-30 minutes with shielding? I'm not sure offhand).

On it's own that would mean that the number of requests that actually hit the origin servers will go up by quite a lot, each cached request will go from hitting the origin servers every 24h to every 10-30 minutes, or like a ~150x worst case increased number of requests to the origin servers for some of our heaviest views (if I recall at least).

Fastly does have a feature to mitigate the impact of these (which we already employ) where rather than just blindly fetching a whole new response after the existing object expires, it will instead attempt to re-validate its existing object using a conditional HTTP request, where it will say "Hey I want X, but only if the ETag is no longer Z", which Warehouse will look at and go "hey it hasn't changed", 304 Not Modified with an empty response body to tell Fastly to use it's existing object and treat it as new, or "it's changed, 200 OK" with the full response.

Unfortunately the way that Warehouse works is that to process a conditional http request it first runs the view code and does all of the work to generate a response, then at the last minute it looks to see if the request and response have matching ETags, and if they do it throws away the response it just generated and returns an empty 304 Not Modified, but it still had to do all of the work to actually generate that response it threw away.

Fundamentally there's no reason why you have to generate the response prior to processing the conditional request, except that the way Warehouse handles generating an ETag is to generate a md5 hash of the response body 1 because if you want to generically generate an ETag, there's not really a better way to do it 2. That means that as implemented, we can't know what the ETag of a response is going to be, without generating the response.

However, if we had a lightweight way to predetermine what the ETag was going to be, we could process the conditional request up front and skip generating the response body if we're going to return a 304 Not Modified. As it so happens, for the pages that we really care about here, we could just use the last serial as our ETag (or derive it from last serial somehow) which we do have a lightweight way to look up what our last serial for a project is (it's a column off the projects table).

That would be great, except that there's another wrinkle. We have middleware in Warehouse that handles compression so that we can safely and intelligently have compression enabled without falling victim to CRIME and also reduces our egress to Fastly (at the cost of the CPU overhead to compress).

Obviously to compress the response body we have to have the response body, so that middleware doesn't function without having the full response body. That middleware also has to implement it's own ETag scheme because the ETag for a compressed and uncompressed response needs to differ.

The simplest thing to do would be to remove compression from Warehouse and shift that to happen in the Fastly layer (pypi/infra#120), unfortunately that means that our egress costs would end up going up. If we did that, then we could have Warehouse short circuit on conditional requests, which would hopefully mean that we could reduce our TTL and make this whole problem a lot more self healing.

I'm experimenting with options to keep the compression in Warehouse and still allow this short circuiting on conditional requests behavior.

One option is to turn our responses into streaming responses, but Pyramid can sometimes implicitly consume the streaming response and make it non-streaming if anything attempts to access the response body. Since that happens implciitly it means we wouldn't short circuit if anything caused that to happen but it would be non obvious that it was happening, so it would end up being fairly fragile I think.

I'll see if I can come up with any other options though.

Footnotes

  1. Fun fact, this is why last serial is included in those response bodies, because otherwise changes that incremented the serial but didn't affect the response body would mean that Fastly would resurrect the old cached object, with the old last serial header, so we had to force the request body to change for every last serial increment.

  2. To really be accurate, you'd also want to hash some or most of the response headers as well, but just doing the body gets you 95% of the way there.

@chriseldredge
Copy link
Author

It seems to me that your Fastly cache invalidation does work most of the time, and the cases where my replicator runs into trouble are generally identical to a case I observed earlier today.

My replicator running in us-east-1 got stuck processing changes to spanishconjugator:

2023-02-23T09:25:15Z Expected package spanishconjugator refresh serial to be greater or equal to 16990807 but was 16990805

Looking at relevant entries in the Warehouse xmlrpc changelog_since_serial api:

Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990804 action=new release
Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990805 action=add py3 file spanishconjugator-2.3.4236-py3-none-any.whl
Thu Feb 23 09:24:21 2023 spanishconjugator 2.3.4236 serial=16990807 action=add source file spanishconjugator-2.3.4236.tar.gz

This is similar to all cases I've investigated. It looks like the cache invalidation manages to nudge Fastly to refresh the package pages the first and maybe second time (I don't know off hand if "new release" triggers invalidation). But two seconds after the py3 file is added, the source file gets added. This is the change that appears to fail to propagate some of the time through Fastly.

Supposing that the origin Warehouse system appears consistent in the absence of a CDN, e.g. that information that appears in changelog_since_serial responses is never newer than what appears in the simple/json APIs for a given package, then it seems we can narrow down the scope of this problem to how Fastly works in a distributed network when there are overlapping activities. Hypothetically, here's a sequence that could explain what we see:

  1. Maintainer uploads a py3 file
  2. Warehouse processes the transaction
  3. Warehouse issues a cache invalidation to Fastly
  4. Fastly processes the cache purge request from (3)
  5. A replicator asks for the package info page
  6. The Fastly node sees a cache miss and requests the package info page from origin
  7. Origin begins processing the request. It takes a while to load and transfer because there are hundreds of releases for this package.
  8. Maintainer uploads a source file
  9. Warehouse processes the transaction
  10. Warehouse issues a cache invalidation to Fastly
  11. Fastly processes the cache purge request from (10)
  12. The response from (7) completes transferring back to Fastly where it gets inserted into the cache despite being stale.
  13. The stale response gets returned to the viewer

The precise order may not matter, with the key concept being that the final cache invalidation instruction gets processed before an in-flight request/response completes.

I don't have familiarity of the timing of cache invalidation requests sent to Fastly, but if this hypothesis is correct then it seems like debouncing invalidation requests could resolve the issue. In the above case, the two events came within 2 seconds of each other. A debounce threshold of 5 or 10 seconds could be adequate. This can also happen when a maintainer publishes more than one version of their package in quick succession, so the debouncing needs to be aggregated at the package level, not at a release version level.

@di
Copy link
Member

di commented Feb 24, 2023

Thanks @chriseldredge, that aligns with what I suspect is happening as well. @ewdurbin and I are roughing out an approach to batch purge multiple unique keys with a short delay (essentially, debouncing) to see if that mitigates the issue.

@TechCiel

This comment was marked as off-topic.

@TechCiel

This comment was marked as off-topic.

@shankerwangmiao

This comment was marked as off-topic.

@ewdurbin

This comment was marked as off-topic.

@shankerwangmiao

This comment was marked as off-topic.

@di
Copy link
Member

di commented Apr 19, 2023

Another report of a stale cache here: #13222 (comment)

@shankerwangmiao

This comment was marked as off-topic.

@ewdurbin
Copy link
Member

Hi. I'm going to be taking a closer look at this issue again tomorrow with Fastly.

I haven't seen reports unrelated to #12933 here since February. If you have reports of stale caches (not stuck serials) Please let me know!

@ewdurbin
Copy link
Member

As far as I am currently aware, we are only seeing stuck/stale stuff due to #12933, which #13936 may resolve.

If we are not seeing fastly cache issues any longer, it may be that https://github.com/pypi/warehouse/blame/74d3a4335acc3840f3f718554c7868d704a322eb/warehouse/cache/origin/fastly.py#L131 resolved the issue?

@mgroth0
Copy link

mgroth0 commented Jun 20, 2023

Hi, I'm a developer and some of my workflows involve releasing a new version of my pypi package and then subsequently using that version. Lately I've been making the following observations:

  1. If I update my pypi package and then shortly after try to install that new version with pip install mypackage=x.x.x, the first time it does not work. The second time, it does work.
  2. If I update my pypi package and then check https://pypi.org/pypi/mypackage/json, the first time I load that URL it doesn't include the newest version. The second time, it does.

I came here to report this issue, and quickly found this thread. I'm not knowledgeable about some of the terms in this issue, such as "fastly cache", but I get the feeling that my issue is likely the same as this issue. Given that, I though I should comment here before making a new issue.

So I suspect you all are already aware and working on this. But in the meantime, is there any recommended workaround to ensure that the metadata I get is always valid? Ideally a workaaround that works both for pip and for reading the raw output of the json api directly. I read through https://warehouse.pypa.io/api-reference/json.html but didn't find anything helpful. In this thread people are talking about things that I'm curious about such as -H 'Fastly-Debug: 1' --compressed but I'm not sure what they mean and I'm wondering if those are supposed to ensure that the metadata is valid and up to date?

@chriseldredge
Copy link
Author

In the last week, I've seen spanishconjugator run into issues multiple times, along with some other one-offs:

2023-06-14 08:23:01.905	essenceapi
2023-06-14 08:59:29.640	willow
2023-06-15 00:46:11.293	openet-sims
2023-06-15 14:33:11.496	spanishconjugator
2023-06-17 00:34:55.405	pepperize-cdk-github
2023-06-19 20:32:14.270	spanishconjugator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 CDN/network Issues related to our CDN, users having problems connecting to PyPI needs discussion a product management/policy issue maintainers and users should discuss
Projects
None yet
Development

No branches or pull requests