bug/partition_pdf/HTTPError: HTTP Error 403: Forbidden #3890
Comments
Got the same error:
Downloading the packages after building the Docker image is quite cumbersome. I already downloaded the components during the build; is there any way to avoid this?
EDIT:
I have downgraded the Python version to 3.10.
Still getting the same errors after downgrading Python from 3.12 to 3.10 and upgrading unstructured from 0.16.3 to 0.16.6. Anyone else?
Python version 3.10.13 and unstructured version 0.16.10 is working for me.
This issue occurs because the default [...]. If you prefer not to upgrade, you can modify your [...] as follows:

    import os

    import nltk


    def check_for_nltk_package(package_name: str, package_category: str) -> bool:
        """Checks to see if the specified NLTK package exists on the image."""
        paths: list[str] = []
        for path in nltk.data.path:
            # Make sure every search path points at an "nltk_data" directory.
            if not path.endswith("nltk_data"):
                path = os.path.join(path, "nltk_data")
            paths.append(path)

        try:
            nltk.find(f"{package_category}/{package_name}", paths=paths)
            return True
        except (LookupError, OSError):
            return False


    def download_nltk_packages():
        """If required NLTK packages are not available, download them."""
        tagger_available = check_for_nltk_package(
            package_category="taggers",
            package_name="averaged_perceptron_tagger_eng",
        )
        tokenizer_available = check_for_nltk_package(
            package_category="tokenizers", package_name="punkt_tab"
        )

        if (not tokenizer_available) or (not tagger_available):
            nltk.download("averaged_perceptron_tagger_eng", quiet=True)
            nltk.download("punkt_tab", quiet=True)


    # Auto-download NLTK packages if the environment variable is set
    if os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true":
        download_nltk_packages()

Additionally, set the following environment variable to enable auto-download:

    import os

    os.environ["AUTO_DOWNLOAD_NLTK"] = "True"

This ensures that the required NLTK packages are downloaded dynamically when needed. 🚀
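For the Docker case raised earlier in the thread, where the data was already downloaded at build time, it may help to confirm on disk that the two packages are actually visible before anything tries the network. Below is a minimal sketch using only the standard library. It assumes the usual NLTK data layout (`<nltk_data>/<category>/<name>`, or a `<name>.zip` next to it) and that extra search directories come from the `NLTK_DATA` environment variable. The helper names `missing_nltk_packages` and `default_search_dirs` are hypothetical and are not part of unstructured or NLTK.

```python
import os
from pathlib import Path

# The two packages named in the workaround above.
REQUIRED = (
    ("tokenizers", "punkt_tab"),
    ("taggers", "averaged_perceptron_tagger_eng"),
)


def default_search_dirs():
    """Directories to inspect: NLTK_DATA entries, then ~/nltk_data (assumed layout)."""
    dirs = [p for p in os.environ.get("NLTK_DATA", "").split(os.pathsep) if p]
    dirs.append(str(Path.home() / "nltk_data"))
    return dirs


def missing_nltk_packages(search_dirs, required=REQUIRED):
    """Return 'category/name' for every required package not found on disk."""
    missing = []
    for category, name in required:
        found = any(
            (Path(d) / category / name).is_dir()
            or (Path(d) / category / f"{name}.zip").is_file()
            for d in search_dirs
        )
        if not found:
            missing.append(f"{category}/{name}")
    return missing


if __name__ == "__main__":
    # Purely a filesystem check; nothing is downloaded here.
    print("Missing NLTK packages:", missing_nltk_packages(default_search_dirs()))
```

If the list is empty inside the container but the 403 still appears, the data is probably in a directory that NLTK is not searching at runtime, which would point at `NLTK_DATA` or the user's home directory differing between build and run.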
Description
Extracting data from a PDF fails with HTTPError: HTTP Error 403: Forbidden
To Reproduce
Screenshots
Environment Info
"unstructured>=0.15.14"
python version 3.12
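Since the comments compare several Python and unstructured versions, exact installed versions are easier to report than a requirement range. A small sketch for collecting them with the standard library follows. The function name `environment_info` is hypothetical, and including `nltk` in the list is an assumption about which packages are relevant here.

```python
import platform
from importlib import metadata


def environment_info(packages=("unstructured", "nltk")):
    """Return the Python version and the installed version of each package (None if absent)."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in environment_info().items():
        print(f"{key}: {value}")
```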