bug/partition_pdf/HTTPError: HTTP Error 403: Forbidden #3890

Open
nitishk94 opened this issue Jan 28, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@nitishk94

Description
While extracting data from a PDF, partition_pdf raises HTTPError: HTTP Error 403: Forbidden.

To Reproduce

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="./UniMech_Contract_Fin.pdf", strategy="hi_res")
HTTPError: HTTP Error 403: Forbidden

Environment Info
"unstructured>=0.15.14"
python version 3.12

nitishk94 added the bug label Jan 28, 2025
@karsil

karsil commented Jan 28, 2025

Got the same error:

Traceback (most recent call last):
  File "/app/parsing.py", line 189, in <module>
    main(args.input_path)
  File "/app/parsing.py", line 158, in main
    document = parser.parse(pdf_file_like, filename=filename)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/parsing.py", line 57, in parse
    elements = partition(file=file, strategy="hi_res", include_page_breaks=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/auto.py", line 429, in partition
    elements = _partition_pdf(
               ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 202, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 317, in partition_pdf_or_image
    out_elements = _process_uncategorized_text_elements(elements)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 920, in _process_uncategorized_text_elements
    new_el = element_from_text(cast(Text, el).text)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text.py", line 294, in element_from_text
    elif is_possible_narrative_text(text):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 80, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 276, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 225, in sentence_count
    sentences = sent_tokenize(text)
                ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 136, in sent_tokenize
    _download_nltk_packages_if_not_present()
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 130, in _download_nltk_packages_if_not_present
    download_nltk_packages()
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 88, in download_nltk_packages
    urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file)
  File "/usr/local/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
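
The traceback ends in urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file), so the 403 appears to come from a runtime download of NLTK data rather than from reading the PDF itself. A minimal sketch for surfacing that distinction in calling code follows; `explain_partition_error` is a hypothetical helper written for illustration and is not part of unstructured.

```python
import urllib.error


def explain_partition_error(exc: BaseException) -> str:
    """Return a short hint for an exception raised while partitioning.

    Hypothetical helper for illustration; not part of unstructured.
    """
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 403:
            # In this thread the failing request was the NLTK data download.
            return "HTTP 403 during a runtime download; the input file was not the cause"
        return f"HTTP {exc.code} during a runtime download"
    # Anything else is reported as-is.
    return f"{type(exc).__name__}: {exc}"
```

One way to use it is to wrap the partition call in try/except, catch `urllib.error.HTTPError`, and log the returned hint before re-raising.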

Downloading the packages after building the Docker image is quite cumbersome. I already download the components during the build; is there any way to avoid the runtime download?

RUN python -m nltk.downloader punkt_tab -d /root/nltk_data
RUN python -m nltk.downloader averaged_perceptron_tagger averaged_perceptron_tagger_eng
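
To confirm at build time that the data actually landed where expected, a check along these lines could run as a later build step. This is a sketch that assumes NLTK's usual `<data_dir>/<category>/<package>` layout (unpacked directory or `.zip` archive); `missing_nltk_resources` and `REQUIRED` are hypothetical names. Note that the second RUN line above passes no -d option, so its target directory depends on NLTK's default and may differ from /root/nltk_data.

```python
from pathlib import Path

# Category directory and package name for the resources named in this thread.
REQUIRED = (
    ("tokenizers", "punkt_tab"),
    ("taggers", "averaged_perceptron_tagger_eng"),
)


def missing_nltk_resources(data_dir: Path) -> list[str]:
    """List required resources that are neither unpacked nor present as a .zip.

    Assumes the <data_dir>/<category>/<package> layout; hypothetical helper.
    """
    missing = []
    for category, name in REQUIRED:
        unpacked = data_dir / category / name
        zipped = data_dir / category / f"{name}.zip"
        if not unpacked.is_dir() and not zipped.is_file():
            missing.append(f"{category}/{name}")
    return missing
```

Calling `missing_nltk_resources(Path("/root/nltk_data"))` and failing the build when the list is non-empty would catch a misplaced download before the image ships.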

EDIT:
Updated my dependencies (unstructured 0.16.6), which solved my problem :)

@nitishk94
Author

I downgraded Python to 3.10 and switched to unstructured 0.16.6, which resolved the problem.

@rm2631

rm2631 commented Feb 3, 2025

Still getting the same errors after downgrading Python from 3.12 to 3.10 and upgrading unstructured from 0.16.3 to 0.16.6.

Anyone else?

@qasim29

qasim29 commented Feb 10, 2025

Python version 3.10.13 and unstructured version 0.16.10 are working for me.
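
Since the reports in this thread hinge on which unstructured release is installed, a quick check of the installed version can help when comparing environments. The sketch below uses only the standard library; `release_tuple` and `installed_at_least` are hypothetical helpers, the parsing is deliberately naive (numeric dot-separated parts only), and the 0.16.6 threshold is taken from the reports above, one of which still saw the error after upgrading.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple[int, ...]:
    """Turn '0.16.10' into (0, 16, 10); stops at the first non-numeric part."""
    parts = []
    for part in version_string.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


def installed_at_least(package: str, minimum: str) -> bool:
    """True if `package` is installed and its release is >= `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    # Compare as integer tuples so that 0.16.10 sorts after 0.16.6.
    return release_tuple(installed) >= release_tuple(minimum)
```

For example, `installed_at_least("unstructured", "0.16.6")` reports whether the environment is at or past the release two commenters found to work. Comparing the raw strings instead would order "0.16.10" before "0.16.6", which is why the versions are converted to integer tuples first.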

@YunghuiHsu

This issue occurs because the default NLTK_DATA_URL is no longer valid. It is recommended to download the required NLTK data directly using native NLTK methods (refer to native NLTK download (#3796)).

If you prefer not to upgrade, you can modify your utils module by adding the following code:
(Reference: [Unstructured-IO/unstructured Tokenizer](https://github.com/Unstructured-IO/unstructured/blob/723c0740e0fae1a2e76f93f798c4aa24ec68e577/unstructured/nlp/tokenize.py))

import os

import nltk

def check_for_nltk_package(package_name: str, package_category: str) -> bool:
    """Checks to see if the specified NLTK package exists on the image."""
    paths: list[str] = []
    for path in nltk.data.path:
        if not path.endswith("nltk_data"):
            path = os.path.join(path, "nltk_data")
        paths.append(path)

    try:
        nltk.find(f"{package_category}/{package_name}", paths=paths)
        return True
    except (LookupError, OSError):
        return False


def download_nltk_packages():
    """If required NLTK packages are not available, download them."""

    tagger_available = check_for_nltk_package(
        package_category="taggers",
        package_name="averaged_perceptron_tagger_eng",
    )
    tokenizer_available = check_for_nltk_package(
        package_category="tokenizers", package_name="punkt_tab"
    )

    if (not tokenizer_available) or (not tagger_available):
        nltk.download("averaged_perceptron_tagger_eng", quiet=True)
        nltk.download("punkt_tab", quiet=True)

# Auto-download NLTK packages if the environment variable is set
if os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true":
    download_nltk_packages()

Additionally, set the following environment variable before that module is imported, since the check runs at import time:

import os

os.environ["AUTO_DOWNLOAD_NLTK"] = "True"

This ensures that the required NLTK packages are downloaded dynamically when needed. 🚀
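
The environment check in the snippet above can be read in isolation. The sketch below mirrors that expression in a small function so its behavior is easy to verify; `auto_download_enabled` is a hypothetical name, and the optional `env` parameter exists only to make the check testable.

```python
import os
from typing import Mapping, Optional


def auto_download_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Mirror the check above: unset, or any casing of 'true', enables download."""
    if env is None:
        env = os.environ
    return env.get("AUTO_DOWNLOAD_NLTK", "True").lower() == "true"
```

With this expression an unset variable counts as enabled, "TRUE" and "true" both enable it, and any other value (including "1" or "yes") disables the download.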
