bug/partition_pdf/HTTPError: HTTP Error 403: Forbidden #3890

nitishk94 · 2025-01-28T06:14:22Z

Description
While extracting data from a PDF, partition_pdf raises HTTPError: HTTP Error 403: Forbidden.

To Reproduce

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="./UniMech_Contract_Fin.pdf", strategy="hi_res")

HTTPError: HTTP Error 403: Forbidden

Environment Info
"unstructured>=0.15.14"
python version 3.12

karsil · 2025-01-28T11:49:41Z

Got the same error:

Traceback (most recent call last):
  File "/app/parsing.py", line 189, in <module>
    main(args.input_path)
  File "/app/parsing.py", line 158, in main
    document = parser.parse(pdf_file_like, filename=filename)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/parsing.py", line 57, in parse
    elements = partition(file=file, strategy="hi_res", include_page_breaks=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/auto.py", line 429, in partition
    elements = _partition_pdf(
               ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 202, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 317, in partition_pdf_or_image
    out_elements = _process_uncategorized_text_elements(elements)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 920, in _process_uncategorized_text_elements
    new_el = element_from_text(cast(Text, el).text)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text.py", line 294, in element_from_text
    elif is_possible_narrative_text(text):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 80, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 276, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/partition/text_type.py", line 225, in sentence_count
    sentences = sent_tokenize(text)
                ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 136, in sent_tokenize
    _download_nltk_packages_if_not_present()
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 130, in _download_nltk_packages_if_not_present
    download_nltk_packages()
  File "/usr/local/lib/python3.12/site-packages/unstructured/nlp/tokenize.py", line 88, in download_nltk_packages
    urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file)
  File "/usr/local/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Downloading the packages after building the Docker image is quite cumbersome. I already download the components during the build (see below); is there any way to avoid this?

RUN python -m nltk.downloader punkt_tab -d /root/nltk_data
RUN python -m nltk.downloader averaged_perceptron_tagger averaged_perceptron_tagger_eng
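One way to confirm that the build-time downloads landed where they are expected is a small check run as a later build step. The following is a minimal sketch using only the standard library. It assumes NLTK unpacks data as `<nltk_data>/<category>/<package>` (or leaves a `.zip` beside it) and that the search path is the `NLTK_DATA` environment variable followed by `~/nltk_data`; the helper names `candidate_dirs` and `missing_packages` are hypothetical and not part of unstructured or NLTK.

```python
import os
from pathlib import Path

# Category/package pairs named in this thread. The on-disk layout
# <nltk_data>/<category>/<package> is an assumption, not a documented guarantee.
REQUIRED = [
    ("tokenizers", "punkt_tab"),
    ("taggers", "averaged_perceptron_tagger_eng"),
]


def candidate_dirs(env=None):
    """Directories to search: NLTK_DATA entries first, then ~/nltk_data."""
    env = os.environ if env is None else env
    dirs = [Path(p) for p in env.get("NLTK_DATA", "").split(os.pathsep) if p]
    dirs.append(Path.home() / "nltk_data")
    return dirs


def missing_packages(search_dirs, required=REQUIRED):
    """Return the (category, package) pairs not found in any search directory."""
    missing = []
    for category, name in required:
        found = any(
            (base / category / name).is_dir()
            or (base / category / f"{name}.zip").is_file()
            for base in search_dirs
        )
        if not found:
            missing.append((category, name))
    return missing
```

Calling `missing_packages(candidate_dirs())` in the image and failing the build when the list is non-empty would surface a wrong download directory before the first `partition` call does.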

EDIT:
Updated my dependencies (unstructured 0.16.6), which solved my problem :)

nitishk94 · 2025-01-30T11:42:00Z

I downgraded Python to 3.10 and switched to unstructured 0.16.6, which resolved the problem.

rm2631 · 2025-02-03T20:37:03Z

Still getting the same errors after downgrading Python from 3.12 to 3.10 and upgrading unstructured from 0.16.3 to 0.16.6.

Anyone else?

qasim29 · 2025-02-10T08:23:27Z

Python version 3.10.13 and unstructured version 0.16.10 are working for me.
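Since the reports in this thread differ by version, it can help to check what is actually installed in the failing environment. This is a rough sketch using `importlib.metadata` from the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it ignores pre-release suffixes, which `packaging.version` would handle more carefully).

```python
from importlib import metadata
from itertools import takewhile


def parse_version(text):
    """Turn '0.16.10' into (0, 16, 10); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def at_least(package, minimum):
    """True if the package is installed at `minimum` or newer."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)
```

For example, `at_least("unstructured", "0.16.6")` uses the version that two commenters here reported as working; that threshold is taken from the thread, not from a confirmed fix version, and one commenter still saw the error on it.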

YunghuiHsu · 2025-02-11T03:09:23Z

This issue occurs because the default NLTK_DATA_URL is no longer valid. It is recommended to download the required NLTK data directly using native NLTK methods (refer to native NLTK download (#3796)).

If you prefer not to upgrade, you can modify your utils module by adding the following code:
(Reference: [Unstructured-IO/unstructured Tokenizer](https://github.com/Unstructured-IO/unstructured/blob/723c0740e0fae1a2e76f93f798c4aa24ec68e577/unstructured/nlp/tokenize.py))

import os

import nltk

def check_for_nltk_package(package_name: str, package_category: str) -> bool:
    """Checks to see if the specified NLTK package exists on the image."""
    paths: list[str] = []
    for path in nltk.data.path:
        if not path.endswith("nltk_data"):
            path = os.path.join(path, "nltk_data")
        paths.append(path)

    try:
        nltk.find(f"{package_category}/{package_name}", paths=paths)
        return True
    except (LookupError, OSError):
        return False


def download_nltk_packages():
    """If required NLTK packages are not available, download them."""

    tagger_available = check_for_nltk_package(
        package_category="taggers",
        package_name="averaged_perceptron_tagger_eng",
    )
    tokenizer_available = check_for_nltk_package(
        package_category="tokenizers", package_name="punkt_tab"
    )

    if (not tokenizer_available) or (not tagger_available):
        nltk.download("averaged_perceptron_tagger_eng", quiet=True)
        nltk.download("punkt_tab", quiet=True)

# Auto-download NLTK packages if the environment variable is set
if os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true":
    download_nltk_packages()

Additionally, set the following environment variable before this module is imported to enable auto-download (the check above already treats an unset variable as "True"):

import os

os.environ["AUTO_DOWNLOAD_NLTK"] = "True"

This ensures that the required NLTK packages are downloaded dynamically when needed. 🚀
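The flag check at the end of the snippet above has a few edge cases worth spelling out. Here is a minimal sketch that isolates the same comparison in a helper; the name `auto_download_enabled` is hypothetical, and the logic simply mirrors the `os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true"` line from this comment.

```python
import os


def auto_download_enabled(env=None):
    """Mirror the thread's check: unset counts as enabled, only 'true' enables."""
    env = os.environ if env is None else env
    return env.get("AUTO_DOWNLOAD_NLTK", "True").lower() == "true"


# Unset variable: the default "True" applies, so downloads are enabled.
print(auto_download_enabled({}))
# Any casing of "true" enables; other values such as "1" or "false" disable.
print(auto_download_enabled({"AUTO_DOWNLOAD_NLTK": "false"}))
```

Because the check runs at import time in the snippet above, the variable has to be set before that module is imported, and setting it to "False" is the way to opt out in an image where the data is already baked in.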

nitishk94 added the bug Something isn't working label Jan 28, 2025
