Describe the bug
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR Agent is not passed through, which causes the following error to occur:
Traceback (most recent call last):
File "/<MY_DIR>/main.py", line 20, in <module>
elements = partition_pdf(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
return partition_pdf_or_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
elements = _partition_pdf_or_image_with_ocr(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
page_elements = _partition_pdf_or_image_with_ocr_from_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
ocr_agent = OCRAgent.get_agent(language=ocr_languages)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
return cls.get_instance(ocr_agent_cls_qname, language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
return loaded_class(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
self.agent = self.load_agent(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
paddle_ocr = PaddleOCR(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
lang, det_lang = parse_lang(params.lang)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
import os

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements  # workaround for #3931
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"  # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en"  # found in an old issue

filename = "path_to_your_file.pdf"
elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
Expected behavior
Script would run without errors and return elements.
Environment Info
Please run python scripts/collect_env.py and paste the output here.
Broken Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.17.0
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Working Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.16.25
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Additional context
From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call is only made inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which is not called by _partition_pdf_or_image_with_ocr. As a result, passing languages=["en"] has the languages converted to the Tesseract format by ocr_languages = prepare_languages_for_tesseract(languages), giving "eng", and nothing on this code path converts them back to a PaddleOCR code. The call ocr_agent = OCRAgent.get_agent(language=ocr_languages) then fails with the AssertionError shown above.
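To make the mismatch concrete, here is a minimal, self-contained sketch of the kind of conversion that appears to be missing on the "ocr_only" path. It does not use or reproduce unstructured's own code: the mapping table and the to_paddle_lang helper are hypothetical names invented for illustration, and only the set of accepted PaddleOCR codes is taken from the error message above.

```python
# Language codes accepted by PaddleOCR 2.6.2, copied from the AssertionError above.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical, deliberately partial mapping from Tesseract codes to PaddleOCR codes.
# This is an assumption for illustration, not the table unstructured ships.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
}


def to_paddle_lang(tesseract_langs: str, default: str = "en") -> str:
    """Convert a Tesseract language string such as "eng+kor" to one PaddleOCR code.

    PaddleOCR takes a single language, so the first code that maps to an
    accepted PaddleOCR language is returned; otherwise the default is used.
    """
    for code in tesseract_langs.split("+"):
        paddle_code = TESSERACT_TO_PADDLE.get(code.strip())
        if paddle_code in PADDLE_LANGS:
            return paddle_code
    return default


# "eng" is what reaches PaddleOCR in the traceback; it is not an accepted code.
print("eng" in PADDLE_LANGS)
print(to_paddle_lang("eng"))
```

If something like this ran before OCRAgent.get_agent on the "ocr_only" path, PaddleOCR would presumably receive "en" instead of "eng". Whether the fix belongs there or in passing ocr_agent through is for the maintainers to decide.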