bug/PaddleOCR language specification issue #3957

joshrbarcodefactory · 2025-03-14T18:22:03Z

Describe the bug
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR Agent is not passed through, which causes the following error to occur:

Traceback (most recent call last):
  File "/<MY_DIR>/main.py", line 20, in <module>
    elements = partition_pdf(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
    return partition_pdf_or_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent(language=ocr_languages)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
    return cls.get_instance(ocr_agent_cls_qname, language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    return loaded_class(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
    self.agent = self.load_agent(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
    paddle_ocr = PaddleOCR(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
    lang, det_lang = parse_lang(params.lang)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
    lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

import os

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en" # found in an old issue
filename = "path_to_your_file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)

Expected behavior
Script would run without errors and return elements.

Environment Info
Broken Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.17.0
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Working Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.16.25
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Additional context
From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call is only made inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which _partition_pdf_or_image_with_ocr never calls. As a result, languages=["en"] is converted to the Tesseract language format by ocr_languages = prepare_languages_for_tesseract(languages), and that value ("eng", per the traceback) is handed unchanged to ocr_agent = OCRAgent.get_agent(language=ocr_languages), which fails because PaddleOCR does not accept it.
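
To illustrate the missing step, here is a minimal sketch of the kind of conversion the ocr_only path would need before calling OCRAgent.get_agent. It is not the library's tesseract_to_paddle_language implementation: the function name to_paddle_lang, the mapping table, and the fallback default are all hypothetical, and the sketch assumes the Tesseract-style value is a "+"-joined string of language codes. The set of accepted PaddleOCR keys is copied from the AssertionError above.

```python
# PaddleOCR language keys, copied from the AssertionError in the traceback.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical, partial mapping from Tesseract codes to PaddleOCR keys.
# This is an illustration, not the table used by unstructured.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
    "tam": "ta",
    "tel": "te",
    "kan": "ka",
    "ara": "arabic",
}


def to_paddle_lang(ocr_languages: str, default: str = "en") -> str:
    """Return the first language in a "+"-joined string that PaddleOCR accepts.

    Codes that are already PaddleOCR keys pass through unchanged; known
    Tesseract codes are translated; anything else falls back to `default`.
    """
    for code in ocr_languages.split("+"):
        if code in PADDLE_LANGS:
            return code
        mapped = TESSERACT_TO_PADDLE.get(code)
        if mapped is not None:
            return mapped
    return default


print(to_paddle_lang("eng"))
```

With a conversion like this applied on the ocr_only path, the value reaching PaddleOCR would be "en" rather than "eng", which is what the assertion in parse_lang expects.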

joshrbarcodefactory added the "bug" label on Mar 14, 2025