bug/PaddleOCR language specification issue #3957

Open
joshrbarcodefactory opened this issue Mar 14, 2025 · 0 comments
Labels
bug Something isn't working


Describe the bug
After updating to version 0.17.0, I am hitting the same issue as #3400. When I pass languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR agent is not passed through, and the call fails with the following error:

Traceback (most recent call last):
  File "/<MY_DIR>/main.py", line 20, in <module>
    elements = partition_pdf(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
    return partition_pdf_or_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent(language=ocr_languages)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
    return cls.get_instance(ocr_agent_cls_qname, language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    return loaded_class(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
    self.agent = self.load_agent(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
    paddle_ocr = PaddleOCR(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
    lang, det_lang = parse_lang(params.lang)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
    lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

import os

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en" # found in an old issue
filename = "path_to_your_file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)

Expected behavior
The script runs without errors and returns elements.

Environment Info
Broken Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.17.0
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Working Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.16.25
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Additional context
From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call only happens inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which _partition_pdf_or_image_with_ocr never calls. As a result, languages=["en"] is converted to the Tesseract language format by ocr_languages = prepare_languages_for_tesseract(languages), and that value ("eng", per the error above) reaches ocr_agent = OCRAgent.get_agent(language=ocr_languages) unconverted, which is where the call breaks.
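
To illustrate the mismatch in isolation, here is a minimal sketch. It is not unstructured's actual code: PADDLE_LANGS is copied from the AssertionError message above, while TESSERACT_TO_PADDLE, to_paddle_lang, and is_supported_by_paddle are hypothetical names with a partial, assumed mapping. It only shows the kind of conversion step that the "ocr_only" path appears to be missing before the Paddle agent is created.

```python
# Language codes PaddleOCR accepts, copied from the AssertionError in this report.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical, partial Tesseract -> Paddle table (an assumption for illustration;
# the real table in unstructured may differ).
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
}


def is_supported_by_paddle(lang: str) -> bool:
    """Return True if PaddleOCR would accept this language code."""
    return lang in PADDLE_LANGS


def to_paddle_lang(tesseract_lang: str, default: str = "en") -> str:
    """Convert a Tesseract language string such as "eng" or "eng+kor"
    to a single Paddle code, falling back to `default` when unmapped.

    Hypothetical helper: only the first "+"-separated code is used.
    """
    first = tesseract_lang.split("+")[0]
    return TESSERACT_TO_PADDLE.get(first, default)


# "eng" is what reaches PaddleOCR today, and it is rejected.
print(is_supported_by_paddle("eng"))
# After conversion, the code is one PaddleOCR accepts.
print(is_supported_by_paddle(to_paddle_lang("eng")))
```

If a conversion like this ran on ocr_languages before OCRAgent.get_agent is called for the Paddle agent, the value passed would be "en" rather than "eng".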
