[BUG] Workbook data are omitted in Extractor when using DocumentLoaderSpreadSheet and XLSX document #286

Open
henryk000 opened this issue Mar 6, 2025 · 6 comments

Comments


henryk000 commented Mar 6, 2025

Problem:
While processing an Excel (XLSX) file, the workbook data are not passed to the LLM (only the sheet name is used).

Analysis:

  • The document loader 'DocumentLoaderSpreadSheet' produces a page whose 'content' contains only the sheet name. The workbook data are stored under the 'data' key of the page_dict ('document_loader_spreadsheet.py', lines 88...94).
  • When Extractor.extract(...) is invoked with that document loader, the 'data' information is lost in Extractor._map_to_universal_format(...) (extractor.py, lines 276...277):
    loaded_content = loader.load(source)
    unified_content = self._map_to_universal_format(loaded_content, vision)
  • The function Extractor._map_to_universal_format does not use the 'data' key (of the parsed spreadsheet) while building the unified 'content'. Support for the 'is_spreadsheet' flag is probably missing when page information is processed.
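The analysis above can be illustrated with a small self-contained sketch. It does not use the library's real code: the page layout (a dict with 'content', 'data' and 'is_spreadsheet' keys) is assumed from the description in this issue, and the function names `map_pages_content_only`, `map_pages_with_data` and `rows_to_text` are hypothetical. The first function mimics the reported behaviour, the second shows one possible way to include the rows.

```python
from typing import Any, Dict, List

# Assumed page shape, based only on the keys named in this issue.
Page = Dict[str, Any]


def map_pages_content_only(pages: List[Page]) -> str:
    """Mimic the reported behaviour: only 'content' reaches the prompt."""
    return "\n\n".join(str(page.get("content", "")) for page in pages)


def rows_to_text(rows: List[List[Any]]) -> str:
    """Render rows as tab-separated lines; None becomes an empty cell."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in rows
    )


def map_pages_with_data(pages: List[Page]) -> str:
    """Hypothetical fix: append the rows when the page is a spreadsheet."""
    parts = []
    for page in pages:
        text = str(page.get("content", ""))
        if page.get("is_spreadsheet") and page.get("data"):
            text = f"{text}\n{rows_to_text(page['data'])}"
        parts.append(text)
    return "\n\n".join(parts)


# Illustrative page, not taken from a real workbook.
pages = [
    {
        "content": "Invoices",
        "is_spreadsheet": True,
        "data": [["id", "amount"], [1, 99.5], [2, None]],
    },
]

print(map_pages_content_only(pages))  # sheet name only
print(map_pages_with_data(pages))     # sheet name followed by the rows
```

Tab-separated text is only one choice here; a Markdown table or CSV would work the same way, as long as the rows end up in the text that is sent to the LLM.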

How to reproduce the problem:

Use the 'DocumentLoaderSpreadSheet' document loader with any Excel file and extract data from it with Extractor.

@enoch3712
Owner

@henryk000

Are you a bot?

@henryk000
Author

No, I'm not.

@henryk000
Author

I'm using Extractor to work with different file formats, and I discovered this problem when I used it with an Excel file.

@henryk000
Author

For now, I'll try to get around this problem by creating my own class that inherits from 'DocumentLoaderSpreadSheet' and applies the fix.
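A minimal sketch of such a workaround, under stated assumptions: `StubSpreadSheetLoader` is a hypothetical stand-in for 'DocumentLoaderSpreadSheet' so that the example runs on its own, its `load` takes an in-memory dict instead of a file path, and the page keys are the ones named in this issue. In real use the subclass would inherit from the library's loader instead of the stub.

```python
from typing import Any, Dict, List


class StubSpreadSheetLoader:
    """Hypothetical stand-in for DocumentLoaderSpreadSheet.

    It reproduces the behaviour described in this issue: 'content' holds
    only the sheet name, while the rows sit under 'data'.
    """

    def load(self, source: Dict[str, List[List[Any]]]) -> List[Dict[str, Any]]:
        return [
            {"content": name, "data": rows, "is_spreadsheet": True}
            for name, rows in source.items()
        ]


class SpreadSheetLoaderWithData(StubSpreadSheetLoader):
    """Workaround: fold the rows into 'content' so they are not dropped."""

    def load(self, source: Dict[str, List[List[Any]]]) -> List[Dict[str, Any]]:
        pages = super().load(source)
        for page in pages:
            rows = page.get("data") or []
            # Render each row as a comma-separated line; None becomes empty.
            table = "\n".join(
                ",".join("" if cell is None else str(cell) for cell in row)
                for row in rows
            )
            if table:
                page["content"] = f"{page['content']}\n{table}"
        return pages


# Illustrative workbook: one sheet with a header row and one data row.
workbook = {"Sheet1": [["name", "qty"], ["bolt", 4]]}
pages = SpreadSheetLoaderWithData().load(workbook)
print(pages[0]["content"])
```

Because the rows are copied into 'content' and 'data' is left untouched, this approach would keep working whether or not the extractor later starts reading the 'data' key itself.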

@enoch3712
Owner

@henryk000

Ok great!

I will fix this by tomorrow, but it's better to use a document loader like Docling or Markitdown.

@henryk000
Author

OK, I will take a look at those loaders :)
