[BUG] Workbook data are omitted in Extractor when using DocumentLoaderSpreadSheet with an XLSX document #286

henryk000 · 2025-03-06T13:52:39Z

Problem:
When an Excel file (XLSX) is processed, the workbook data are not passed to the LLM (only the sheet name is used).

Analysis:

The document loader 'DocumentLoaderSpreadSheet' produces a page whose 'content' contains the sheet name. The workbook data are stored under the 'data' key of the page_dict ('document_loader_spreadsheet.py', lines 88...94).
When Extractor.extract(...) is invoked with that document loader, the 'data' information is lost in Extractor._map_to_universal_format(...) (extractor.py, lines 276 ... 277):
loaded_content = loader.load(source)
unified_content = self._map_to_universal_format(loaded_content, vision)
The function Extractor._map_to_universal_format does not use the 'data' key (of the parsed spreadsheet) when building the unified 'content'. Support for the 'is_spreadsheet' flag is probably missing when page information is processed.
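
To illustrate the kind of handling that seems to be missing, here is a minimal standalone sketch. It does not use the library's real code: the function names `rows_to_text` and `map_pages_to_content` are hypothetical, and the page shape (a dict with 'content', 'data' and 'is_spreadsheet', where 'data' is a list of rows) is an assumption based on the description above.

```python
from typing import Any


def rows_to_text(rows: list[list[Any]]) -> str:
    """Render rows as tab-separated lines; None cells become empty strings."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in rows
    )


def map_pages_to_content(pages: list[dict[str, Any]]) -> str:
    """Hypothetical mapper: build one text blob and keep spreadsheet rows."""
    parts = []
    for page in pages:
        text = str(page.get("content", ""))
        # Assumed page shape: spreadsheet pages carry their rows under 'data'.
        if page.get("is_spreadsheet") and page.get("data"):
            text = f"Sheet: {text}\n{rows_to_text(page['data'])}"
        parts.append(text)
    return "\n\n".join(parts)


pages = [
    {
        "content": "Invoices",
        "is_spreadsheet": True,
        "data": [["id", "total"], [1, 99.5]],
    }
]
unified = map_pages_to_content(pages)
```

With this kind of branch, the text handed to the LLM would contain the cell values and not just the sheet name. The tab-separated rendering is only one possible choice.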

How to reproduce the problem:

Use the 'DocumentLoaderSpreadSheet' document loader with any Excel file and try to extract data with Extractor.

enoch3712 · 2025-03-06T14:00:13Z

Are you a bot?

henryk000 · 2025-03-06T14:00:50Z

No.

henryk000 · 2025-03-06T14:02:05Z

I'm using Extractor to work with different file formats, and I discovered this problem when I used an Excel file.

henryk000 · 2025-03-06T14:06:48Z

For now, I'll try to get around this problem by creating my own class that inherits from 'DocumentLoaderSpreadSheet' and applies a fix.
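
A rough sketch of what such a subclass workaround could look like, assuming that load(...) returns a list of page dicts with 'content' and 'data' keys. `StubSpreadsheetLoader` is a stand-in so the example runs on its own (the real base class would be 'DocumentLoaderSpreadSheet'), and `SpreadsheetLoaderWithData` is a hypothetical name.

```python
from typing import Any


class StubSpreadsheetLoader:
    """Stand-in for DocumentLoaderSpreadSheet; returns one fake page."""

    def load(self, source: str) -> list[dict[str, Any]]:
        return [
            {
                "content": "Sheet1",
                "is_spreadsheet": True,
                "data": [["name", "qty"], ["bolt", 4]],
            }
        ]


class SpreadsheetLoaderWithData(StubSpreadsheetLoader):
    """Hypothetical workaround: copy the rows from 'data' into 'content'."""

    def load(self, source: str) -> list[dict[str, Any]]:
        pages = super().load(source)
        for page in pages:
            rows = page.get("data") or []
            if rows:
                table = "\n".join(
                    ", ".join(str(cell) for cell in row) for row in rows
                )
                # Append the rows so a mapper that reads only 'content'
                # still sees the workbook data.
                page["content"] = f"{page['content']}\n{table}"
        return pages


fixed_pages = SpreadsheetLoaderWithData().load("example.xlsx")
```

Because only 'content' is changed, the rest of the pipeline would not need to know about the 'data' key.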

enoch3712 · 2025-03-06T14:08:23Z

Ok great!

I will fix this by tomorrow, but it's better to use a document loader like Docling or Markitdown.

henryk000 · 2025-03-06T14:09:52Z

OK, I will take a look at those loaders :)
