Support AutoTokenizer for text chunking #397

Open · Tracked by #566
chethanuk opened this issue Nov 23, 2024 · 1 comment
Comments

chethanuk (Contributor) commented Nov 23, 2024

Problem

Building the vector index fails with: Single text cannot exceed 8194 tokens. 8746 tokens given.

This happens with the new Jina embedding model: jina-embeddings-v3

Ingesting a large PDF errors out:

background-1  | [2024-11-23 13:48:08,045: ERROR/ForkPoolWorker-5] app.tasks.build_index.build_index_for_document[22305254-69a8-4ec7-bd97-bad0ce25f604]: Failed to build vector index for document 30001: Traceback (most recent call last):
background-1  |   File "/app/app/tasks/build_index.py", line 60, in build_index_for_document
background-1  |     index_service.build_vector_index_for_document(index_session, db_document)
background-1  |   File "/app/app/rag/build_index.py", line 72, in build_vector_index_for_document
background-1  |     vector_index.insert(document, source_uri=db_document.source_uri)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/base.py", line 215, in insert
background-1  |     self.insert_nodes(nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 330, in insert_nodes
background-1  |     self._insert(nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 311, in _insert
background-1  |     self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 231, in _add_nodes_to_index
background-1  |     nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
background-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 138, in _get_node_with_embedding
background-1  |     id_to_embed_map = embed_nodes(
background-1  |                       ^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/utils.py", line 138, in embed_nodes
background-1  |     new_embeddings = embed_model.get_text_embedding_batch(
background-1  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 265, in wrapper
background-1  |     result = func(*args, **kwargs)
background-1  |              ^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/base/embeddings/base.py", line 335, in get_text_embedding_batch
background-1  |     embeddings = self._get_text_embeddings(cur_batch)
background-1  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/embeddings/jinaai/base.py", line 202, in _get_text_embeddings
background-1  |     return self._api.get_embeddings(
background-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/embeddings/jinaai/base.py", line 48, in get_embeddings
background-1  |     raise RuntimeError(resp["detail"])
background-1  | RuntimeError: Single text cannot exceed 8194 tokens. 8746 tokens given.
background-1  | 
background-1  | [2024-11-23 13:48:08,185: INFO/ForkPoolWorker-5] Task app.tasks.build_index.build_index_for_document[22305254-69a8-4ec7-bd97-bad0ce25f604] succeeded in 36.22360512241721s: None

Note: in docker compose, the max value is already set:

EMBEDDING_DIMS=1024
# EMBEDDING_MAX_TOKENS should be equal to or smaller than the embedding model's max tokens,
# it indicates the max size of document chunks.
EMBEDDING_MAX_TOKENS=8191

Solution

Use Hugging Face's AutoTokenizer so that chunking counts tokens with the embedding model's own tokenizer:
https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
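
A minimal sketch of the idea, assuming LlamaIndex's SentenceSplitter (which accepts a custom tokenizer callable); the model id comes from this report, the chunk size mirrors EMBEDDING_MAX_TOKENS above, and the documents variable is a placeholder:

from llama_index.core.node_parser import SentenceSplitter
from transformers import AutoTokenizer

# Load the embedding model's own tokenizer so chunk sizes are measured
# with the same vocabulary the Jina API applies server-side.
# (trust_remote_code=True may be needed for some Jina models on the Hub.)
jina_tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")

splitter = SentenceSplitter(
    chunk_size=8191,                  # EMBEDDING_MAX_TOKENS from docker compose
    chunk_overlap=200,
    tokenizer=jina_tokenizer.encode,  # replaces the default tiktoken-based counter
)

# documents: previously parsed PDF documents (hypothetical variable)
nodes = splitter.get_nodes_from_documents(documents)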

Mini256 (Member) commented Nov 26, 2024

The Jina embedding model may consume more tokens than the OpenAI embedding model when processing text of the same length.

However, SentenceSplitter uses the tokenizer provided by OpenAI to split chunks by default, which may be the cause of the Jina AI embedding error.
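
A quick hypothetical check of this mismatch: count the same text with tiktoken's cl100k_base (assumed here as the encoding LlamaIndex uses for chunking by default) and with the Jina model's tokenizer; the counts can diverge enough that a chunk sized to 8191 tiktoken tokens still exceeds the Jina API's 8194-token limit:

import tiktoken
from transformers import AutoTokenizer

text = "..."  # a long document chunk (placeholder)

# Count under the default tokenizer used when splitting chunks.
openai_count = len(tiktoken.get_encoding("cl100k_base").encode(text))

# Count under the Jina embedding model's own tokenizer.
jina_tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
jina_count = len(jina_tokenizer.encode(text))

# When jina_count > openai_count, a chunk that fits the configured limit
# can still overflow the Jina API, matching the error in this issue.
print(openai_count, jina_count)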

Mini256 changed the title from "Build Vector Index is failing: Single text cannot exceed 8194 tokens. 8746 tokens given." to "Support AutoTokenizer for chunking" on Feb 13, 2025
Mini256 changed the title from "Support AutoTokenizer for chunking" to "Support AutoTokenizer for text chunking" on Feb 13, 2025