LLM_CHUNK implementation #18471

charleschile · 2024-09-01T16:03:39Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_CHUNK function. This function can chunk the content in datalink with 4 chunk strategy available.

Usage: select llm_chunk("<input datalink>", "fixed_width; <width number>"); or select llm_chunk("<input datalink>", "<sentence or paragraph or document>");

Return Value: a JSON-like string representation of an array of chunks with offset and size: [[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]

Example SQL for fixed with:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "fixed_width; 11");

Example return:

[[0, 11, "hello world"], [11, 11, " this is a "], [22, 11, "test? hello"], [33, 11, " world! thi"], [44, 11, "s is a test"], [55, 11, ". hello wor"], [66, 3, "ld."]]

Example SQL for sentence:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "sentence");

Example return:

[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]

qodo-merge-pro · 2024-09-01T16:04:09Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review Error Handling The `ChunkString` function does not handle errors properly. It returns an empty string in case of an error, which may lead to silent failures. Resource Management The `Chunk` function reads the entire file content into memory, which may cause issues with large files. Consider implementing streaming or chunked reading for better memory management.

qodo-merge-pro · 2024-09-01T16:04:43Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Code Suggestions ✨

Category	Suggestion	Score
Error handling	Improve error handling by adding context to returned errors Consider adding error handling for the `ReadFromFile` function call. Currently, if an error occurs, it's immediately returned without any context, which might make debugging more difficult. pkg/sql/plan/function/func_llm.go [118-122] r, err := ReadFromFile(moUrl, fs) if err != nil { - return err + return fmt.Errorf("failed to read file %s: %w", moUrl, err) } defer r.Close() Apply this suggestion Suggestion importance[1-10]: 9 Why: Adding context to errors improves debugging and maintenance by providing more information about the error's origin, which is crucial for diagnosing issues.	9
Performance	Use strings.Builder for more efficient string concatenation Consider using a more efficient string concatenation method, such as strings.Builder, instead of repeatedly appending to a string slice. This can improve performance, especially for large inputs. pkg/sql/plan/function/func_llm.go [84-89] -var chunkStrings []string -for _, chunk := range chunks { - chunkStrings = append(chunkStrings, fmt.Sprintf("[%d, %d, %q]", chunk[0], chunk[1], chunk[2])) +var builder strings.Builder +builder.WriteString("[") +for i, chunk := range chunks { + if i > 0 { + builder.WriteString(", ") + } + fmt.Fprintf(&builder, "[%d, %d, %q]", chunk[0], chunk[1], chunk[2]) } +builder.WriteString("]") +return builder.String() -return "[" + strings.Join(chunkStrings, ", ") + "]" - Apply this suggestion Suggestion importance[1-10]: 8 Why: The suggestion to use strings.Builder is valid as it can improve performance by reducing memory allocations during string concatenation, especially for large inputs.	8
Maintainability	Use a named constant for the blob size limit Consider using a constant for the blob size limit (65536) to improve maintainability and make the code more self-documenting. pkg/sql/plan/function/func_llm.go [127-129] -if len(ctx) > 65536 /blob size/ { +const maxBlobSize = 65536 +if len(ctx) > maxBlobSize { return moerr.NewInternalError(proc.Ctx, "Data too long for blob") } Apply this suggestion Suggestion importance[1-10]: 8 Why: Using a named constant improves code maintainability and readability by making the purpose of the magic number clear and easier to update in the future.	8
Code structure	✅ Refactor the ChunkString function to use early returns for invalid inputs Suggestion Impact: The commit refactored the ChunkString function to handle invalid inputs early, similar to the suggestion. It introduced error handling and returned errors for invalid inputs, improving readability and handling invalid inputs more explicitly. code diff: -func ChunkString(text, mode string) string { +func ChunkString(text, mode string) (string, error) { modeParts := strings.Split(mode, ";") for i := range modeParts { modeParts[i] = strings.TrimSpace(modeParts[i]) @@ -66,8 +66,8 @@ var chunks [][]interface{} if len(modeParts) == 2 && modeParts[0] == "fixed_width" { width, err := strconv.Atoi(modeParts[1]) - if err != nil { - return "" + if width < 0 \|\| err != nil { + return "", moerr.NewInvalidInputNoCtxf("'%s' is not a valid chunk strategy", mode) } chunks = fixedWidthChunk(text, width) } else { @@ -77,7 +77,7 @@ case "paragraph": chunks = paragraphChunk(text) default: - return "" + return "", moerr.NewInvalidInputNoCtxf("'%s' is not a valid chunk strategy", mode) } The `ChunkString` function could benefit from early returns for invalid inputs. This would improve readability and potentially performance by avoiding unnecessary computations. pkg/sql/plan/function/func_llm.go [61-82] func ChunkString(text, mode string) string { modeParts := strings.Split(mode, ";") for i := range modeParts { modeParts[i] = strings.TrimSpace(modeParts[i]) } + var chunks [][]interface{} if len(modeParts) == 2 && modeParts[0] == "fixed_width" { width, err := strconv.Atoi(modeParts[1]) if err != nil { return "" } chunks = fixedWidthChunk(text, width) - } else { + } else if len(modeParts) == 1 { switch modeParts[0] { case "sentence": chunks = sentenceChunk(text) case "paragraph": chunks = paragraphChunk(text) default: return "" } + } else { + return "" } Apply this suggestion Suggestion importance[1-10]: 7 Why: The refactoring suggestion improves readability and potentially performance by handling invalid inputs early, reducing unnecessary computations.	7

pkg/sql/plan/function/func_llm.go

fengttt

Please add result verification in tests.

fengttt

How about we add a "trivial" chunking method that will simply chunk the whole doc as 1 chunk?

charleschile · 2024-09-03T16:17:25Z

How about we add a "trivial" chunking method that will simply chunk the whole doc as 1 chunk?

A new chunk strategy document has been supported, it returns the whoe document

charleschile · 2024-09-09T13:52:08Z

Please add result verification in tests.

result verification has been added.

pkg/sql/plan/function/func_llm.go

LLM_CHUNK implementation

24e1edb

charleschile requested review from m-schen and zhangxu19830126 as code owners September 1, 2024 16:03

qodo-merge-pro bot added Enhancement Dependencies labels Sep 1, 2024

qodo-merge-pro bot added the Review effort [1-5]: 4 label Sep 1, 2024

Merge branch 'main' into ospp/LLM_CHUNK

1ce50be

mergify bot added the kind/feature label Sep 1, 2024

m-schen approved these changes Sep 2, 2024

View reviewed changes

charleschile added 2 commits September 2, 2024 20:42

add unit tests

d91460c

Merge remote-tracking branch 'origin/ospp/LLM_CHUNK' into ospp/LLM_CHUNK

2c7189d

cpegeric reviewed Sep 2, 2024

View reviewed changes

pkg/sql/plan/function/func_llm.go Outdated Show resolved Hide resolved

cpegeric reviewed Sep 2, 2024

View reviewed changes

pkg/sql/plan/function/func_llm.go Outdated Show resolved Hide resolved

llm_chunk tests

3c987c2

charleschile requested review from aressu1985 and heni02 as code owners September 2, 2024 17:16

charleschile added 2 commits September 3, 2024 01:39

fix bug

a3caf60

fix bug

d6ed94e

fengttt requested changes Sep 3, 2024

View reviewed changes

fengttt reviewed Sep 3, 2024

View reviewed changes

support whole document chunk

94fd55f

charleschile added 3 commits September 9, 2024 14:40

Merge branch 'matrixorigin:main' into ospp/LLM_CHUNK

cc307b1

Merge branch 'matrixorigin:main' into ospp/LLM_CHUNK

ab6532d

result verification

36dccea

Merge branch 'main' into ospp/LLM_CHUNK

22a0fef

matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Sep 9, 2024

charleschile added 2 commits September 9, 2024 22:13

fix bug

a64c6ef

Merge branch 'main' into ospp/LLM_CHUNK

d5bba04

fengttt approved these changes Sep 16, 2024

View reviewed changes

zhangxu19830126 approved these changes Sep 20, 2024

View reviewed changes

cpegeric requested changes Sep 20, 2024

View reviewed changes

pkg/sql/plan/function/func_llm.go Show resolved Hide resolved

m-schen approved these changes Sep 20, 2024

View reviewed changes

aressu1985 approved these changes Oct 23, 2024

View reviewed changes

charleschile changed the base branch from main to LLMIndex October 31, 2024 14:26

mergify bot requested a review from sukki37 October 31, 2024 14:26

charleschile had a problem deploying to ci October 31, 2024 14:27 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:27 — with GitHub Actions Inactive

mergify bot mentioned this pull request Oct 31, 2024

Ospp/new llm chunk #19726

Open

7 tasks

Merge branch 'LLMIndex' into ospp/LLM_CHUNK

866939e

charleschile had a problem deploying to ci October 31, 2024 14:37 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:37 — with GitHub Actions Inactive

charleschile had a problem deploying to ci October 31, 2024 14:37 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:37 — with GitHub Actions Inactive

charleschile had a problem deploying to ci October 31, 2024 14:37 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:37 — with GitHub Actions Inactive

sukki37 merged commit 77e7294 into matrixorigin:LLMIndex Oct 31, 2024
9 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM_CHUNK implementation #18471

LLM_CHUNK implementation #18471

charleschile commented Sep 1, 2024 •

edited

Loading

qodo-merge-pro bot commented Sep 1, 2024

qodo-merge-pro bot commented Sep 1, 2024 •

edited

Loading

fengttt left a comment

fengttt left a comment

charleschile commented Sep 3, 2024

charleschile commented Sep 9, 2024

LLM_CHUNK implementation #18471

LLM_CHUNK implementation #18471

Conversation

charleschile commented Sep 1, 2024 • edited Loading

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

qodo-merge-pro bot commented Sep 1, 2024

PR Reviewer Guide 🔍

qodo-merge-pro bot commented Sep 1, 2024 • edited Loading

PR Code Suggestions ✨

fengttt left a comment

Choose a reason for hiding this comment

fengttt left a comment

Choose a reason for hiding this comment

charleschile commented Sep 3, 2024

charleschile commented Sep 9, 2024

charleschile commented Sep 1, 2024 •

edited

Loading

qodo-merge-pro bot commented Sep 1, 2024 •

edited

Loading