Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLM_CHUNK implementation #18471

Merged
merged 15 commits into from
Oct 31, 2024
Merged

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Sep 1, 2024

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_CHUNK function. This function can chunk the content in datalink with 4 chunk strategy available.

Usage: select llm_chunk("<input datalink>", "fixed_width; <width number>"); or select llm_chunk("<input datalink>", "<sentence or paragraph or document>");

Return Value: a JSON-like string representation of an array of chunks with offset and size: [[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]

Example SQL for fixed with:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "fixed_width; 11");

Example return:

[[0, 11, "hello world"], [11, 11, " this is a "], [22, 11, "test? hello"], [33, 11, " world! thi"], [44, 11, "s is a test"], [55, 11, ". hello wor"], [66, 3, "ld."]] 

Example SQL for sentence:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "sentence");

Example return:

[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]

Copy link

qodo-merge-pro bot commented Sep 1, 2024

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review

Error Handling
The ChunkString function does not handle errors properly. It returns an empty string in case of an error, which may lead to silent failures.

Resource Management
The Chunk function reads the entire file content into memory, which may cause issues with large files. Consider implementing streaming or chunked reading for better memory management.

@mergify mergify bot added the kind/feature label Sep 1, 2024
Copy link

qodo-merge-pro bot commented Sep 1, 2024

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Score
Error handling
Improve error handling by adding context to returned errors

Consider adding error handling for the ReadFromFile function call. Currently, if an
error occurs, it's immediately returned without any context, which might make
debugging more difficult.

pkg/sql/plan/function/func_llm.go [118-122]

 r, err := ReadFromFile(moUrl, fs)
 if err != nil {
-    return err
+    return fmt.Errorf("failed to read file %s: %w", moUrl, err)
 }
 defer r.Close()
 
  • Apply this suggestion
Suggestion importance[1-10]: 9

Why: Adding context to errors improves debugging and maintenance by providing more information about the error's origin, which is crucial for diagnosing issues.

9
Performance
Use strings.Builder for more efficient string concatenation

Consider using a more efficient string concatenation method, such as
strings.Builder, instead of repeatedly appending to a string slice. This can improve
performance, especially for large inputs.

pkg/sql/plan/function/func_llm.go [84-89]

-var chunkStrings []string
-for _, chunk := range chunks {
-    chunkStrings = append(chunkStrings, fmt.Sprintf("[%d, %d, %q]", chunk[0], chunk[1], chunk[2]))
+var builder strings.Builder
+builder.WriteString("[")
+for i, chunk := range chunks {
+    if i > 0 {
+        builder.WriteString(", ")
+    }
+    fmt.Fprintf(&builder, "[%d, %d, %q]", chunk[0], chunk[1], chunk[2])
 }
+builder.WriteString("]")
+return builder.String()
 
-return "[" + strings.Join(chunkStrings, ", ") + "]"
-
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: The suggestion to use strings.Builder is valid as it can improve performance by reducing memory allocations during string concatenation, especially for large inputs.

8
Maintainability
Use a named constant for the blob size limit

Consider using a constant for the blob size limit (65536) to improve maintainability
and make the code more self-documenting.

pkg/sql/plan/function/func_llm.go [127-129]

-if len(ctx) > 65536 /*blob size*/ {
+const maxBlobSize = 65536
+if len(ctx) > maxBlobSize {
     return moerr.NewInternalError(proc.Ctx, "Data too long for blob")
 }
 
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: Using a named constant improves code maintainability and readability by making the purpose of the magic number clear and easier to update in the future.

8
Code structure
✅ Refactor the ChunkString function to use early returns for invalid inputs
Suggestion Impact:The commit refactored the ChunkString function to handle invalid inputs early, similar to the suggestion. It introduced error handling and returned errors for invalid inputs, improving readability and handling invalid inputs more explicitly.

code diff:

-func ChunkString(text, mode string) string {
+func ChunkString(text, mode string) (string, error) {
 	modeParts := strings.Split(mode, ";")
 	for i := range modeParts {
 		modeParts[i] = strings.TrimSpace(modeParts[i])
@@ -66,8 +66,8 @@
 	var chunks [][]interface{}
 	if len(modeParts) == 2 && modeParts[0] == "fixed_width" {
 		width, err := strconv.Atoi(modeParts[1])
-		if err != nil {
-			return ""
+		if width < 0 || err != nil {
+			return "", moerr.NewInvalidInputNoCtxf("'%s' is not a valid chunk strategy", mode)
 		}
 		chunks = fixedWidthChunk(text, width)
 	} else {
@@ -77,7 +77,7 @@
 		case "paragraph":
 			chunks = paragraphChunk(text)
 		default:
-			return ""
+			return "", moerr.NewInvalidInputNoCtxf("'%s' is not a valid chunk strategy", mode)
 		}

The ChunkString function could benefit from early returns for invalid inputs. This
would improve readability and potentially performance by avoiding unnecessary
computations.

pkg/sql/plan/function/func_llm.go [61-82]

 func ChunkString(text, mode string) string {
     modeParts := strings.Split(mode, ";")
     for i := range modeParts {
         modeParts[i] = strings.TrimSpace(modeParts[i])
     }
+    
     var chunks [][]interface{}
     if len(modeParts) == 2 && modeParts[0] == "fixed_width" {
         width, err := strconv.Atoi(modeParts[1])
         if err != nil {
             return ""
         }
         chunks = fixedWidthChunk(text, width)
-    } else {
+    } else if len(modeParts) == 1 {
         switch modeParts[0] {
         case "sentence":
             chunks = sentenceChunk(text)
         case "paragraph":
             chunks = paragraphChunk(text)
         default:
             return ""
         }
+    } else {
+        return ""
     }
 
  • Apply this suggestion
Suggestion importance[1-10]: 7

Why: The refactoring suggestion improves readability and potentially performance by handling invalid inputs early, reducing unnecessary computations.

7

Copy link
Contributor

@fengttt fengttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add result verification in tests.

Copy link
Contributor

@fengttt fengttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add a "trivial" chunking method that will simply chunk the whole doc as 1 chunk?

@charleschile
Copy link
Contributor Author

How about we add a "trivial" chunking method that will simply chunk the whole doc as 1 chunk?

A new chunk strategy document has been supported, it returns the whoe document

@charleschile
Copy link
Contributor Author

Please add result verification in tests.

result verification has been added.

@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants