-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(indexing): enhance search, chunking and file watching #1
Conversation
Major improvements to indexing and search functionality: - Add scoring explanations and custom weights - Improve document chunking with better overlap handling - Enhance file watching reliability - Add debug features and logging - Improve test coverage and error handling Co-authored-by: Bob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Reviewed everything up to a1843c3 in 2 minutes and 18 seconds
More details
- Looked at
1501
lines of code in13
files - Skipped
1
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_N9D9GylvlYY6OX61
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Major improvements to indexing and file watching systems: - Refactor search functionality in indexer - Add dedicated methods for result grouping and scoring - Improve code organization and readability - Better handling of chunk grouping - Enhance file watcher reliability - Add robust error handling and retries - Improve file processing logic - Better logging for debugging - Improve test stability - Use unique collection names per test - Add better assertions and timeouts - More reliable test cleanup Co-authored-by: Bob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Incremental review on 95b6585 in 48 seconds
More details
- Looked at
808
lines of code in5
files - Skipped
0
files when reviewing. - Skipped posting
3
drafted comments based on config settings.
1. gptme_rag/indexing/watcher.py:68
- Draft comment:
Usingtime.sleep
to wait for file writes can be unreliable. Consider implementing a more robust method to ensure the file is fully written, such as checking file size stability over a short period. - Reason this comment was not posted:
Confidence changes required:50%
The_queue_update
method inwatcher.py
directly callstime.sleep
to wait for the file to be fully written. This could be improved by using a more robust method to ensure the file is ready, such as checking file size stability over a short period.
2. gptme_rag/indexing/watcher.py:218
- Draft comment:
Consider making the initial wait time for exponential backoff configurable to allow flexibility in different environments. - Reason this comment was not posted:
Confidence changes required:30%
Inwatcher.py
, the_update_index_with_retries
method uses exponential backoff for retries, but the initial wait time is hardcoded. It might be beneficial to make this configurable for flexibility in different environments.
3. tests/test_indexing.py:78
- Draft comment:
Ensure that the collection is reset before each test that modifies the index to maintain a clean state. This is done here but should be consistent across all relevant tests. - Reason this comment was not posted:
Confidence changes required:20%
Intest_indexing.py
, thetest_indexer_add_documents
function includes a reset of the collection before adding documents. This is a good practice to ensure a clean state, but it should be consistently applied across all tests that modify the index.
Workflow ID: wflow_rKmO5ZLJTHWJmxRk
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
- Add shared indexer fixture with automatic cleanup - Add cleanup_chroma fixture to reset ChromaDB between tests - Refactor test files to use shared fixtures - Remove duplicated setup code - Use tmp_path fixture instead of custom temp_dir Co-authored-by: Bob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on d79047d in 29 seconds
More details
- Looked at
573
lines of code in4
files - Skipped
0
files when reviewing. - Skipped posting
8
drafted comments based on config settings.
1. tests/test_chunking.py:54
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
2. tests/test_chunking.py:56
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
3. tests/test_chunking.py:58
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
4. tests/test_chunking.py:61
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
5. tests/test_chunking.py:63
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
6. tests/test_chunking.py:66
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
7. tests/test_chunking.py:70
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
8. tests/test_chunking.py:72
- Draft comment:
Usingprint
statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions. - Reason this comment was not posted:
Confidence changes required:50%
The use ofprint
statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
Workflow ID: wflow_Aj1IjtJ2rW0dg7N6
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Replace debug print statements with proper logging in indexer. Clean up test files by removing debug prints and improving assertions. Add more descriptive error messages in tests. Co-authored-by: Bob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 77fbd13 in 25 seconds
More details
- Looked at
198
lines of code in4
files - Skipped
0
files when reviewing. - Skipped posting
2
drafted comments based on config settings.
1. tests/test_chunking.py:108
- Draft comment:
Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well. - Reason this comment was not posted:
Confidence changes required:50%
The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.
2. tests/test_document_processor.py:95
- Draft comment:
Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well. - Reason this comment was not posted:
Confidence changes required:50%
The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.
Workflow ID: wflow_OW32mc6LKAHWMenl
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 7f59ae2 in 20 seconds
More details
- Looked at
65
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_Zn6mz4NUZ3Qm2xMM
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
- Add progress bar using tqdm for document indexing - Refactor document collection and processing for better efficiency - Improve error handling and logging - Add JSON support for scoring weights - Split indexing into collection and processing phases Co-authored-by: Bob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on a4a63d0 in 1 minute and 1 seconds
More details
- Looked at
430
lines of code in3
files - Skipped
1
files when reviewing. - Skipped posting
3
drafted comments based on config settings.
1. gptme_rag/cli.py:138
- Draft comment:
Redundant import ofjson
. It's already imported at the top of the file. Remove this import to clean up the code. - Reason this comment was not posted:
Confidence changes required:50%
The import ofjson
incli.py
is redundant as it is already imported at the top of the file. This should be removed to clean up the code.
2. gptme_rag/indexing/indexer.py:178
- Draft comment:
Converting the generatoradd_documents_progress
to a list is unnecessary and can be removed to improve performance. - Reason this comment was not posted:
Confidence changes required:50%
Theadd_documents
method inindexer.py
callsadd_documents_progress
and converts its result to a list, which is unnecessary sinceadd_documents_progress
is a generator. This conversion should be removed to improve performance.
3. gptme_rag/indexing/indexer.py:859
- Draft comment:
The condition should be>
instead of>=
when checking if the file limit is reached. This ensures the warning is only logged when the limit is actually exceeded. - Reason this comment was not posted:
Confidence changes required:50%
Inindexer.py
, the_get_valid_files
method checks if the number of valid files is greater than or equal to the file limit, but the warning message suggests the limit is already reached. The condition should be>
instead of>=
. This applies to the_get_valid_files
method.
Workflow ID: wflow_Ai5mijsS4NPLuRsw
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Major improvements to indexing and search functionality:
Important
Enhances indexing and search with scoring explanations, improved chunking, reliable file watching, and better test coverage.
Indexer.search()
.document_processor.py
.watcher.py
.indexer.py
andwatcher.py
.cli.py
to support new search options like scoring explanations and custom weights.test_chunking.py
,test_indexing.py
, andtest_watcher.py
.typecheck
command toMakefile
.This description was created by
for a4a63d0. It will automatically update as commits are pushed.