Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(indexing): enhance search, chunking and file watching #1

Merged
merged 6 commits into from
Dec 11, 2024

Conversation

ErikBjare
Copy link
Member

@ErikBjare ErikBjare commented Dec 11, 2024

Major improvements to indexing and search functionality:

  • Add scoring explanations and custom weights
  • Improve document chunking with better overlap handling
  • Enhance file watching reliability
  • Add debug features and logging
  • Improve test coverage and error handling

Important

Enhances indexing and search with scoring explanations, improved chunking, reliable file watching, and better test coverage.

  • Behavior:
    • Adds scoring explanations and custom weights to Indexer.search().
    • Improves document chunking with better overlap handling in document_processor.py.
    • Enhances file watching reliability in watcher.py.
    • Adds debug features and logging in indexer.py and watcher.py.
  • CLI:
    • Updates cli.py to support new search options like scoring explanations and custom weights.
  • Tests:
    • Improves test coverage for chunking, indexing, and file watching in test_chunking.py, test_indexing.py, and test_watcher.py.
  • Misc:
    • Adds typecheck command to Makefile.
    • Updates example scripts to reflect changes in search API.

This description was created by Ellipsis for a4a63d0. It will automatically update as commits are pushed.

Major improvements to indexing and search functionality:
- Add scoring explanations and custom weights
- Improve document chunking with better overlap handling
- Enhance file watching reliability
- Add debug features and logging
- Improve test coverage and error handling

Co-authored-by: Bob <[email protected]>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to a1843c3 in 2 minutes and 18 seconds

More details
  • Looked at 1501 lines of code in 13 files
  • Skipped 1 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_N9D9GylvlYY6OX61


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Major improvements to indexing and file watching systems:

- Refactor search functionality in indexer
  - Add dedicated methods for result grouping and scoring
  - Improve code organization and readability
  - Better handling of chunk grouping

- Enhance file watcher reliability
  - Add robust error handling and retries
  - Improve file processing logic
  - Better logging for debugging

- Improve test stability
  - Use unique collection names per test
  - Add better assertions and timeouts
  - More reliable test cleanup

Co-authored-by: Bob <[email protected]>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 95b6585 in 48 seconds

More details
  • Looked at 808 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. gptme_rag/indexing/watcher.py:68
  • Draft comment:
    Using time.sleep to wait for file writes can be unreliable. Consider implementing a more robust method to ensure the file is fully written, such as checking file size stability over a short period.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The _queue_update method in watcher.py directly calls time.sleep to wait for the file to be fully written. This could be improved by using a more robust method to ensure the file is ready, such as checking file size stability over a short period.
2. gptme_rag/indexing/watcher.py:218
  • Draft comment:
    Consider making the initial wait time for exponential backoff configurable to allow flexibility in different environments.
  • Reason this comment was not posted:
    Confidence changes required: 30%
    In watcher.py, the _update_index_with_retries method uses exponential backoff for retries, but the initial wait time is hardcoded. It might be beneficial to make this configurable for flexibility in different environments.
3. tests/test_indexing.py:78
  • Draft comment:
    Ensure that the collection is reset before each test that modifies the index to maintain a clean state. This is done here but should be consistent across all relevant tests.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    In test_indexing.py, the test_indexer_add_documents function includes a reset of the collection before adding documents. This is a good practice to ensure a clean state, but it should be consistently applied across all tests that modify the index.

Workflow ID: wflow_rKmO5ZLJTHWJmxRk


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@codecov-commenter
Copy link

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

- Add shared indexer fixture with automatic cleanup
- Add cleanup_chroma fixture to reset ChromaDB between tests
- Refactor test files to use shared fixtures
- Remove duplicated setup code
- Use tmp_path fixture instead of custom temp_dir

Co-authored-by: Bob <[email protected]>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on d79047d in 29 seconds

More details
  • Looked at 573 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 8 drafted comments based on config settings.
1. tests/test_chunking.py:54
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
2. tests/test_chunking.py:56
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
3. tests/test_chunking.py:58
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
4. tests/test_chunking.py:61
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
5. tests/test_chunking.py:63
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
6. tests/test_chunking.py:66
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
7. tests/test_chunking.py:70
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
8. tests/test_chunking.py:72
  • Draft comment:
    Using print statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The use of print statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.

Workflow ID: wflow_Aj1IjtJ2rW0dg7N6


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Replace debug print statements with proper logging in indexer.
Clean up test files by removing debug prints and improving assertions.
Add more descriptive error messages in tests.

Co-authored-by: Bob <[email protected]>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 77fbd13 in 25 seconds

More details
  • Looked at 198 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. tests/test_chunking.py:108
  • Draft comment:
    Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.
2. tests/test_document_processor.py:95
  • Draft comment:
    Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.

Workflow ID: wflow_OW32mc6LKAHWMenl


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 7f59ae2 in 20 seconds

More details
  • Looked at 65 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_Zn6mz4NUZ3Qm2xMM


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

- Add progress bar using tqdm for document indexing
- Refactor document collection and processing for better efficiency
- Improve error handling and logging
- Add JSON support for scoring weights
- Split indexing into collection and processing phases

Co-authored-by: Bob <[email protected]>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on a4a63d0 in 1 minute and 1 seconds

More details
  • Looked at 430 lines of code in 3 files
  • Skipped 1 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. gptme_rag/cli.py:138
  • Draft comment:
    Redundant import of json. It's already imported at the top of the file. Remove this import to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import of json in cli.py is redundant as it is already imported at the top of the file. This should be removed to clean up the code.
2. gptme_rag/indexing/indexer.py:178
  • Draft comment:
    Converting the generator add_documents_progress to a list is unnecessary and can be removed to improve performance.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The add_documents method in indexer.py calls add_documents_progress and converts its result to a list, which is unnecessary since add_documents_progress is a generator. This conversion should be removed to improve performance.
3. gptme_rag/indexing/indexer.py:859
  • Draft comment:
    The condition should be > instead of >= when checking if the file limit is reached. This ensures the warning is only logged when the limit is actually exceeded.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    In indexer.py, the _get_valid_files method checks if the number of valid files is greater than or equal to the file limit, but the warning message suggests the limit is already reached. The condition should be > instead of >=. This applies to the _get_valid_files method.

Workflow ID: wflow_Ai5mijsS4NPLuRsw


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@ErikBjare ErikBjare merged commit 18ba7cb into master Dec 11, 2024
1 check passed
@ErikBjare ErikBjare deleted the dev/lots-of-improvements branch December 11, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants