-
-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added notifications query perf testing #405
base: main
Are you sure you want to change the base?
Added notifications query perf testing #405
Conversation
WalkthroughThe changes introduce a notifications system to manage follow event data. A new database table named ✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cedar/schema/tables.sql (1)
69-81
:⚠️ Potential issueMissing index on user_id column
While the table structure looks good, the AI summary mentioned an index named
idx_notifications_user_id
on theuser_id
column, but it's not present in this code. Since querying notifications by user will be a common operation, this index is important for performance.Add the following index after the table creation:
CREATE TABLE notifications ( internal_id INT AUTO_INCREMENT PRIMARY KEY, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW user_id INT NOT NULL, -- The user the notification is for account_id INT NOT NULL, -- The account that "did" the notification post_id INT NULL, -- NULL for FOLLOW events FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE, FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE, FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE ); + +CREATE INDEX idx_notifications_user_id ON notifications(user_id);
🧹 Nitpick comments (8)
cedar/schema/indexes.sql (1)
30-33
: New Notification Index Added
The addition of theidx_notifications_user_id
index on thenotifications
table is a solid step toward improving lookup performance for queries filtering byuser_id
. Please ensure that any future queries filtering by additional columns (e.g.,event_type
) consider whether a composite index might yield further benefits.cedar/data-generation/Dockerfile (1)
16-17
: Include New Notification Generation Scripts in Docker Build
The new COPY instructions forgenerate-follows-notifications.js
andgenerate-follows-notifications-worker.js
ensure that the Docker image contains all necessary files for handling notifications. For clarity and future maintainability, consider adding a comment explaining the purpose of these new files.cedar/data-generation/data-upload.sh (1)
37-47
: Import Loop for Notifications Data CSV Files
The newly added loop for importing notifications via CSV is well integrated with the rest of the script. For consistency and to avoid potential issues with filenames containing spaces, it’s recommended to wrap the$file
variable in quotes, similar to what’s done in the follows import loop. For example:- $file + "$file"This minor tweak enhances robustness without affecting functionality.
cedar/data-generation/README.md (1)
51-70
: Documentation clearly explains the notifications generation processThe new section provides a clear explanation of the notifications generation process, following the same structure as the existing "Feeds Generation" section and effectively communicating the challenges and approach.
However, there's a minor grammatical error:
-Find all the accounts that follows the user's account +Find all the accounts that follow the user's account🧰 Tools
🪛 LanguageTool
[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...(ALL_OF_THE)
cedar/data-generation/generate-follows-notifications.js (1)
22-24
: Consider adding validation for CSV filesThe code filters for files that start with 'follows_' and end with '.csv', but doesn't verify if the files are readable or properly formatted before distribution to workers.
Consider adding basic validation:
const followsFiles = fs .readdirSync(DATA_DIR) - .filter((file) => file.startsWith('follows_') && file.endsWith('.csv')); + .filter((file) => file.startsWith('follows_') && file.endsWith('.csv')) + .filter((file) => { + try { + // Check if file is readable + fs.accessSync(`${DATA_DIR}/${file}`, fs.constants.R_OK); + return true; + } catch (err) { + console.warn(`Skipping inaccessible file: ${file}`); + return false; + } + });cedar/data-generation/generate-follows-notifications-worker.js (3)
10-23
: Consider adding error handling for CSV parsingThe
parseCSV
function assumes well-formed CSV data. If the data has an incorrect format (e.g., inconsistent number of values per row), this could cause issues.function parseCSV(data) { const lines = data.trim().split('\n'); const headers = lines[0].split(','); return lines.slice(1).map((line) => { const values = line.split(','); + + // Skip malformed rows + if (values.length !== headers.length) { + console.warn(`Skipping malformed CSV line: ${line}`); + return null; + } return headers.reduce((object, header, index) => { object[header] = values[index]; return object; }, {}); - }); + }).filter(Boolean); }
28-35
: Consider using streaming for large user data filesLoading the entire
users.csv
file synchronously withreadFileSync
could cause memory issues if the file is very large. Consider using a streaming approach similar to how you're handling the follows files.- const usersData = fs.readFileSync( - path.join(DATA_DIR, 'users.csv'), - 'utf-8', - ); - - for (const user of parseCSV(usersData)) { - users[user.account_id] = user; - } + // Load users with stream processing + const usersReadStream = fs.createReadStream(path.join(DATA_DIR, 'users.csv')); + const usersRL = readline.createInterface({ + input: usersReadStream, + crlfDelay: Number.POSITIVE_INFINITY, + }); + + let isFirstUserLine = true; + let headers; + + for await (const line of usersRL) { + if (isFirstUserLine) { + headers = line.split(','); + isFirstUserLine = false; + continue; + } + + const values = line.split(','); + const user = headers.reduce((obj, header, idx) => { + obj[header] = values[idx]; + return obj; + }, {}); + + users[user.account_id] = user; + } + + usersRL.close();
61-68
: Add logging for skipped accounts without usersThe code silently skips accounts that don't have an associated user. It would be helpful to add some logging to track how many accounts are being skipped.
const [followerId, followingId] = line.split(','); const user = users[followingId]; // Not all accounts are associated with a user if (!user) { + // Add a counter or logging for diagnostic purposes + // console.log(`No user found for account ID: ${followingId}`); continue; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cedar/data-generation/Dockerfile
(1 hunks)cedar/data-generation/README.md
(1 hunks)cedar/data-generation/data-generate.sh
(1 hunks)cedar/data-generation/data-stats.sh
(1 hunks)cedar/data-generation/data-upload.sh
(1 hunks)cedar/data-generation/generate-follows-notifications-worker.js
(1 hunks)cedar/data-generation/generate-follows-notifications.js
(1 hunks)cedar/query-runner/queries/read-notifications.sql
(1 hunks)cedar/schema/indexes.sql
(1 hunks)cedar/schema/tables.sql
(1 hunks)cedar/upload-data.sh
(1 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
cedar/data-generation/generate-follows-notifications.js (1)
cedar/data-generation/generate-follows-notifications-worker.js (3) (3)
require
(1:1)require
(6:6)fs
(2:2)
cedar/data-generation/generate-follows-notifications-worker.js (1)
cedar/data-generation/generate-follows-notifications.js (4) (4)
require
(1:1)require
(6:6)fs
(3:3)followsFiles
(22:24)
🪛 LanguageTool
cedar/data-generation/README.md
[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...
(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...
(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...
(ALL_OF_THE)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build, Test and Deploy
🔇 Additional comments (13)
cedar/data-generation/data-generate.sh (1)
12-12
: Integrate Notifications Data Generation
Adding the commandnode generate-follows-notifications.js
integrates notifications data generation into the overall workflow. Ensure that the new script includes proper error handling and logs meaningful output in case of failures.cedar/data-generation/data-stats.sh (1)
11-17
: Extended Data Statistics for Notifications
Adding the line for counting notifications via the glob pattern./data/follows_notifications_*.csv
improves overall insight into the data generation process. Verify that the glob pattern correctly matches the intended files in different environments.cedar/upload-data.sh (1)
35-45
: Well-structured implementation for notifications importThe implementation follows the established pattern for importing feeds and follows data, correctly specifying the notifications table and the required columns (user_id, account_id, event_type).
cedar/query-runner/queries/read-notifications.sql (1)
1-28
: Well-structured query with effective joins and paginationThe query is well-organized with:
- Appropriate column selections with clear aliases
- Proper LEFT JOIN for posts (allowing notifications without posts)
- INNER JOIN for accounts (requiring an account for each notification)
- Efficient filtering and pagination using internal_id as a cursor
- Helpful comments explaining the filtering and ordering constraints
The query's structure will work well for performance testing notifications retrieval.
cedar/data-generation/generate-follows-notifications.js (4)
1-9
: Good module imports and worker thread configurationThe file correctly imports necessary modules and configures the number of worker threads based on available CPU cores, which allows for efficient parallelization of the notification generation process.
19-44
: Effective worker distribution strategyThe main function effectively:
- Reads and filters follows files
- Calculates an appropriate number of files per worker
- Creates and configures worker threads with their portion of files
This approach ensures balanced distribution of work across available CPU cores.
51-78
: Robust worker communication and error handlingThe code effectively:
- Uses Promise.all to wait for all workers to complete
- Implements proper message handling for progress updates
- Captures and propagates worker errors
- Detects abnormal worker exits
This ensures reliable operation and proper error reporting during the multi-threaded processing.
92-99
: Clean process exit handlingThe main function is properly wrapped with promise handling that ensures the process exits with the appropriate status code (0 for success, 1 for error), and errors are properly logged.
cedar/schema/tables.sql (1)
70-80
: Good schema design for notifications tableThe notifications table has a well-designed schema with:
- An auto-incrementing primary key
- A timestamp field with appropriate default
- Enumerated event types with comments explaining the values
- Proper foreign key constraints with ON DELETE CASCADE
- Appropriate NULL constraint for post_id since FOLLOW events won't reference posts
cedar/data-generation/generate-follows-notifications-worker.js (4)
1-7
: Proper module imports and configurationThe worker thread imports and configuration look good. The code properly uses the Node.js worker threads API and filesystem modules.
69-76
: Good use of batch processingThe implementation efficiently processes data in batches using a configurable chunk size, which is good for performance when dealing with large datasets.
84-87
: Progress tracking implementationGood implementation of progress tracking by sending messages to the parent thread after each file is processed.
49-49
: CSV header consistencyThe output CSV header matches the expected columns for the notifications table. Ensure this remains consistent with the database schema if any changes are made.
0b188f1
to
817102a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
cedar/data-generation/README.md (1)
51-70
: Well-structured documentation for the new Notifications Generation process.The section clearly explains the process and challenges of generating notifications from follows data, mirroring the structure of the existing Feeds Generation section, which provides consistency to readers.
Consider making these minor stylistic improvements:
-Finding all of the accounts that follow a user's account is difficult because -we can't load all of the follow relationships into memory (as they are contained +Finding all accounts that follow a user's account is difficult because +we can't load all follow relationships into memory (as they are contained🧰 Tools
🪛 LanguageTool
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...(ALL_OF_THE)
cedar/query-runner/queries/read-notifications.sql (2)
1-31
: Well-structured query for retrieving notifications with related data.The query properly joins the notifications table with accounts and posts tables, applies appropriate filtering, and includes pagination.
I notice you're doing a LEFT JOIN to the
reply_posts
alias (line 21) but not selecting any fields from it. Consider either:
- Adding fields from reply_posts to the SELECT statement:
posts.title AS post_title, posts.content AS post_content, posts.type AS post_type + reply_posts.title AS reply_post_title, + reply_posts.content AS reply_post_content, + reply_posts.type AS reply_post_type
- Or removing the join if it's not needed for this query:
LEFT JOIN posts on posts.internal_id = notifications.post_id -LEFT JOIN - posts AS reply_posts on reply_posts.internal_id = notifications.reply_post_id
24-27
: Consider parameterizing the user_id and cursor values.The query currently uses hardcoded values (user_id = 2340, internal_id < 20) which appear to be for testing.
For production use, these should be parameterized. Depending on your query execution system, you might use:
- notifications.user_id = 2340 + notifications.user_id = :user_id AND - notifications.internal_id < 20 -- the cursor for pagination + notifications.internal_id < :cursor -- the cursor for paginationcedar/data-generation/generate-follows-notifications.js (2)
10-17
: Consider enhancing the progress bar format.The current progress bar only shows percentage completion.
Add more information to make monitoring more useful:
{ clearOnComplete: false, hideCursor: true, - format: '{bar} {percentage}%', + format: '{bar} {percentage}% | {value}/{total} files | ETA: {eta}s', },
92-99
: Consider alternate exit strategy.Using
process.exit()
can abruptly terminate the Node.js process, potentially interrupting other asynchronous operations.For cleaner shutdown, consider allowing natural program completion:
main() .then(() => { - process.exit(0); + console.log('Process completed successfully'); }) .catch((error) => { console.error(error); - process.exit(1); + // The uncaught exception will naturally exit with non-zero code });cedar/schema/tables.sql (1)
70-82
: Well-designed notifications table with appropriate relationships.The table structure captures the necessary data for different notification types and includes proper foreign key constraints.
Consider these schema improvements:
- Use ENUM for event_type for better readability and data integrity:
- event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW + event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,
- Add a composite index to optimize filtering by user_id and event_type:
CREATE INDEX idx_notifications_user_event ON notifications (user_id, event_type);
- Consider adding a unique constraint if duplicate notifications should be prevented:
UNIQUE KEY unique_notification (user_id, event_type, account_id, post_id, reply_post_id)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cedar/data-generation/Dockerfile
(1 hunks)cedar/data-generation/README.md
(1 hunks)cedar/data-generation/data-generate.sh
(1 hunks)cedar/data-generation/data-stats.sh
(1 hunks)cedar/data-generation/data-upload.sh
(1 hunks)cedar/data-generation/generate-follows-notifications-worker.js
(1 hunks)cedar/data-generation/generate-follows-notifications.js
(1 hunks)cedar/query-runner/queries/read-notifications.sql
(1 hunks)cedar/schema/indexes.sql
(1 hunks)cedar/schema/tables.sql
(1 hunks)cedar/upload-data.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- cedar/data-generation/data-generate.sh
- cedar/schema/indexes.sql
- cedar/data-generation/data-upload.sh
- cedar/data-generation/Dockerfile
- cedar/data-generation/data-stats.sh
- cedar/upload-data.sh
- cedar/data-generation/generate-follows-notifications-worker.js
🧰 Additional context used
🪛 LanguageTool
cedar/data-generation/README.md
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...
(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...
(ALL_OF_THE)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build, Test and Deploy
🔇 Additional comments (2)
cedar/data-generation/generate-follows-notifications.js (2)
1-18
: Good setup for multi-threaded processing with progress reporting.The code properly imports necessary modules and configures the progress bar for visual feedback.
19-90
: Well-structured main function with proper error handling.The implementation handles file distribution across workers, tracks progress, and reports completion time effectively.
refs https://linear.app/ghost/issue/AP-897/[notifications-perf]-step-1-validate-performance-improvement-at Added data generation scripts for notifications and a query to read notifications
817102a
to
8e082a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (7)
cedar/args.sh (1)
1-5
: Add a shebang and declare variables as exported.The script is missing a shebang line and the variables aren't exported, which might cause issues when they're referenced from other scripts.
+#!/bin/bash -INSTANCE_NAME="cedar-test-db-mb170325" -TIER="db-f1-micro" -REGION="us-central1" -DB_NAME="activitypub" -BUCKET_NAME="cedar-test-data-mb170325" +export INSTANCE_NAME="cedar-test-db-mb170325" +export TIER="db-f1-micro" +export REGION="us-central1" +export DB_NAME="activitypub" +export BUCKET_NAME="cedar-test-data-mb170325"🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 2-2: TIER appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 3-3: REGION appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).
(SC2034)
cedar/data-generation/README.md (2)
56-56
: Fix verb agreement.The verb should agree with the plural subject "accounts".
- Find all the accounts that follows the user's account + Find all the accounts that follow the user's account🧰 Tools
🪛 LanguageTool
[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
59-60
: Consider more concise phrasing.For better readability, consider removing "of" in these phrases.
- Finding all of the accounts that follow a user's account is difficult because - we can't load all of the follow relationships into memory (as they are contained + Finding all the accounts that follow a user's account is difficult because + we can't load all the follow relationships into memory (as they are contained🧰 Tools
🪛 LanguageTool
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...(ALL_OF_THE)
cedar/data-generation/generate-follows-notifications.js (2)
19-90
: Consider adding documentation for the main function.Adding a JSDoc comment would help clarify the purpose and behavior of the main function.
+/** + * Main function to generate follow notifications. + * - Reads all follows CSV files + * - Distributes processing across multiple worker threads + * - Tracks progress and handles errors + * - Reports completion time + */ async function main() {
54-76
: Add error details to rejection message.When handling worker errors, it would be helpful to include the error details in the rejection.
- worker.on('error', reject); + worker.on('error', (error) => { + reject(new Error(`Worker error: ${error.message}`)); + });cedar/schema/tables.sql (2)
70-82
: Consider adding a composite index for filtering notifications.Users will likely filter notifications by both user_id and event_type, so a composite index would be beneficial.
FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE, FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE, FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE, + INDEX idx_user_event (user_id, event_type) );
72-73
: Consider using an ENUM for event_type.Using an ENUM for event_type would provide better type safety and readability than a TINYINT with comments.
- event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW + event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
cedar/args.sh
(1 hunks)cedar/data-generation/Dockerfile
(1 hunks)cedar/data-generation/README.md
(1 hunks)cedar/data-generation/config.js
(1 hunks)cedar/data-generation/data-generate.sh
(1 hunks)cedar/data-generation/data-stats.sh
(1 hunks)cedar/data-generation/data-upload.sh
(1 hunks)cedar/data-generation/generate-follows-notifications-worker.js
(1 hunks)cedar/data-generation/generate-follows-notifications.js
(1 hunks)cedar/query-runner/queries/read-notifications.sql
(1 hunks)cedar/schema/indexes.sql
(1 hunks)cedar/schema/tables.sql
(1 hunks)cedar/upload-data.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- cedar/data-generation/data-generate.sh
- cedar/schema/indexes.sql
- cedar/data-generation/data-stats.sh
- cedar/data-generation/Dockerfile
- cedar/data-generation/data-upload.sh
- cedar/upload-data.sh
- cedar/query-runner/queries/read-notifications.sql
- cedar/data-generation/generate-follows-notifications-worker.js
🧰 Additional context used
🪛 Shellcheck (0.10.0)
cedar/args.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 2-2: TIER appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 3-3: REGION appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).
(SC2034)
🪛 LanguageTool
cedar/data-generation/README.md
[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...
(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...
(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...
(ALL_OF_THE)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build, Test and Deploy
🔇 Additional comments (4)
cedar/data-generation/config.js (1)
1-1
: Scaling factor reduced by 100x for testing purposes.The
SCALING_FACTOR
has been reduced from 1 to 0.01, which will significantly decrease the volume of test data:
- NUM_SITES: 60,000 → 600
- NUM_USERS: 300,000 → 3,000
- NUM_ACCOUNTS: 2.5M → 25,000
- NUM_POSTS: 3M → 30,000
This change will speed up notification performance testing by requiring less data generation time and resources, while still maintaining proper proportional relationships between data types.
cedar/args.sh (1)
1-1
:❓ Verification inconclusive
Personalized instance name might cause issues for other users.
The instance name appears to include a personal identifier (
mb170325
), which could cause naming conflicts or confusion if other developers use this script directly.Consider whether this change should be committed or if it would be better to use environment variables or a gitignored configuration file for personalized settings.
🏁 Script executed:
#!/bin/bash # Check if this is a personal setting that shouldn't be committed git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+"Length of output: 68
Action: Verify Instance Name Configuration
- The file
cedar/args.sh
still setsINSTANCE_NAME
to a personalized value (cedar-test-db-mb170325
), which could be problematic for other users.- Consider externalizing this setting (e.g., via an environment variable or a gitignored configuration file) to avoid potential naming conflicts.
- Note: The diff check using
git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+"
produced no output, so please manually verify that the personalized instance name is intentionally maintained (or adjust if not).🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).
(SC2034)
cedar/data-generation/README.md (1)
51-70
: Documentation for Notifications Generation is well-structured.The new section follows the same format as existing sections and clearly explains the notifications generation process.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...(ALL_OF_THE)
[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...(ALL_OF_THE)
cedar/data-generation/generate-follows-notifications.js (1)
1-99
: Well-structured multi-threaded notification generation script.The implementation follows the same patterns as other generation scripts in the codebase, with proper parallelization, error handling, and progress reporting.
CREATE TABLE notifications ( | ||
internal_id INT AUTO_INCREMENT PRIMARY KEY, | ||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, | ||
event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW | ||
user_id INT NOT NULL, -- The user the notification is for | ||
account_id INT NOT NULL, -- The account that "did" the notification | ||
post_id INT NULL, -- NULL for FOLLOW events | ||
reply_post_id INT NULL, -- NULL for non-reply events | ||
FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE, | ||
FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE, | ||
FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE, | ||
FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add an index on the user_id column for notification queries.
Since notifications will likely be queried by user_id frequently, adding an index would improve query performance.
CREATE TABLE notifications (
internal_id INT AUTO_INCREMENT PRIMARY KEY,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
user_id INT NOT NULL, -- The user the notification is for
account_id INT NOT NULL, -- The account that "did" the notification
post_id INT NULL, -- NULL for FOLLOW events
reply_post_id INT NULL, -- NULL for non-reply events
FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
+ INDEX idx_user_id (user_id)
);
Committable suggestion skipped: line range outside the PR's diff.
refs https://linear.app/ghost/issue/AP-897
Added data generation scripts for notifications and a query to read notifications