Added notifications query perf testing #405

mike182uk · 2025-03-18T15:21:22Z

refs https://linear.app/ghost/issue/AP-897

Added data generation scripts for notifications and a query to read notifications

coderabbitai · 2025-03-18T15:21:30Z

Walkthrough

The changes introduce a notifications system to manage follow event data. A new database table named notifications has been created, along with an index on the user_id column to facilitate efficient retrieval of notifications. Two new JavaScript files, generate-follows-notifications.js and generate-follows-notifications-worker.js, implement multi-threaded logic for generating notifications from CSV files. Modifications to existing shell scripts enhance the overall data generation process by integrating the generation and import of notifications data. Additionally, updates to the output formatting in the data-stats.sh script improve readability and include counts for notifications. A new SQL query has been added to retrieve detailed notification information, and the README documentation has been updated to provide instructions on processing notifications data. These changes collectively support the generation, import, and querying of notifications within the data processing workflow.

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

cedar/schema/tables.sql (1)
69-81: ⚠️ Potential issue

Missing index on user_id column

While the table structure looks good, the AI summary mentioned an index named idx_notifications_user_id on the user_id column, but it's not present in this code. Since querying notifications by user will be a common operation, this index is important for performance.

Add the following index after the table creation:
CREATE TABLE notifications (
    internal_id INT AUTO_INCREMENT PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
    user_id INT NOT NULL, -- The user the notification is for
    account_id INT NOT NULL, -- The account that "did" the notification
    post_id INT NULL, -- NULL for FOLLOW events
    FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE
);
+
+CREATE INDEX idx_notifications_user_id ON notifications(user_id);

🧹 Nitpick comments (8)

cedar/schema/indexes.sql (1)

30-33: New Notification Index Added
The addition of the idx_notifications_user_id index on the notifications table is a solid step toward improving lookup performance for queries filtering by user_id. Please ensure that any future queries filtering by additional columns (e.g., event_type) consider whether a composite index might yield further benefits.

cedar/data-generation/Dockerfile (1)

16-17: Include New Notification Generation Scripts in Docker Build
The new COPY instructions for generate-follows-notifications.js and generate-follows-notifications-worker.js ensure that the Docker image contains all necessary files for handling notifications. For clarity and future maintainability, consider adding a comment explaining the purpose of these new files.
cedar/data-generation/data-upload.sh (1)
37-47: Import Loop for Notifications Data CSV Files
The newly added loop for importing notifications via CSV is well integrated with the rest of the script. For consistency and to avoid potential issues with filenames containing spaces, it’s recommended to wrap the $file variable in quotes, similar to what’s done in the follows import loop. For example:
-        $file
+        "$file"
This minor tweak enhances robustness without affecting functionality.
cedar/data-generation/README.md (1)
51-70: Documentation clearly explains the notifications generation process

The new section provides a clear explanation of the notifications generation process, following the same structure as the existing "Feeds Generation" section and effectively communicating the challenges and approach.

However, there's a minor grammatical error:
-Find all the accounts that follows the user's account
+Find all the accounts that follow the user's account
🧰 Tools

🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)
cedar/data-generation/generate-follows-notifications.js (1)
22-24: Consider adding validation for CSV files

The code filters for files that start with 'follows_' and end with '.csv', but doesn't verify if the files are readable or properly formatted before distribution to workers.

Consider adding basic validation:
 const followsFiles = fs
     .readdirSync(DATA_DIR)
-    .filter((file) => file.startsWith('follows_') && file.endsWith('.csv'));
+    .filter((file) => file.startsWith('follows_') && file.endsWith('.csv'))
+    .filter((file) => {
+        try {
+            // Check if file is readable
+            fs.accessSync(`${DATA_DIR}/${file}`, fs.constants.R_OK);
+            return true;
+        } catch (err) {
+            console.warn(`Skipping inaccessible file: ${file}`);
+            return false;
+        }
+    });
cedar/data-generation/generate-follows-notifications-worker.js (3)
10-23: Consider adding error handling for CSV parsing

The parseCSV function assumes well-formed CSV data. If the data has an incorrect format (e.g., inconsistent number of values per row), this could cause issues.
function parseCSV(data) {
    const lines = data.trim().split('\n');
    const headers = lines[0].split(',');

    return lines.slice(1).map((line) => {
        const values = line.split(',');
+       
+       // Skip malformed rows
+       if (values.length !== headers.length) {
+           console.warn(`Skipping malformed CSV line: ${line}`);
+           return null;
+       }

        return headers.reduce((object, header, index) => {
            object[header] = values[index];

            return object;
        }, {});
-    });
+    }).filter(Boolean);
}
28-35: Consider using streaming for large user data files

Loading the entire users.csv file synchronously with readFileSync could cause memory issues if the file is very large. Consider using a streaming approach similar to how you're handling the follows files.
-    const usersData = fs.readFileSync(
-        path.join(DATA_DIR, 'users.csv'),
-        'utf-8',
-    );
-
-    for (const user of parseCSV(usersData)) {
-        users[user.account_id] = user;
-    }

+    // Load users with stream processing
+    const usersReadStream = fs.createReadStream(path.join(DATA_DIR, 'users.csv'));
+    const usersRL = readline.createInterface({
+        input: usersReadStream,
+        crlfDelay: Number.POSITIVE_INFINITY,
+    });
+
+    let isFirstUserLine = true;
+    let headers;
+
+    for await (const line of usersRL) {
+        if (isFirstUserLine) {
+            headers = line.split(',');
+            isFirstUserLine = false;
+            continue;
+        }
+
+        const values = line.split(',');
+        const user = headers.reduce((obj, header, idx) => {
+            obj[header] = values[idx];
+            return obj;
+        }, {});
+
+        users[user.account_id] = user;
+    }
+
+    usersRL.close();
61-68: Add logging for skipped accounts without users

The code silently skips accounts that don't have an associated user. It would be helpful to add some logging to track how many accounts are being skipped.
const [followerId, followingId] = line.split(',');
const user = users[followingId];

// Not all accounts are associated with a user
if (!user) {
+   // Add a counter or logging for diagnostic purposes
+   // console.log(`No user found for account ID: ${followingId}`);
    continue;
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd10f1 and 0b188f1.

📒 Files selected for processing (11)

cedar/data-generation/Dockerfile (1 hunks)
cedar/data-generation/README.md (1 hunks)
cedar/data-generation/data-generate.sh (1 hunks)
cedar/data-generation/data-stats.sh (1 hunks)
cedar/data-generation/data-upload.sh (1 hunks)
cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
cedar/data-generation/generate-follows-notifications.js (1 hunks)
cedar/query-runner/queries/read-notifications.sql (1 hunks)
cedar/schema/indexes.sql (1 hunks)
cedar/schema/tables.sql (1 hunks)
cedar/upload-data.sh (1 hunks)

🧰 Additional context used

🧬 Code Definitions (2)

cedar/data-generation/generate-follows-notifications.js (1)

cedar/data-generation/generate-follows-notifications-worker.js (3) (3)

require (1:1)

require (6:6)

fs (2:2)

cedar/data-generation/generate-follows-notifications-worker.js (1)

cedar/data-generation/generate-follows-notifications.js (4) (4)

require (1:1)

require (6:6)

fs (3:3)

followsFiles (22:24)

🪛 LanguageTool

cedar/data-generation/README.md

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Build, Test and Deploy

🔇 Additional comments (13)

cedar/data-generation/data-generate.sh (1)

12-12: Integrate Notifications Data Generation
Adding the command node generate-follows-notifications.js integrates notifications data generation into the overall workflow. Ensure that the new script includes proper error handling and logs meaningful output in case of failures.

cedar/data-generation/data-stats.sh (1)

11-17: Extended Data Statistics for Notifications
Adding the line for counting notifications via the glob pattern ./data/follows_notifications_*.csv improves overall insight into the data generation process. Verify that the glob pattern correctly matches the intended files in different environments.

cedar/upload-data.sh (1)

35-45: Well-structured implementation for notifications import

The implementation follows the established pattern for importing feeds and follows data, correctly specifying the notifications table and the required columns (user_id, account_id, event_type).

cedar/query-runner/queries/read-notifications.sql (1)

1-28: Well-structured query with effective joins and pagination

The query is well-organized with:

Appropriate column selections with clear aliases

Proper LEFT JOIN for posts (allowing notifications without posts)

INNER JOIN for accounts (requiring an account for each notification)

Efficient filtering and pagination using internal_id as a cursor

Helpful comments explaining the filtering and ordering constraints

The query's structure will work well for performance testing notifications retrieval.

cedar/data-generation/generate-follows-notifications.js (4)

1-9: Good module imports and worker thread configuration

The file correctly imports necessary modules and configures the number of worker threads based on available CPU cores, which allows for efficient parallelization of the notification generation process.

19-44: Effective worker distribution strategy

The main function effectively:

Reads and filters follows files

Calculates an appropriate number of files per worker

Creates and configures worker threads with their portion of files

This approach ensures balanced distribution of work across available CPU cores.

51-78: Robust worker communication and error handling

The code effectively:

Uses Promise.all to wait for all workers to complete

Implements proper message handling for progress updates

Captures and propagates worker errors

Detects abnormal worker exits

This ensures reliable operation and proper error reporting during the multi-threaded processing.

92-99: Clean process exit handling

The main function is properly wrapped with promise handling that ensures the process exits with the appropriate status code (0 for success, 1 for error), and errors are properly logged.

cedar/schema/tables.sql (1)

70-80: Good schema design for notifications table

The notifications table has a well-designed schema with:

An auto-incrementing primary key

A timestamp field with appropriate default

Enumerated event types with comments explaining the values

Proper foreign key constraints with ON DELETE CASCADE

Appropriate NULL constraint for post_id since FOLLOW events won't reference posts

cedar/data-generation/generate-follows-notifications-worker.js (4)

1-7: Proper module imports and configuration

The worker thread imports and configuration look good. The code properly uses the Node.js worker threads API and filesystem modules.

69-76: Good use of batch processing

The implementation efficiently processes data in batches using a configurable chunk size, which is good for performance when dealing with large datasets.

84-87: Progress tracking implementation

Good implementation of progress tracking by sending messages to the parent thread after each file is processed.

49-49: CSV header consistency

The output CSV header matches the expected columns for the notifications table. Ensure this remains consistent with the database schema if any changes are made.

cedar/data-generation/generate-follows-notifications-worker.js

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (6)

cedar/data-generation/README.md (1)
51-70: Well-structured documentation for the new Notifications Generation process.

The section clearly explains the process and challenges of generating notifications from follows data, mirroring the structure of the existing Feeds Generation section, which provides consistency to readers.

Consider making these minor stylistic improvements:
-Finding all of the accounts that follow a user's account is difficult because
-we can't load all of the follow relationships into memory (as they are contained
+Finding all accounts that follow a user's account is difficult because
+we can't load all follow relationships into memory (as they are contained
🧰 Tools

🪛 LanguageTool

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)
cedar/query-runner/queries/read-notifications.sql (2)
1-31: Well-structured query for retrieving notifications with related data.

The query properly joins the notifications table with accounts and posts tables, applies appropriate filtering, and includes pagination.

I notice you're doing a LEFT JOIN to the reply_posts alias (line 21) but not selecting any fields from it. Consider either:

Adding fields from reply_posts to the SELECT statement:
  posts.title AS post_title,
  posts.content AS post_content,
  posts.type AS post_type
+ reply_posts.title AS reply_post_title,
+ reply_posts.content AS reply_post_content,
+ reply_posts.type AS reply_post_type
Or removing the join if it's not needed for this query:
LEFT JOIN
  posts on posts.internal_id = notifications.post_id
-LEFT JOIN
-  posts AS reply_posts on reply_posts.internal_id = notifications.reply_post_id
24-27: Consider parameterizing the user_id and cursor values.

The query currently uses hardcoded values (user_id = 2340, internal_id < 20) which appear to be for testing.

For production use, these should be parameterized. Depending on your query execution system, you might use:
-  notifications.user_id = 2340
+  notifications.user_id = :user_id
AND
-  notifications.internal_id < 20 -- the cursor for pagination
+  notifications.internal_id < :cursor -- the cursor for pagination
cedar/data-generation/generate-follows-notifications.js (2)
10-17: Consider enhancing the progress bar format.

The current progress bar only shows percentage completion.

Add more information to make monitoring more useful:
    {
        clearOnComplete: false,
        hideCursor: true,
-        format: '{bar} {percentage}%',
+        format: '{bar} {percentage}% | {value}/{total} files | ETA: {eta}s',
    },
92-99: Consider alternate exit strategy.

Using process.exit() can abruptly terminate the Node.js process, potentially interrupting other asynchronous operations.

For cleaner shutdown, consider allowing natural program completion:
main()
    .then(() => {
-        process.exit(0);
+        console.log('Process completed successfully');
    })
    .catch((error) => {
        console.error(error);
-        process.exit(1);
+        // The uncaught exception will naturally exit with non-zero code
    });
cedar/schema/tables.sql (1)
70-82: Well-designed notifications table with appropriate relationships.

The table structure captures the necessary data for different notification types and includes proper foreign key constraints.

Consider these schema improvements:

Use ENUM for event_type for better readability and data integrity:
-    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
+    event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,
Add a composite index to optimize filtering by user_id and event_type:
CREATE INDEX idx_notifications_user_event ON notifications (user_id, event_type);
Consider adding a unique constraint if duplicate notifications should be prevented:
UNIQUE KEY unique_notification (user_id, event_type, account_id, post_id, reply_post_id)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b188f1 and 817102a.

📒 Files selected for processing (11)

cedar/data-generation/Dockerfile (1 hunks)
cedar/data-generation/README.md (1 hunks)
cedar/data-generation/data-generate.sh (1 hunks)
cedar/data-generation/data-stats.sh (1 hunks)
cedar/data-generation/data-upload.sh (1 hunks)
cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
cedar/data-generation/generate-follows-notifications.js (1 hunks)
cedar/query-runner/queries/read-notifications.sql (1 hunks)
cedar/schema/indexes.sql (1 hunks)
cedar/schema/tables.sql (1 hunks)
cedar/upload-data.sh (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (7)

cedar/data-generation/data-generate.sh
cedar/schema/indexes.sql
cedar/data-generation/data-upload.sh
cedar/data-generation/Dockerfile
cedar/data-generation/data-stats.sh
cedar/upload-data.sh
cedar/data-generation/generate-follows-notifications-worker.js

🧰 Additional context used

🪛 LanguageTool

cedar/data-generation/README.md

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Build, Test and Deploy

🔇 Additional comments (2)

cedar/data-generation/generate-follows-notifications.js (2)

1-18: Good setup for multi-threaded processing with progress reporting.

The code properly imports necessary modules and configures the progress bar for visual feedback.

19-90: Well-structured main function with proper error handling.

The implementation handles file distribution across workers, tracks progress, and reports completion time effectively.

refs https://linear.app/ghost/issue/AP-897/[notifications-perf]-step-1-validate-performance-improvement-at Added data generation scripts for notifications and a query to read notifications

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (7)

cedar/args.sh (1)
1-5: Add a shebang and declare variables as exported.

The script is missing a shebang line and the variables aren't exported, which might cause issues when they're referenced from other scripts.
+#!/bin/bash
-INSTANCE_NAME="cedar-test-db-mb170325"
-TIER="db-f1-micro"
-REGION="us-central1"
-DB_NAME="activitypub"
-BUCKET_NAME="cedar-test-data-mb170325"
+export INSTANCE_NAME="cedar-test-db-mb170325"
+export TIER="db-f1-micro"
+export REGION="us-central1"
+export DB_NAME="activitypub"
+export BUCKET_NAME="cedar-test-data-mb170325"
🧰 Tools

🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 2-2: TIER appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 3-3: REGION appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).

(SC2034)
cedar/data-generation/README.md (2)
56-56: Fix verb agreement.

The verb should agree with the plural subject "accounts".
- Find all the accounts that follows the user's account
+ Find all the accounts that follow the user's account
🧰 Tools

🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

59-60: Consider more concise phrasing.

For better readability, consider removing "of" in these phrases.
- Finding all of the accounts that follow a user's account is difficult because
- we can't load all of the follow relationships into memory (as they are contained
+ Finding all the accounts that follow a user's account is difficult because
+ we can't load all the follow relationships into memory (as they are contained
🧰 Tools

🪛 LanguageTool

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)
cedar/data-generation/generate-follows-notifications.js (2)
19-90: Consider adding documentation for the main function.

Adding a JSDoc comment would help clarify the purpose and behavior of the main function.
+/**
+ * Main function to generate follow notifications.
+ * - Reads all follows CSV files
+ * - Distributes processing across multiple worker threads
+ * - Tracks progress and handles errors
+ * - Reports completion time
+ */
async function main() {
54-76: Add error details to rejection message.

When handling worker errors, it would be helpful to include the error details in the rejection.
-                    worker.on('error', reject);
+                    worker.on('error', (error) => {
+                        reject(new Error(`Worker error: ${error.message}`));
+                    });
cedar/schema/tables.sql (2)
70-82: Consider adding a composite index for filtering notifications.

Users will likely filter notifications by both user_id and event_type, so a composite index would be beneficial.
    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
+   INDEX idx_user_event (user_id, event_type)
);
72-73: Consider using an ENUM for event_type.

Using an ENUM for event_type would provide better type safety and readability than a TINYINT with comments.
-    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
+    event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 817102a and 8e082a7.

📒 Files selected for processing (13)

cedar/args.sh (1 hunks)
cedar/data-generation/Dockerfile (1 hunks)
cedar/data-generation/README.md (1 hunks)
cedar/data-generation/config.js (1 hunks)
cedar/data-generation/data-generate.sh (1 hunks)
cedar/data-generation/data-stats.sh (1 hunks)
cedar/data-generation/data-upload.sh (1 hunks)
cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
cedar/data-generation/generate-follows-notifications.js (1 hunks)
cedar/query-runner/queries/read-notifications.sql (1 hunks)
cedar/schema/indexes.sql (1 hunks)
cedar/schema/tables.sql (1 hunks)
cedar/upload-data.sh (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (8)

cedar/data-generation/data-generate.sh
cedar/schema/indexes.sql
cedar/data-generation/data-stats.sh
cedar/data-generation/Dockerfile
cedar/data-generation/data-upload.sh
cedar/upload-data.sh
cedar/query-runner/queries/read-notifications.sql
cedar/data-generation/generate-follows-notifications-worker.js

🧰 Additional context used

🪛 Shellcheck (0.10.0)

cedar/args.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 2-2: TIER appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 3-3: REGION appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).

(SC2034)

[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).

(SC2034)

🪛 LanguageTool

cedar/data-generation/README.md

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Build, Test and Deploy

🔇 Additional comments (4)

cedar/data-generation/config.js (1)

1-1: Scaling factor reduced by 100x for testing purposes.

The SCALING_FACTOR has been reduced from 1 to 0.01, which will significantly decrease the volume of test data:

NUM_SITES: 60,000 → 600

NUM_USERS: 300,000 → 3,000

NUM_ACCOUNTS: 2.5M → 25,000

NUM_POSTS: 3M → 30,000

This change will speed up notification performance testing by requiring less data generation time and resources, while still maintaining proper proportional relationships between data types.
cedar/args.sh (1)

1-1:
❓ Verification inconclusive

Personalized instance name might cause issues for other users.

The instance name appears to include a personal identifier (mb170325), which could cause naming conflicts or confusion if other developers use this script directly.

Consider whether this change should be committed or if it would be better to use environment variables or a gitignored configuration file for personalized settings.

🏁 Script executed:
#!/bin/bash
# Check if this is a personal setting that shouldn't be committed
git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+"
Length of output: 68
Action: Verify Instance Name Configuration

The file cedar/args.sh still sets INSTANCE_NAME to a personalized value (cedar-test-db-mb170325), which could be problematic for other users.

Consider externalizing this setting (e.g., via an environment variable or a gitignored configuration file) to avoid potential naming conflicts.

Note: The diff check using git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+" produced no output, so please manually verify that the personalized instance name is intentionally maintained (or adjust if not).

🧰 Tools

🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)
cedar/data-generation/README.md (1)

51-70: Documentation for Notifications Generation is well-structured.

The new section follows the same format as existing sections and clearly explains the notifications generation process.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)

[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

cedar/data-generation/generate-follows-notifications.js (1)

1-99: Well-structured multi-threaded notification generation script.

The implementation follows the same patterns as other generation scripts in the codebase, with proper parallelization, error handling, and progress reporting.

coderabbitai · 2025-03-20T08:16:29Z

cedar/schema/tables.sql

+CREATE TABLE notifications (
+    internal_id INT AUTO_INCREMENT PRIMARY KEY,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
+    user_id INT NOT NULL, -- The user the notification is for
+    account_id INT NOT NULL, -- The account that "did" the notification
+    post_id INT NULL, -- NULL for FOLLOW events
+    reply_post_id INT NULL, -- NULL for non-reply events
+    FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
+    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
+    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
+    FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE
+);


🛠️ Refactor suggestion

Add an index on the user_id column for notification queries.

Since notifications will likely be queried by user_id frequently, adding an index would improve query performance.

CREATE TABLE notifications ( internal_id INT AUTO_INCREMENT PRIMARY KEY, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW user_id INT NOT NULL, -- The user the notification is for account_id INT NOT NULL, -- The account that "did" the notification post_id INT NULL, -- NULL for FOLLOW events reply_post_id INT NULL, -- NULL for non-reply events FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE, FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE, FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE, FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE, + INDEX idx_user_id (user_id) );

Committable suggestion skipped: line range outside the PR's diff.

mike182uk temporarily deployed to build March 18, 2025 15:21 — with GitHub Actions Inactive

coderabbitai bot reviewed Mar 18, 2025

View reviewed changes

cedar/data-generation/generate-follows-notifications-worker.js Show resolved Hide resolved

mike182uk force-pushed the mike-ap-897-notifications-perf-step-1-validate-performance-improvement branch from 0b188f1 to 817102a Compare March 18, 2025 15:57

mike182uk temporarily deployed to build March 18, 2025 15:57 — with GitHub Actions Inactive

coderabbitai bot reviewed Mar 18, 2025

View reviewed changes

mike182uk added 2 commits March 20, 2025 08:10

Added notifications query perf testing

1f21ec7

refs https://linear.app/ghost/issue/AP-897/[notifications-perf]-step-1-validate-performance-improvement-at Added data generation scripts for notifications and a query to read notifications

TEMP_UNDO

8e082a7

mike182uk force-pushed the mike-ap-897-notifications-perf-step-1-validate-performance-improvement branch from 817102a to 8e082a7 Compare March 20, 2025 08:10

mike182uk temporarily deployed to build March 20, 2025 08:10 — with GitHub Actions Inactive

coderabbitai bot reviewed Mar 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added notifications query perf testing #405

Added notifications query perf testing #405

mike182uk commented Mar 18, 2025 •

edited

Loading

coderabbitai bot commented Mar 18, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot Mar 20, 2025

Added notifications query perf testing #405

Are you sure you want to change the base?

Added notifications query perf testing #405

Conversation

mike182uk commented Mar 18, 2025 • edited Loading

coderabbitai bot commented Mar 18, 2025 • edited Loading

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Mar 20, 2025

Choose a reason for hiding this comment

mike182uk commented Mar 18, 2025 •

edited

Loading

coderabbitai bot commented Mar 18, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)