Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added notifications query perf testing #405

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mike182uk
Copy link
Member

@mike182uk mike182uk commented Mar 18, 2025

refs https://linear.app/ghost/issue/AP-897

Added data generation scripts for notifications and a query to read notifications

Copy link

coderabbitai bot commented Mar 18, 2025

Walkthrough

The changes introduce a notifications system to manage follow event data. A new database table named notifications has been created, along with an index on the user_id column to facilitate efficient retrieval of notifications. Two new JavaScript files, generate-follows-notifications.js and generate-follows-notifications-worker.js, implement multi-threaded logic for generating notifications from CSV files. Modifications to existing shell scripts enhance the overall data generation process by integrating the generation and import of notifications data. Additionally, updates to the output formatting in the data-stats.sh script improve readability and include counts for notifications. A new SQL query has been added to retrieve detailed notification information, and the README documentation has been updated to provide instructions on processing notifications data. These changes collectively support the generation, import, and querying of notifications within the data processing workflow.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
cedar/schema/tables.sql (1)

69-81: ⚠️ Potential issue

Missing index on user_id column

While the table structure looks good, the AI summary mentioned an index named idx_notifications_user_id on the user_id column, but it's not present in this code. Since querying notifications by user will be a common operation, this index is important for performance.

Add the following index after the table creation:

CREATE TABLE notifications (
    internal_id INT AUTO_INCREMENT PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
    user_id INT NOT NULL, -- The user the notification is for
    account_id INT NOT NULL, -- The account that "did" the notification
    post_id INT NULL, -- NULL for FOLLOW events
    FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE
);
+
+CREATE INDEX idx_notifications_user_id ON notifications(user_id);
🧹 Nitpick comments (8)
cedar/schema/indexes.sql (1)

30-33: New Notification Index Added
The addition of the idx_notifications_user_id index on the notifications table is a solid step toward improving lookup performance for queries filtering by user_id. Please ensure that any future queries filtering by additional columns (e.g., event_type) consider whether a composite index might yield further benefits.

cedar/data-generation/Dockerfile (1)

16-17: Include New Notification Generation Scripts in Docker Build
The new COPY instructions for generate-follows-notifications.js and generate-follows-notifications-worker.js ensure that the Docker image contains all necessary files for handling notifications. For clarity and future maintainability, consider adding a comment explaining the purpose of these new files.

cedar/data-generation/data-upload.sh (1)

37-47: Import Loop for Notifications Data CSV Files
The newly added loop for importing notifications via CSV is well integrated with the rest of the script. For consistency and to avoid potential issues with filenames containing spaces, it’s recommended to wrap the $file variable in quotes, similar to what’s done in the follows import loop. For example:

-        $file
+        "$file"

This minor tweak enhances robustness without affecting functionality.

cedar/data-generation/README.md (1)

51-70: Documentation clearly explains the notifications generation process

The new section provides a clear explanation of the notifications generation process, following the same structure as the existing "Feeds Generation" section and effectively communicating the challenges and approach.

However, there's a minor grammatical error:

-Find all the accounts that follows the user's account
+Find all the accounts that follow the user's account
🧰 Tools
🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

cedar/data-generation/generate-follows-notifications.js (1)

22-24: Consider adding validation for CSV files

The code filters for files that start with 'follows_' and end with '.csv', but doesn't verify if the files are readable or properly formatted before distribution to workers.

Consider adding basic validation:

 const followsFiles = fs
     .readdirSync(DATA_DIR)
-    .filter((file) => file.startsWith('follows_') && file.endsWith('.csv'));
+    .filter((file) => file.startsWith('follows_') && file.endsWith('.csv'))
+    .filter((file) => {
+        try {
+            // Check if file is readable
+            fs.accessSync(`${DATA_DIR}/${file}`, fs.constants.R_OK);
+            return true;
+        } catch (err) {
+            console.warn(`Skipping inaccessible file: ${file}`);
+            return false;
+        }
+    });
cedar/data-generation/generate-follows-notifications-worker.js (3)

10-23: Consider adding error handling for CSV parsing

The parseCSV function assumes well-formed CSV data. If the data has an incorrect format (e.g., inconsistent number of values per row), this could cause issues.

function parseCSV(data) {
    const lines = data.trim().split('\n');
    const headers = lines[0].split(',');

    return lines.slice(1).map((line) => {
        const values = line.split(',');
+       
+       // Skip malformed rows
+       if (values.length !== headers.length) {
+           console.warn(`Skipping malformed CSV line: ${line}`);
+           return null;
+       }

        return headers.reduce((object, header, index) => {
            object[header] = values[index];

            return object;
        }, {});
-    });
+    }).filter(Boolean);
}

28-35: Consider using streaming for large user data files

Loading the entire users.csv file synchronously with readFileSync could cause memory issues if the file is very large. Consider using a streaming approach similar to how you're handling the follows files.

-    const usersData = fs.readFileSync(
-        path.join(DATA_DIR, 'users.csv'),
-        'utf-8',
-    );
-
-    for (const user of parseCSV(usersData)) {
-        users[user.account_id] = user;
-    }

+    // Load users with stream processing
+    const usersReadStream = fs.createReadStream(path.join(DATA_DIR, 'users.csv'));
+    const usersRL = readline.createInterface({
+        input: usersReadStream,
+        crlfDelay: Number.POSITIVE_INFINITY,
+    });
+
+    let isFirstUserLine = true;
+    let headers;
+
+    for await (const line of usersRL) {
+        if (isFirstUserLine) {
+            headers = line.split(',');
+            isFirstUserLine = false;
+            continue;
+        }
+
+        const values = line.split(',');
+        const user = headers.reduce((obj, header, idx) => {
+            obj[header] = values[idx];
+            return obj;
+        }, {});
+
+        users[user.account_id] = user;
+    }
+
+    usersRL.close();

61-68: Add logging for skipped accounts without users

The code silently skips accounts that don't have an associated user. It would be helpful to add some logging to track how many accounts are being skipped.

const [followerId, followingId] = line.split(',');
const user = users[followingId];

// Not all accounts are associated with a user
if (!user) {
+   // Add a counter or logging for diagnostic purposes
+   // console.log(`No user found for account ID: ${followingId}`);
    continue;
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd10f1 and 0b188f1.

📒 Files selected for processing (11)
  • cedar/data-generation/Dockerfile (1 hunks)
  • cedar/data-generation/README.md (1 hunks)
  • cedar/data-generation/data-generate.sh (1 hunks)
  • cedar/data-generation/data-stats.sh (1 hunks)
  • cedar/data-generation/data-upload.sh (1 hunks)
  • cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
  • cedar/data-generation/generate-follows-notifications.js (1 hunks)
  • cedar/query-runner/queries/read-notifications.sql (1 hunks)
  • cedar/schema/indexes.sql (1 hunks)
  • cedar/schema/tables.sql (1 hunks)
  • cedar/upload-data.sh (1 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
cedar/data-generation/generate-follows-notifications.js (1)
cedar/data-generation/generate-follows-notifications-worker.js (3) (3)
  • require (1:1)
  • require (6:6)
  • fs (2:2)
cedar/data-generation/generate-follows-notifications-worker.js (1)
cedar/data-generation/generate-follows-notifications.js (4) (4)
  • require (1:1)
  • require (6:6)
  • fs (3:3)
  • followsFiles (22:24)
🪛 LanguageTool
cedar/data-generation/README.md

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build, Test and Deploy
🔇 Additional comments (13)
cedar/data-generation/data-generate.sh (1)

12-12: Integrate Notifications Data Generation
Adding the command node generate-follows-notifications.js integrates notifications data generation into the overall workflow. Ensure that the new script includes proper error handling and logs meaningful output in case of failures.

cedar/data-generation/data-stats.sh (1)

11-17: Extended Data Statistics for Notifications
Adding the line for counting notifications via the glob pattern ./data/follows_notifications_*.csv improves overall insight into the data generation process. Verify that the glob pattern correctly matches the intended files in different environments.

cedar/upload-data.sh (1)

35-45: Well-structured implementation for notifications import

The implementation follows the established pattern for importing feeds and follows data, correctly specifying the notifications table and the required columns (user_id, account_id, event_type).

cedar/query-runner/queries/read-notifications.sql (1)

1-28: Well-structured query with effective joins and pagination

The query is well-organized with:

  • Appropriate column selections with clear aliases
  • Proper LEFT JOIN for posts (allowing notifications without posts)
  • INNER JOIN for accounts (requiring an account for each notification)
  • Efficient filtering and pagination using internal_id as a cursor
  • Helpful comments explaining the filtering and ordering constraints

The query's structure will work well for performance testing notifications retrieval.

cedar/data-generation/generate-follows-notifications.js (4)

1-9: Good module imports and worker thread configuration

The file correctly imports necessary modules and configures the number of worker threads based on available CPU cores, which allows for efficient parallelization of the notification generation process.


19-44: Effective worker distribution strategy

The main function effectively:

  1. Reads and filters follows files
  2. Calculates an appropriate number of files per worker
  3. Creates and configures worker threads with their portion of files

This approach ensures balanced distribution of work across available CPU cores.


51-78: Robust worker communication and error handling

The code effectively:

  1. Uses Promise.all to wait for all workers to complete
  2. Implements proper message handling for progress updates
  3. Captures and propagates worker errors
  4. Detects abnormal worker exits

This ensures reliable operation and proper error reporting during the multi-threaded processing.


92-99: Clean process exit handling

The main function is properly wrapped with promise handling that ensures the process exits with the appropriate status code (0 for success, 1 for error), and errors are properly logged.

cedar/schema/tables.sql (1)

70-80: Good schema design for notifications table

The notifications table has a well-designed schema with:

  • An auto-incrementing primary key
  • A timestamp field with appropriate default
  • Enumerated event types with comments explaining the values
  • Proper foreign key constraints with ON DELETE CASCADE
  • Appropriate NULL constraint for post_id since FOLLOW events won't reference posts
cedar/data-generation/generate-follows-notifications-worker.js (4)

1-7: Proper module imports and configuration

The worker thread imports and configuration look good. The code properly uses the Node.js worker threads API and filesystem modules.


69-76: Good use of batch processing

The implementation efficiently processes data in batches using a configurable chunk size, which is good for performance when dealing with large datasets.


84-87: Progress tracking implementation

Good implementation of progress tracking by sending messages to the parent thread after each file is processed.


49-49: CSV header consistency

The output CSV header matches the expected columns for the notifications table. Ensure this remains consistent with the database schema if any changes are made.

@mike182uk mike182uk force-pushed the mike-ap-897-notifications-perf-step-1-validate-performance-improvement branch from 0b188f1 to 817102a Compare March 18, 2025 15:57
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
cedar/data-generation/README.md (1)

51-70: Well-structured documentation for the new Notifications Generation process.

The section clearly explains the process and challenges of generating notifications from follows data, mirroring the structure of the existing Feeds Generation section, which provides consistency to readers.

Consider making these minor stylistic improvements:

-Finding all of the accounts that follow a user's account is difficult because
-we can't load all of the follow relationships into memory (as they are contained
+Finding all accounts that follow a user's account is difficult because
+we can't load all follow relationships into memory (as they are contained
🧰 Tools
🪛 LanguageTool

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

cedar/query-runner/queries/read-notifications.sql (2)

1-31: Well-structured query for retrieving notifications with related data.

The query properly joins the notifications table with accounts and posts tables, applies appropriate filtering, and includes pagination.

I notice you're doing a LEFT JOIN to the reply_posts alias (line 21) but not selecting any fields from it. Consider either:

  1. Adding fields from reply_posts to the SELECT statement:
  posts.title AS post_title,
  posts.content AS post_content,
  posts.type AS post_type
+ reply_posts.title AS reply_post_title,
+ reply_posts.content AS reply_post_content,
+ reply_posts.type AS reply_post_type
  1. Or removing the join if it's not needed for this query:
LEFT JOIN
  posts on posts.internal_id = notifications.post_id
-LEFT JOIN
-  posts AS reply_posts on reply_posts.internal_id = notifications.reply_post_id

24-27: Consider parameterizing the user_id and cursor values.

The query currently uses hardcoded values (user_id = 2340, internal_id < 20) which appear to be for testing.

For production use, these should be parameterized. Depending on your query execution system, you might use:

-  notifications.user_id = 2340
+  notifications.user_id = :user_id
AND
-  notifications.internal_id < 20 -- the cursor for pagination
+  notifications.internal_id < :cursor -- the cursor for pagination
cedar/data-generation/generate-follows-notifications.js (2)

10-17: Consider enhancing the progress bar format.

The current progress bar only shows percentage completion.

Add more information to make monitoring more useful:

    {
        clearOnComplete: false,
        hideCursor: true,
-        format: '{bar} {percentage}%',
+        format: '{bar} {percentage}% | {value}/{total} files | ETA: {eta}s',
    },

92-99: Consider alternate exit strategy.

Using process.exit() can abruptly terminate the Node.js process, potentially interrupting other asynchronous operations.

For cleaner shutdown, consider allowing natural program completion:

main()
    .then(() => {
-        process.exit(0);
+        console.log('Process completed successfully');
    })
    .catch((error) => {
        console.error(error);
-        process.exit(1);
+        // The uncaught exception will naturally exit with non-zero code
    });
cedar/schema/tables.sql (1)

70-82: Well-designed notifications table with appropriate relationships.

The table structure captures the necessary data for different notification types and includes proper foreign key constraints.

Consider these schema improvements:

  1. Use ENUM for event_type for better readability and data integrity:
-    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
+    event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,
  1. Add a composite index to optimize filtering by user_id and event_type:
CREATE INDEX idx_notifications_user_event ON notifications (user_id, event_type);
  1. Consider adding a unique constraint if duplicate notifications should be prevented:
UNIQUE KEY unique_notification (user_id, event_type, account_id, post_id, reply_post_id)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b188f1 and 817102a.

📒 Files selected for processing (11)
  • cedar/data-generation/Dockerfile (1 hunks)
  • cedar/data-generation/README.md (1 hunks)
  • cedar/data-generation/data-generate.sh (1 hunks)
  • cedar/data-generation/data-stats.sh (1 hunks)
  • cedar/data-generation/data-upload.sh (1 hunks)
  • cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
  • cedar/data-generation/generate-follows-notifications.js (1 hunks)
  • cedar/query-runner/queries/read-notifications.sql (1 hunks)
  • cedar/schema/indexes.sql (1 hunks)
  • cedar/schema/tables.sql (1 hunks)
  • cedar/upload-data.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cedar/data-generation/data-generate.sh
  • cedar/schema/indexes.sql
  • cedar/data-generation/data-upload.sh
  • cedar/data-generation/Dockerfile
  • cedar/data-generation/data-stats.sh
  • cedar/upload-data.sh
  • cedar/data-generation/generate-follows-notifications-worker.js
🧰 Additional context used
🪛 LanguageTool
cedar/data-generation/README.md

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build, Test and Deploy
🔇 Additional comments (2)
cedar/data-generation/generate-follows-notifications.js (2)

1-18: Good setup for multi-threaded processing with progress reporting.

The code properly imports necessary modules and configures the progress bar for visual feedback.


19-90: Well-structured main function with proper error handling.

The implementation handles file distribution across workers, tracks progress, and reports completion time effectively.

@mike182uk mike182uk force-pushed the mike-ap-897-notifications-perf-step-1-validate-performance-improvement branch from 817102a to 8e082a7 Compare March 20, 2025 08:10
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
cedar/args.sh (1)

1-5: Add a shebang and declare variables as exported.

The script is missing a shebang line and the variables aren't exported, which might cause issues when they're referenced from other scripts.

+#!/bin/bash
-INSTANCE_NAME="cedar-test-db-mb170325"
-TIER="db-f1-micro"
-REGION="us-central1"
-DB_NAME="activitypub"
-BUCKET_NAME="cedar-test-data-mb170325"
+export INSTANCE_NAME="cedar-test-db-mb170325"
+export TIER="db-f1-micro"
+export REGION="us-central1"
+export DB_NAME="activitypub"
+export BUCKET_NAME="cedar-test-data-mb170325"
🧰 Tools
🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 2-2: TIER appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 3-3: REGION appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).

(SC2034)

cedar/data-generation/README.md (2)

56-56: Fix verb agreement.

The verb should agree with the plural subject "accounts".

- Find all the accounts that follows the user's account
+ Find all the accounts that follow the user's account
🧰 Tools
🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


59-60: Consider more concise phrasing.

For better readability, consider removing "of" in these phrases.

- Finding all of the accounts that follow a user's account is difficult because
- we can't load all of the follow relationships into memory (as they are contained
+ Finding all the accounts that follow a user's account is difficult because
+ we can't load all the follow relationships into memory (as they are contained
🧰 Tools
🪛 LanguageTool

[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

cedar/data-generation/generate-follows-notifications.js (2)

19-90: Consider adding documentation for the main function.

Adding a JSDoc comment would help clarify the purpose and behavior of the main function.

+/**
+ * Main function to generate follow notifications.
+ * - Reads all follows CSV files
+ * - Distributes processing across multiple worker threads
+ * - Tracks progress and handles errors
+ * - Reports completion time
+ */
async function main() {

54-76: Add error details to rejection message.

When handling worker errors, it would be helpful to include the error details in the rejection.

-                    worker.on('error', reject);
+                    worker.on('error', (error) => {
+                        reject(new Error(`Worker error: ${error.message}`));
+                    });
cedar/schema/tables.sql (2)

70-82: Consider adding a composite index for filtering notifications.

Users will likely filter notifications by both user_id and event_type, so a composite index would be beneficial.

    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
+   INDEX idx_user_event (user_id, event_type)
);

72-73: Consider using an ENUM for event_type.

Using an ENUM for event_type would provide better type safety and readability than a TINYINT with comments.

-    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
+    event_type ENUM('LIKE', 'REPOST', 'REPLY', 'FOLLOW') NOT NULL,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 817102a and 8e082a7.

📒 Files selected for processing (13)
  • cedar/args.sh (1 hunks)
  • cedar/data-generation/Dockerfile (1 hunks)
  • cedar/data-generation/README.md (1 hunks)
  • cedar/data-generation/config.js (1 hunks)
  • cedar/data-generation/data-generate.sh (1 hunks)
  • cedar/data-generation/data-stats.sh (1 hunks)
  • cedar/data-generation/data-upload.sh (1 hunks)
  • cedar/data-generation/generate-follows-notifications-worker.js (1 hunks)
  • cedar/data-generation/generate-follows-notifications.js (1 hunks)
  • cedar/query-runner/queries/read-notifications.sql (1 hunks)
  • cedar/schema/indexes.sql (1 hunks)
  • cedar/schema/tables.sql (1 hunks)
  • cedar/upload-data.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • cedar/data-generation/data-generate.sh
  • cedar/schema/indexes.sql
  • cedar/data-generation/data-stats.sh
  • cedar/data-generation/Dockerfile
  • cedar/data-generation/data-upload.sh
  • cedar/upload-data.sh
  • cedar/query-runner/queries/read-notifications.sql
  • cedar/data-generation/generate-follows-notifications-worker.js
🧰 Additional context used
🪛 Shellcheck (0.10.0)
cedar/args.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 2-2: TIER appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 3-3: REGION appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 4-4: DB_NAME appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 5-5: BUCKET_NAME appears unused. Verify use (or export if used externally).

(SC2034)

🪛 LanguageTool
cedar/data-generation/README.md

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build, Test and Deploy
🔇 Additional comments (4)
cedar/data-generation/config.js (1)

1-1: Scaling factor reduced by 100x for testing purposes.

The SCALING_FACTOR has been reduced from 1 to 0.01, which will significantly decrease the volume of test data:

  • NUM_SITES: 60,000 → 600
  • NUM_USERS: 300,000 → 3,000
  • NUM_ACCOUNTS: 2.5M → 25,000
  • NUM_POSTS: 3M → 30,000

This change will speed up notification performance testing by requiring less data generation time and resources, while still maintaining proper proportional relationships between data types.

cedar/args.sh (1)

1-1:

❓ Verification inconclusive

Personalized instance name might cause issues for other users.

The instance name appears to include a personal identifier (mb170325), which could cause naming conflicts or confusion if other developers use this script directly.

Consider whether this change should be committed or if it would be better to use environment variables or a gitignored configuration file for personalized settings.


🏁 Script executed:

#!/bin/bash
# Check if this is a personal setting that shouldn't be committed
git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+"

Length of output: 68


Action: Verify Instance Name Configuration

  • The file cedar/args.sh still sets INSTANCE_NAME to a personalized value (cedar-test-db-mb170325), which could be problematic for other users.
  • Consider externalizing this setting (e.g., via an environment variable or a gitignored configuration file) to avoid potential naming conflicts.
  • Note: The diff check using git diff --cached cedar/args.sh | grep -E "cedar-test-db-mb[0-9]+" produced no output, so please manually verify that the personalized instance name is intentionally maintained (or adjust if not).
🧰 Tools
🪛 Shellcheck (0.10.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: INSTANCE_NAME appears unused. Verify use (or export if used externally).

(SC2034)

cedar/data-generation/README.md (1)

51-70: Documentation for Notifications Generation is well-structured.

The new section follows the same format as existing sections and clearly explains the notifications generation process.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~56-~56: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ith a user - Find all the accounts that follows the user's account - Insert into the no...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[style] ~59-~59: Consider removing “of” to be more concise
Context: ... each of the resolved accounts Finding all of the accounts that follow a user's account i...

(ALL_OF_THE)


[style] ~60-~60: Consider removing “of” to be more concise
Context: ...ount is difficult because we can't load all of the follow relationships into memory (as th...

(ALL_OF_THE)

cedar/data-generation/generate-follows-notifications.js (1)

1-99: Well-structured multi-threaded notification generation script.

The implementation follows the same patterns as other generation scripts in the codebase, with proper parallelization, error handling, and progress reporting.

Comment on lines +70 to +82
CREATE TABLE notifications (
internal_id INT AUTO_INCREMENT PRIMARY KEY,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
user_id INT NOT NULL, -- The user the notification is for
account_id INT NOT NULL, -- The account that "did" the notification
post_id INT NULL, -- NULL for FOLLOW events
reply_post_id INT NULL, -- NULL for non-reply events
FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add an index on the user_id column for notification queries.

Since notifications will likely be queried by user_id frequently, adding an index would improve query performance.

CREATE TABLE notifications (
    internal_id INT AUTO_INCREMENT PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    event_type TINYINT UNSIGNED NOT NULL, -- 1=LIKE, 2=REPOST, 3=REPLY, 4=FOLLOW
    user_id INT NOT NULL, -- The user the notification is for
    account_id INT NOT NULL, -- The account that "did" the notification
    post_id INT NULL, -- NULL for FOLLOW events
    reply_post_id INT NULL, -- NULL for non-reply events
    FOREIGN KEY (user_id) REFERENCES users(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (account_id) REFERENCES accounts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
    FOREIGN KEY (reply_post_id) REFERENCES posts(internal_id) ON DELETE CASCADE,
+   INDEX idx_user_id (user_id)
);

Committable suggestion skipped: line range outside the PR's diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant