Improve prevalence metrics by recording all scanned sites in scan db #89

Open
ghostwords opened this issue Mar 22, 2025 · 0 comments
Labels
enhancement New feature or request

ghostwords commented Mar 22, 2025

We should record every site we visit in the scan db, including sites with no trackers. If a visit gets redirected, we should record the final site domain that was actually scanned.

This will enable:

  • More meaningful tracker prevalence data
  • Greater scan visibility ("80% of visited sites contain tracking", top ten slowest sites to visit)
  • Listing of sites with no trackers
  • Improvements to scan site list quality

Note: some sites recorded as having no trackers will still have GA on them; it's just that PB didn't record tracking there for whatever reason.

This continues work started in 5211f67 and 4e4d5f2.

New scan db table idea:

CREATE TABLE scan_sites (
    scan_id INTEGER NOT NULL,
    initial_site_id INTEGER NOT NULL,
    final_site_id INTEGER NOT NULL,
    status_id INTEGER NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL,
    UNIQUE(scan_id, initial_site_id),
    FOREIGN KEY(scan_id) REFERENCES scan(id),
    FOREIGN KEY(initial_site_id) REFERENCES site(id),
    FOREIGN KEY(final_site_id) REFERENCES site(id),
    FOREIGN KEY(status_id) REFERENCES site_status(id)
);

site_statuses = {
    "success": 1,
    "timeout": 2,
    "error": 3,
    "antibot": 4,
}
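
To make the table idea concrete, here is a rough sketch of how a visit, including a redirected one, might be recorded. It assumes the scan db is SQLite. The `scan`, `site`, and `site_status` tables below are stand-ins: only their `id` columns come from the schema above, while the `fqdn` and `name` columns and the `open_db`, `get_site_id`, and `record_site_visit` helpers are hypothetical.

```python
import sqlite3

SITE_STATUSES = {"success": 1, "timeout": 2, "error": 3, "antibot": 4}

# Stand-in tables (columns other than id are guesses) plus the proposed table.
SCHEMA = """
CREATE TABLE scan (id INTEGER PRIMARY KEY);
CREATE TABLE site (id INTEGER PRIMARY KEY, fqdn TEXT UNIQUE NOT NULL);
CREATE TABLE site_status (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE scan_sites (
    scan_id INTEGER NOT NULL,
    initial_site_id INTEGER NOT NULL,
    final_site_id INTEGER NOT NULL,
    status_id INTEGER NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL,
    UNIQUE(scan_id, initial_site_id),
    FOREIGN KEY(scan_id) REFERENCES scan(id),
    FOREIGN KEY(initial_site_id) REFERENCES site(id),
    FOREIGN KEY(final_site_id) REFERENCES site(id),
    FOREIGN KEY(status_id) REFERENCES site_status(id)
);
"""


def open_db(path=":memory:"):
    """Create a throwaway database with the schema and the status lookup rows."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO site_status (id, name) VALUES (?, ?)",
        [(status_id, name) for name, status_id in SITE_STATUSES.items()],
    )
    return db


def get_site_id(db, domain):
    """Return the id for a domain, inserting the site row if it is new."""
    db.execute("INSERT OR IGNORE INTO site (fqdn) VALUES (?)", (domain,))
    return db.execute("SELECT id FROM site WHERE fqdn = ?", (domain,)).fetchone()[0]


def record_site_visit(db, scan_id, initial_domain, final_domain, status,
                      start_time, end_time):
    """Record one visited site, whether or not any tracking was seen there."""
    initial_id = get_site_id(db, initial_domain)
    final_id = get_site_id(db, final_domain)  # same row when there was no redirect
    db.execute(
        "INSERT INTO scan_sites (scan_id, initial_site_id, final_site_id,"
        " status_id, start_time, end_time) VALUES (?, ?, ?, ?, ?, ?)",
        (scan_id, initial_id, final_id, SITE_STATUSES[status], start_time, end_time),
    )
```

With `UNIQUE(scan_id, initial_site_id)`, recording the same starting site twice within one scan raises `sqlite3.IntegrityError`, so each scan holds at most one row per site from the scan list.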

This will probably require updating some of the queries in sql/.
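
The existing queries in sql/ aren't reproduced here, so the following is only a sketch of the kind of query the new table would make possible. It assumes SQLite and a hypothetical `tracking(scan_id, site_id)` table standing in for wherever tracking observations actually live. It also assumes those observations are keyed by the final (post-redirect) site. The `fqdn` column, the demo data, and all function names are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with made-up sites and timings."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE site (id INTEGER PRIMARY KEY, fqdn TEXT UNIQUE NOT NULL);
    CREATE TABLE scan_sites (
        scan_id INTEGER NOT NULL,
        initial_site_id INTEGER NOT NULL,
        final_site_id INTEGER NOT NULL,
        status_id INTEGER NOT NULL,
        start_time TIMESTAMP NOT NULL,
        end_time TIMESTAMP NOT NULL,
        UNIQUE(scan_id, initial_site_id)
    );
    -- Hypothetical stand-in for the real tracking data.
    CREATE TABLE tracking (scan_id INTEGER NOT NULL, site_id INTEGER NOT NULL);

    INSERT INTO site (id, fqdn) VALUES
        (1, 'a.example'), (2, 'b.example'), (3, 'c.example'), (4, 'd.example');
    INSERT INTO scan_sites VALUES
        (1, 1, 1, 1, '2025-03-22 10:00:00', '2025-03-22 10:00:02'),
        (1, 2, 2, 1, '2025-03-22 10:01:00', '2025-03-22 10:01:09'),
        (1, 3, 3, 1, '2025-03-22 10:02:00', '2025-03-22 10:02:04'),
        (1, 4, 4, 2, '2025-03-22 10:03:00', '2025-03-22 10:03:30');
    INSERT INTO tracking VALUES (1, 1), (1, 2);
    """)
    return db


def tracking_share(db, scan_id):
    """Fraction of successfully visited sites with at least one tracking row."""
    visited, with_tracking = db.execute("""
        SELECT COUNT(*), SUM(has_tracking) FROM (
            SELECT EXISTS (
                SELECT 1 FROM tracking t
                WHERE t.scan_id = ss.scan_id AND t.site_id = ss.final_site_id
            ) AS has_tracking
            FROM scan_sites ss
            WHERE ss.scan_id = ? AND ss.status_id = 1  -- 1 = "success"
        )
    """, (scan_id,)).fetchone()
    return with_tracking / visited if visited else None


def slowest_sites(db, scan_id, limit=10):
    """Successful visits ordered by duration in seconds, slowest first."""
    return db.execute("""
        SELECT s.fqdn,
               (julianday(ss.end_time) - julianday(ss.start_time)) * 86400.0
                   AS seconds
        FROM scan_sites ss
        JOIN site s ON s.id = ss.final_site_id
        WHERE ss.scan_id = ? AND ss.status_id = 1
        ORDER BY seconds DESC
        LIMIT ?
    """, (scan_id, limit)).fetchall()
```

The denominator here is every successfully visited site, which is what the current data cannot provide when sites without trackers go unrecorded. Limiting the slowest-sites list to successful visits is a choice made for this sketch; including timeouts would mostly surface sites that hit the time limit.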

@ghostwords ghostwords added the enhancement New feature or request label Mar 22, 2025