
Client-side pglite struggles with large (~5000 row) import #568

Open
lucasgelfond opened this issue Mar 12, 2025 · 6 comments

@lucasgelfond

Hi! Trying to push pglite to its limits a bit: I want to do hybrid search (embeddings + full-text) entirely client-side in the browser, on a ~5000-row table. The table has a 384-dim embedding column and an FTS index, and I have tested that this works. The real issue is wrangling initialization. The database lives in Supabase right now; I didn't want to bother with converting formats, so I just downloaded the full JSON and imported it row by row into pglite (takes ~2 mins), like this:

	for (const row of rows) {
		const { parent_issue_id, page_number, ocr_result, embedding, image_url } = row;
		if (parent_issue_id && page_number && ocr_result) {
			await db.query(
				`INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url) 
				VALUES ($1, $2, $3, $4, $5)`,
				[parent_issue_id, page_number, ocr_result, embedding, image_url]
			);
		}
	}

My thought was this would be a quick stop-gap/MacGyver-y solution: do this once, save the pglite-compatible tarball, and serve it (~50 MB) from a CDN for users, which would make actual search very, very fast. My issue is that, while I have no problem saving a tarball, I can't manage to quickly import it.

My import code (for testing) consists of a file input and then some logic to instantiate a new db:

		const newDb = new PGlite('idb://supa-semantic-search', {
			loadDataDir: dumpFile,
			extensions: {
				vector,
				uuid_ossp
			},
			debug: 5
		});

The file seems to go through, etc.; the problem is that it hangs for several minutes (I have never gotten to completion!) on await newDb.waitReady;. It sounded like /dev/blob was another option, but it's less immediately suitable for me because I'm using two tables (it's for a magazine: page and issue). That also proved more difficult than anticipated to get working. I guess one way I could do this would be to lock clients out for ~2 minutes and do that really slow manual loading, but that also seems quite sub-ideal. I followed the https://pglite.dev/examples/dump-data-dir example pretty much to a tee and still had no luck. Happy to send over code and data to reproduce if helpful; there may also be some better way to do this?
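
For context, the dump side is essentially the linked example; roughly what I'm doing to produce the tarball (a sketch: the upload URL is a placeholder, and the extension import paths are from my setup):

	// One-time export: produce the tarball to serve from a CDN.
	import { PGlite } from '@electric-sql/pglite';
	import { vector } from '@electric-sql/pglite/vector';
	import { uuid_ossp } from '@electric-sql/pglite/contrib/uuid_ossp';

	const sourceDb = new PGlite('idb://supa-semantic-search', {
		extensions: { vector, uuid_ossp }
	});
	await sourceDb.waitReady;

	// dumpDataDir returns a File/Blob of the whole data directory.
	const dumpFile = await sourceDb.dumpDataDir();

	// Placeholder upload step: push the tarball to wherever the CDN serves it from.
	await fetch('https://example-cdn.invalid/pglite-dump.tar.gz', {
		method: 'PUT',
		body: dumpFile
	});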

@samwillis
Collaborator

Hey @lucasgelfond

As you have spotted, the best way to solve this import is to use the COPY command with /dev/blob, constructing a CSV for Postgres to import from. You will find that significantly faster. Would two CSVs not work for you for the two tables?

Alternatively, batching the inserts with multiple records per INSERT, i.e. VALUES ($1, $2, $3, $4, $5), ($6, $7, $8, $9, $10), ..., can also help a lot, as can wrapping everything in a transaction, or importing into a temp table and then copying across to the final table.
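
Roughly, the /dev/blob route looks like this (a sketch, assuming a pageCsvBlob whose columns match the page table):

	// Import: stream a CSV Blob in through the /dev/blob virtual device.
	await db.query("COPY page FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: pageCsvBlob // a Blob/File containing the CSV
	});

	// Export works the same way in reverse: the result object carries a blob.
	const ret = await db.query("COPY page TO '/dev/blob' WITH (FORMAT csv, HEADER true);");
	const csvBlob = ret.blob;

And a sketch of the batched-insert variant, wrapped in a transaction (the batch size is arbitrary):

	await db.transaction(async (tx) => {
		const batchSize = 500;
		for (let i = 0; i < rows.length; i += batchSize) {
			const batch = rows.slice(i, i + batchSize);
			// Build ($1, $2, $3, $4, $5), ($6, $7, $8, $9, $10), ... for this batch.
			const placeholders = batch
				.map((_, j) => `($${j * 5 + 1}, $${j * 5 + 2}, $${j * 5 + 3}, $${j * 5 + 4}, $${j * 5 + 5})`)
				.join(', ');
			const params = batch.flatMap((r) => [
				r.parent_issue_id, r.page_number, r.ocr_result, r.embedding, r.image_url
			]);
			await tx.query(
				`INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url)
				 VALUES ${placeholders}`,
				params
			);
		}
	});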

@treetopdevs

@samwillis Is there an example somewhere of using COPY and a CSV to import?

@samwillis
Collaborator

@treetopdevs we have an example here where we copy to a CSV and back again: https://github.com/electric-sql/pglite/blob/main/packages/pglite/examples/copy.html

@lucasgelfond
Author

lucasgelfond commented Mar 15, 2025

@samwillis thank you!

Still some issues.

I got the CSV import to work (after dumping directly with pg_dump; Supabase's CSV encoding broke!), but I need to do it in a temp table first. I benchmarked it, and it works alright: the CSV import itself is pretty quick (~4s), but it takes forever to copy the table over and convert to the correct types (~31s).
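
For reference, the temp-table version looks roughly like this (a sketch: the staging columns are all text, and the casts are just illustrative of my schema):

	await db.exec(`
		CREATE TEMP TABLE page_staging (
			parent_issue_id text,
			page_number text,
			ocr_result text,
			embedding text,
			image_url text
		);
	`);

	// Fast part (~4s): load the CSV into the all-text staging table via /dev/blob.
	await db.query("COPY page_staging FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: csvFile
	});

	// Slow part (~31s): cast into the real column types in the final table.
	await db.exec(`
		INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url)
		SELECT parent_issue_id::uuid, page_number::int, ocr_result, embedding::vector(384), image_url
		FROM page_staging;
	`);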

When I try to do this without a temp table I always get errors, most commonly error: extra data after last expected column, or errors about the schema, e.g. error: column "error" of relation "page" does not exist (there is an error column in my database!).

I figure a good workaround here is to do the (slow) CSV-import-with-temp-table once, and then load from a tarball. I did dumpDataDir, the database is ~50 MB, and the load succeeds, but it is missing a ton of my data. I have two tables, issue (~150 rows) and page (~4500 rows, and much larger); the tarball load is fast (~1s) but only brings in the issue table. That said, it recognizes the page schema, and when I dumpDataDir this again, it's still ~50 MB. What's going on?

Thanks again for your help!

@lucasgelfond
Author

Happy to send code and/or data if you want to repro; I think this probably is actually a bug. I'm mostly working in https://github.com/lucasgelfond/semantic-whole-earth/blob/main/frontend/src/lib/setup.ts, although this may change over the next few hours. I can also try to more minimally repro any of these individual issues if helpful!

@lucasgelfond
Author

lucasgelfond commented Mar 15, 2025

Somewhat promising: if I do the CSV import but only import the page (~4500-row) table w/ a temp table, dump it, and then reimport it, this does work and only takes about ~2s. My stopgap will probably be doing this and then manually importing the row CSV (because it will be much faster), but obviously I would rather have one tarball import work (it seems like a bug that only one table is getting imported?).
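
For anyone following along, the client boot path for this stopgap looks roughly like this (a sketch: the URLs are placeholders, and I'm assuming the small issue table is the one loaded from CSV):

	const [tarball, issueCsv] = await Promise.all([
		fetch('/data/page-dump.tar.gz').then((r) => r.blob()),
		fetch('/data/issue.csv').then((r) => r.blob())
	]);

	const db = new PGlite('idb://supa-semantic-search', {
		loadDataDir: tarball, // page-only dump, reimports in ~2s
		extensions: { vector, uuid_ossp }
	});
	await db.waitReady;

	// The remaining small table comes in via COPY from a CSV Blob.
	await db.query("COPY issue FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: issueCsv
	});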
