
Client-side pglite struggles with large (~5000 row) import #568

Open
lucasgelfond opened this issue Mar 12, 2025 · 6 comments

@lucasgelfond

Hi! Trying to push pglite to its limits a bit: I want to do hybrid search (embeddings + full-text) entirely client-side in the browser, on a ~5000-row table. The table has a 384-dim embedding column and an FTS index, and I have tested that this works. The real issue is wrangling initialization. The database lives in Supabase right now; I didn't want to bother with converting formats, so I just downloaded the full JSON and imported it row by row into pglite (takes ~2 mins), like this:

	for (const row of rows) {
		const { parent_issue_id, page_number, ocr_result, embedding, image_url } = row;
		if (parent_issue_id && page_number && ocr_result) {
			await db.query(
				`INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url) 
				VALUES ($1, $2, $3, $4, $5)`,
				[parent_issue_id, page_number, ocr_result, embedding, image_url]
			);
		}
	}

My thought was this would be a quick stop-gap/MacGyver-y solution: do this once, save the pglite-compatible tarball, and serve it (~50 MB) from a CDN for users, which would make actual search very, very fast. My issue is that, while I have no problem saving a tarball, I can't manage to quickly import it.

My import code (for testing) consists of a file input and then some logic to instantiate a new db:

		const newDb = new PGlite('idb://supa-semantic-search', {
			loadDataDir: dumpFile,
			extensions: {
				vector,
				uuid_ossp
			},
			debug: 5
		});

The file seems to go through, etc.; the problem is that it hangs for several minutes (I have never gotten to completion!) on await newDb.waitReady;. It sounded like /dev/blob was another option, but it's less immediately suitable for me because I'm using two tables (it's for a magazine: page and issue). That also proved more difficult than anticipated to get working. I guess one way I could do this would be to lock clients out for ~2 minutes and do that really slow manual loading, but that also seems quite sub-ideal. I followed the https://pglite.dev/examples/dump-data-dir example pretty much to a tee and still had no luck. Happy to send over code and data to reproduce if helpful; there may also be some better way to do this?
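
For context, the dump side is essentially the linked example; roughly what I'm doing to produce the tarball (a sketch: the upload URL is a placeholder, and the extension import paths are from my setup):

	// One-time export: produce the tarball to serve from a CDN.
	import { PGlite } from '@electric-sql/pglite';
	import { vector } from '@electric-sql/pglite/vector';
	import { uuid_ossp } from '@electric-sql/pglite/contrib/uuid_ossp';

	const sourceDb = new PGlite('idb://supa-semantic-search', {
		extensions: { vector, uuid_ossp }
	});
	await sourceDb.waitReady;

	// dumpDataDir returns a File/Blob of the whole data directory.
	const dumpFile = await sourceDb.dumpDataDir();

	// Placeholder upload step: push the tarball to wherever the CDN serves it from.
	await fetch('https://example-cdn.invalid/pglite-dump.tar.gz', {
		method: 'PUT',
		body: dumpFile
	});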

@samwillis
Collaborator

Hey @lucasgelfond

As you have spotted, the best way to solve this import is to use the COPY command with /dev/blob, constructing a CSV for Postgres to import from. You will find that significantly faster. Would two CSVs not work for you for the two tables?

Alternatively, batching the inserts with multiple records per INSERT, i.e. VALUES ($1, $2, $3, $4, $5), ($6, $7, $8, $9, $10), ..., can also help a lot, as can wrapping everything in a transaction, or importing into a temp table and then copying across to the final table.
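
Roughly, the /dev/blob route looks like this (a sketch, assuming a pageCsvBlob whose columns match the page table):

	// Import: stream a CSV Blob in through the /dev/blob virtual device.
	await db.query("COPY page FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: pageCsvBlob // a Blob/File containing the CSV
	});

	// Export works the same way in reverse: the result object carries a blob.
	const ret = await db.query("COPY page TO '/dev/blob' WITH (FORMAT csv, HEADER true);");
	const csvBlob = ret.blob;

And a sketch of the batched-insert variant, wrapped in a transaction (the batch size is arbitrary):

	await db.transaction(async (tx) => {
		const batchSize = 500;
		for (let i = 0; i < rows.length; i += batchSize) {
			const batch = rows.slice(i, i + batchSize);
			// Build ($1, $2, $3, $4, $5), ($6, $7, $8, $9, $10), ... for this batch.
			const placeholders = batch
				.map((_, j) => `($${j * 5 + 1}, $${j * 5 + 2}, $${j * 5 + 3}, $${j * 5 + 4}, $${j * 5 + 5})`)
				.join(', ');
			const params = batch.flatMap((r) => [
				r.parent_issue_id, r.page_number, r.ocr_result, r.embedding, r.image_url
			]);
			await tx.query(
				`INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url)
				 VALUES ${placeholders}`,
				params
			);
		}
	});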

@treetopdevs

@samwillis Is there an example somewhere of using COPY and a CSV to import?

@samwillis
Collaborator

@treetopdevs we have an example here where we copy to a CSV and back again: https://github.com/electric-sql/pglite/blob/main/packages/pglite/examples/copy.html

@lucasgelfond
Author

lucasgelfond commented Mar 15, 2025

@samwillis thank you!

Still some issues.

I got the CSV import to work (after dumping directly with pg_dump; Supabase's CSV encoding broke!), but I need to do it in a temp table first. I benchmarked it, and it works alright: the CSV import itself is pretty quick (~4s), but it takes forever to copy the table over and convert to the correct types (~31s).
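
For reference, the temp-table version looks roughly like this (a sketch: the staging columns are all text, and the casts are just illustrative of my schema):

	await db.exec(`
		CREATE TEMP TABLE page_staging (
			parent_issue_id text,
			page_number text,
			ocr_result text,
			embedding text,
			image_url text
		);
	`);

	// Fast part (~4s): load the CSV into the all-text staging table via /dev/blob.
	await db.query("COPY page_staging FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: csvFile
	});

	// Slow part (~31s): cast into the real column types in the final table.
	await db.exec(`
		INSERT INTO page (parent_issue_id, page_number, ocr_result, embedding, image_url)
		SELECT parent_issue_id::uuid, page_number::int, ocr_result, embedding::vector(384), image_url
		FROM page_staging;
	`);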

When I try to do this without a temp table I always get errors, most commonly error: extra data after last expected column, or errors about the schema, e.g. error: column "error" of relation "page" does not exist (there is an error column in my database!).

I figure a good workaround here is to do the (slow) CSV-import-with-temp-table once, and then load from a tarball. I did dumpDataDir, the database is ~50 MB, and the load succeeds, but it is missing a ton of my data. I have two tables, issue (~150 rows) and page (~4500 rows, and much larger); the tarball load is fast (~1s) but only brings in the issue table. That said, it recognizes the page schema, and when I dumpDataDir this again, it's still ~50 MB. What's going on?

Thanks again for your help!

@lucasgelfond
Author

Happy to send code and/or data if you want to repro; I think this probably is actually a bug. I'm mostly working in https://github.com/lucasgelfond/semantic-whole-earth/blob/main/frontend/src/lib/setup.ts, although this may change over the next few hours. I can also try to more minimally repro any of these individual issues if helpful!

@lucasgelfond
Author

lucasgelfond commented Mar 15, 2025

Somewhat promising: if I do the CSV import but only import the page (~4500-row) table w/ a temp table, dump it, and then reimport it, this does work and only takes about ~2s. My stopgap will probably be doing this and then manually importing the row CSV (because it will be much faster), but obviously I would rather have one tarball import work (it seems like a bug that only one table is getting imported?).
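
For anyone following along, the client boot path for this stopgap looks roughly like this (a sketch: the URLs are placeholders, and I'm assuming the small issue table is the one loaded from CSV):

	const [tarball, issueCsv] = await Promise.all([
		fetch('/data/page-dump.tar.gz').then((r) => r.blob()),
		fetch('/data/issue.csv').then((r) => r.blob())
	]);

	const db = new PGlite('idb://supa-semantic-search', {
		loadDataDir: tarball, // page-only dump, reimports in ~2s
		extensions: { vector, uuid_ossp }
	});
	await db.waitReady;

	// The remaining small table comes in via COPY from a CSV Blob.
	await db.query("COPY issue FROM '/dev/blob' WITH (FORMAT csv, HEADER true);", [], {
		blob: issueCsv
	});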
