Bug: csv parsing from streams mangled #395

Open
hbwhbw opened this issue Mar 21, 2025 · 0 comments

Here's a minimal script that:

  • builds an Arquero (aq) table
  • exports it to a CSV string, reads the string back, and counts the rows
  • writes the string to disk, reads the file back, and counts a different number of rows!

The result is consistent when the script is run once, but non-deterministic (a different wrong count each time) when the load is run in a loop.

So there are two bugs, stated here as the behavior I expect:

  • row counts should be preserved across writing a file and reading it back in
  • no state should be carried from one invocation of loadCSV to the next

Reproduced with Node 22 on Linux and Node 20 on Windows.
I also have a web-based version where you can vary the row and column counts and see slightly different bad behavior.
If it works with 10K rows and 20 columns, try 100K rows and 200 columns.

// csv_fail.mjs
// $ node csv_fail.mjs
import * as aq from "arquero";
import { writeFileSync, unlinkSync } from 'fs';

// make the test CSV
const numRows = 10000;
const numColumns = 20;
const one = Object.fromEntries(Array(numColumns).fill(null).map((_, i) => [`a${i + 1}`, '"']));
const many = Array(numRows).fill(one);
const built = aq.from(many);
const csvFromBuilt = built.toCSV();

// read it back from a string (good)
const fromCsvFromBuilt = await aq.fromCSV(csvFromBuilt);
// prints 10000 (good)
console.log(fromCsvFromBuilt.numRows());

// write it to disk and read it back (bad)
writeFileSync('test.csv', csvFromBuilt);
const loadCsvFromDisk = await aq.loadCSV('test.csv');
// prints 4587 (bad)
console.log(loadCsvFromDisk.numRows());

unlinkSync('test.csv');
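
A possible follow-up check, sketched without Arquero so it runs on Node alone: write a CSV of the same shape to disk, read it back in one piece with readFileSync, and compare both the raw text and a record count. Everything here is an assumption for illustration: `makeCsv` and `countRecords` are hypothetical helpers, not Arquero APIs, and the `""""` cell assumes a lone quote is escaped the RFC 4180 way, which may not match what toCSV emits byte for byte. If this check passes, the file on disk is probably intact, which would point at the read path rather than the write.

```javascript
// roundtrip_check.mjs (hypothetical helper script, not part of arquero)
// $ node roundtrip_check.mjs
import { writeFileSync, readFileSync, unlinkSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Build a CSV with the same shape as the repro: a header row plus numRows
// rows whose every cell is a single quote character, written as """"
// (assumption: RFC 4180 style escaping).
function makeCsv(numRows, numColumns) {
  const header = Array.from({ length: numColumns }, (_, i) => `a${i + 1}`).join(',');
  const row = Array(numColumns).fill('""""').join(',');
  return [header, ...Array(numRows).fill(row)].join('\n');
}

// Count records (header included) with a tiny quote-aware scanner: a newline
// ends a record only when it falls outside a quoted field. An escaped quote
// ("") flips the flag twice, so it leaves the quoted state unchanged.
function countRecords(text) {
  let records = 0;
  let inQuotes = false;
  let sawData = false;
  for (const ch of text) {
    if (ch === '"') {
      inQuotes = !inQuotes;
      sawData = true;
    } else if (ch === '\n' && !inQuotes) {
      if (sawData) records++;
      sawData = false;
    } else if (ch !== '\r') {
      sawData = true;
    }
  }
  // Count a final record that has no trailing newline.
  return sawData ? records + 1 : records;
}

const csv = makeCsv(10000, 20);
const file = join(tmpdir(), `roundtrip_${process.pid}.csv`);

// Write the whole string, then read the whole file back in one call,
// so no chunked or streaming reads are involved.
writeFileSync(file, csv);
const fromDisk = readFileSync(file, 'utf8');
unlinkSync(file);

const bytesMatch = fromDisk === csv;
const recordsInString = countRecords(csv);
const recordsOnDisk = countRecords(fromDisk);
console.log({ bytesMatch, recordsInString, recordsOnDisk });
```

The counts include the header row, so they are one higher than numRows. To apply the same check to the real data, the `csv` string could be replaced by the output of built.toCSV() from the script above.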