Support arbitrary separators in read_csv #61

Open
apcamargo opened this issue Mar 7, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@apcamargo

Description

TSV files are pretty common, but I couldn't make them work with polars-cli. Apparently, the read_csv function of polars-cli doesn't take the same arguments as the one in Python (maybe it doesn't support arguments at all?)

From what I've seen in table_functions.rs, read_csv uses a LazyCsvReader, which apparently supports the separator argument. I can't say I'm very familiar with Polars' code, though. I might be missing something obvious.
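
To illustrate what passing a separator through would involve, here is a minimal sketch of turning the text of a `separator='...'` argument into a value a CSV reader can split on. It assumes the reader wants the separator as a single byte, which is not confirmed in this thread, and `parse_separator` is a hypothetical helper name, not something taken from polars-sql or polars-cli.

```rust
/// Hypothetical helper (not part of polars): convert the contents of a
/// `separator='...'` SQL argument into the single byte to split fields on.
fn parse_separator(arg: &str) -> Result<u8, String> {
    // Accept the two-character escape `\t` so tab-separated files can be
    // described in a SQL string literal.
    let unescaped = match arg {
        "\\t" => "\t",
        other => other,
    };
    // Assumption: the reader takes exactly one byte as its separator, so
    // empty and multi-byte values are rejected with an error message.
    match unescaped.as_bytes() {
        [byte] => Ok(*byte),
        _ => Err(format!("separator must be a single byte, got {arg:?}")),
    }
}
```

With this sketch, `parse_separator(";")` and `parse_separator("\\t")` yield the bytes for a semicolon and a tab, while an empty or multi-character value returns an error instead of being silently ignored.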

@apcamargo apcamargo added the enhancement New feature or request label Mar 7, 2024
@jeroenflvr

I have a use case for this. Currently, read_csv() only accepts a single file path and hardcodes try_parse_dates(true) and missing_is_null(true). It would be great if polars-cli's read_csv() were at feature parity with the Python API's read_csv().

I was under the impression someone was already working on this a while ago, but I can take it.

@jeroenflvr

jeroenflvr commented Mar 3, 2025

For this:

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.36s
     Running `target/debug/polars`
Polars CLI version 0.9.0
Type .help for help.
〉select * FROM read_csv('foods.csv');
┌────────────┬──────────┬────────┬──────────┐
│ category   ┆ calories ┆ fats_g ┆ sugars_g │
│ ---        ┆ ---      ┆ ---    ┆ ---      │
│ str        ┆ i64      ┆ f64    ┆ i64      │
╞════════════╪══════════╪════════╪══════════╡
│ vegetables ┆ 45       ┆ 0.5    ┆ 2        │
│ seafood    ┆ 150      ┆ 5.0    ┆ 0        │
│ meat       ┆ 100      ┆ 5.0    ┆ 0        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ seafood    ┆ 140      ┆ 5.0    ┆ 1        │
│ …          ┆ …        ┆ …      ┆ …        │
│ seafood    ┆ 100      ┆ 5.0    ┆ 0        │
│ seafood    ┆ 200      ┆ 10.0   ┆ 0        │
│ seafood    ┆ 200      ┆ 7.0    ┆ 2        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ meat       ┆ 110      ┆ 7.0    ┆ 0        │
└────────────┴──────────┴────────┴──────────┘
〉select * FROM read_csv('foods_semicolon.csv');
┌─────────────────────────────────┐
│ category;calories;fats_g;sugar… │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ vegetables;45;0.5;2             │
│ seafood;150;5;0                 │
│ meat;100;5;0                    │
│ fruit;60;0;11                   │
│ seafood;140;5;1                 │
│ …                               │
│ seafood;100;5;0                 │
│ seafood;200;10;0                │
│ seafood;200;7;2                 │
│ fruit;60;0;11                   │
│ meat;110;7;0                    │
└─────────────────────────────────┘
〉select * FROM read_csv('foods_semicolon.csv', separator=';');
┌────────────┬──────────┬────────┬──────────┐
│ category   ┆ calories ┆ fats_g ┆ sugars_g │
│ ---        ┆ ---      ┆ ---    ┆ ---      │
│ str        ┆ i64      ┆ f64    ┆ i64      │
╞════════════╪══════════╪════════╪══════════╡
│ vegetables ┆ 45       ┆ 0.5    ┆ 2        │
│ seafood    ┆ 150      ┆ 5.0    ┆ 0        │
│ meat       ┆ 100      ┆ 5.0    ┆ 0        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ seafood    ┆ 140      ┆ 5.0    ┆ 1        │
│ …          ┆ …        ┆ …      ┆ …        │
│ seafood    ┆ 100      ┆ 5.0    ┆ 0        │
│ seafood    ┆ 200      ┆ 10.0   ┆ 0        │
│ seafood    ┆ 200      ┆ 7.0    ┆ 2        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ meat       ┆ 110      ┆ 7.0    ┆ 0        │
└────────────┴──────────┴────────┴──────────┘
〉select * FROM read_csv('foods_hash.csv', separator=';');
┌─────────────────────────────────┐
│ category#calories#fats_g#sugar… │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ vegetables#45#0.5#2             │
│ seafood#150#5#0                 │
│ meat#100#5#0                    │
│ fruit#60#0#11                   │
│ seafood#140#5#1                 │
│ …                               │
│ seafood#100#5#0                 │
│ seafood#200#10#0                │
│ seafood#200#7#2                 │
│ fruit#60#0#11                   │
│ meat#110#7#0                    │
└─────────────────────────────────┘
〉select * FROM read_csv('foods_hash.csv', separator='#');
┌────────────┬──────────┬────────┬──────────┐
│ category   ┆ calories ┆ fats_g ┆ sugars_g │
│ ---        ┆ ---      ┆ ---    ┆ ---      │
│ str        ┆ i64      ┆ f64    ┆ i64      │
╞════════════╪══════════╪════════╪══════════╡
│ vegetables ┆ 45       ┆ 0.5    ┆ 2        │
│ seafood    ┆ 150      ┆ 5.0    ┆ 0        │
│ meat       ┆ 100      ┆ 5.0    ┆ 0        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ seafood    ┆ 140      ┆ 5.0    ┆ 1        │
│ …          ┆ …        ┆ …      ┆ …        │
│ seafood    ┆ 100      ┆ 5.0    ┆ 0        │
│ seafood    ┆ 200      ┆ 10.0   ┆ 0        │
│ seafood    ┆ 200      ┆ 7.0    ┆ 2        │
│ fruit      ┆ 60       ┆ 0.0    ┆ 11       │
│ meat       ┆ 110      ┆ 7.0    ┆ 0        │
└────────────┴──────────┴────────┴──────────┘
〉SELECT *
〉FROM read_csv('foods_tab.csv', separator='\t')
〉WHERE fats_g >= 10;
┌──────────┬──────────┬────────┬──────────┐
│ category ┆ calories ┆ fats_g ┆ sugars_g │
│ ---      ┆ ---      ┆ ---    ┆ ---      │
│ str      ┆ i64      ┆ f64    ┆ i64      │
╞══════════╪══════════╪════════╪══════════╡
│ meat     ┆ 120      ┆ 10.0   ┆ 1        │
│ seafood  ┆ 200      ┆ 10.0   ┆ 0        │
└──────────┴──────────┴────────┴──────────┘
〉

stdin:

$ cat foods_query.sql
SELECT
    category,
    calories
FROM read_csv('foods_semicolon.csv', separator = ';')
WHERE calories > 100;
$ cat foods_query.sql | polars
┌──────────┬──────────┐
│ category ┆ calories │
│ ---      ┆ ---      │
│ str      ┆ i64      │
╞══════════╪══════════╡
│ seafood  ┆ 150      │
│ seafood  ┆ 140      │
│ meat     ┆ 120      │
│ seafood  ┆ 130      │
│ meat     ┆ 110      │
│ …        ┆ …        │
│ seafood  ┆ 200      │
│ meat     ┆ 110      │
│ seafood  ┆ 200      │
│ seafood  ┆ 130      │
│ fruit    ┆ 130      │
└──────────┴──────────┘
$

Done:

  • polars-cli: modify buffer handling on ';' (using the parser combinator library nom from the workspace dependencies; still need to check whether that's OK with the guidelines, versions, etc.)
  • polars-sql: handle args for read_csv() in table_functions.rs to match read and parse options
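
The buffer handling matters presumably because the CLI uses ';' to decide where a statement ends, so a `separator=';'` argument would otherwise cut the statement short. The sketch below shows one way to find the terminating ';' while skipping any that sit inside a single-quoted string literal. It uses only the standard library rather than nom, and `find_statement_end` is a hypothetical name, not the function from the actual change. Double-quoted identifiers and SQL comments are deliberately not handled.

```rust
/// Hypothetical helper (not the actual polars-cli code): return the byte
/// index of the first ';' that is outside a single-quoted SQL string
/// literal, or `None` if the buffer does not yet hold a full statement.
fn find_statement_end(buffer: &str) -> Option<usize> {
    let mut in_quotes = false;
    for (idx, ch) in buffer.char_indices() {
        match ch {
            // Each quote flips the state. A doubled quote ('') inside a
            // literal flips it twice, so the state is unchanged afterwards.
            '\'' => in_quotes = !in_quotes,
            // Only a ';' outside a literal terminates the statement.
            ';' if !in_quotes => return Some(idx),
            _ => {}
        }
    }
    None
}
```

For the query `select * FROM read_csv('foods_semicolon.csv', separator=';');` this returns the index of the final ';', and for a buffer whose only ';' is inside quotes it returns `None`, which a REPL could take as a signal to keep reading input.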

@jeroenflvr

Before continuing: @ritchie46, is there any specific reason why we shouldn't extend read_csv()?

If not, should I first create an issue on the main repo?

There's an explicit test on the number of arguments passed, so I'm not sure whether that's there for the sake of test completeness or to block extending the function.
