perf: Add .list.struct_field to obtain fields in lists of structs #21556
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #21556 +/- ##
==========================================
- Coverage 79.75% 79.75% -0.01%
==========================================
Files 1591 1591
Lines 229480 229521 +41
Branches 2625 2625
==========================================
+ Hits 183018 183049 +31
- Misses 45857 45867 +10
Partials 605 605
This feels too specific to me. I think this should be a plugin.
I also briefly thought about this @ritchie46 but, to me, this feels rather "core" to working with lists & structs. IMO it should be in polars' realm to optimally deal with Arrow's memory layout for basic operations (I'd personally count accessing a struct field, even within a list, as such). Unless you have a strong opinion against this, I'd love to hear if others share your sentiment 👀

One alternative idea: I realized that it would also be possible to check for

    Expr::Function {
        input,
        function: FunctionExpr::StructExpr(StructFunction::FieldByName(name)),
        options: _,
    } if input.len() == 1 => {
        if let Expr::Column(col_name) = &input[0]
            && *col_name == PlSmallStr::EMPTY
        {
            ...
        }
    }

This way, we could keep the public interface of the list namespace like it is and short-circuit in the
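To make the short-circuit idea above a bit more concrete, here is a minimal sketch in plain Python. It is not polars code: `Column`, `StructField`, and `fast_path_field` are hypothetical stand-ins for the expression nodes in the Rust snippet, and the empty column name is assumed to play the same role as the `PlSmallStr::EMPTY` check there.

```python
from dataclasses import dataclass

# Hypothetical toy expression nodes (illustrative only, not polars types).
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class StructField:
    input: object  # the expression the field is taken from
    field: str     # name of the struct field

def fast_path_field(expr):
    """Return the field name if expr is exactly 'field of the bare element', else None."""
    if (
        isinstance(expr, StructField)
        and isinstance(expr.input, Column)
        and expr.input.name == ""  # assumption: empty name marks the list element
    ):
        return expr.field
    return None

# A plain field access on the element qualifies for the fast path ...
print(fast_path_field(StructField(Column(""), "x")))   # x
# ... while a field access on a named column does not.
print(fast_path_field(StructField(Column("a"), "x")))  # None
```

With a check like this, the user-facing expression would stay unchanged and only the evaluation would branch into a cheaper code path when the pattern matches.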
I think this can be more generic. We can add a codepath to directly apply the
Yes, that sounds like something I'd much rather do. Improve the
Ok! I'll need to dig in there a little bit, I'll update this PR with changes/questions :)
Motivation
When using lists of structs, a common operation is to reduce the list to only a single field of the struct. Unfortunately, the only way to currently do this in polars is to use the eval method. This is suboptimal since it actually copies the data for that field even though the Arrow memory layout would not require this.
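As a rough illustration of why no copy should be needed, the sketch below models an Arrow-style list-of-structs column with plain Python lists: one offsets buffer for the list, and one child buffer per struct field. The names `list_struct_field` and `to_rows` are made up for this example, and validity bitmaps are ignored for brevity.

```python
# Arrow-style layout for a list<struct<x, y>> column holding the rows
#   [{x: 1, y: "a"}, {x: 2, y: "b"}], [], [{x: 3, y: "c"}]
offsets = [0, 2, 2, 3]  # row i spans values[offsets[i]:offsets[i + 1]]
struct_children = {
    "x": [1, 2, 3],        # child buffer for field "x"
    "y": ["a", "b", "c"],  # child buffer for field "y"
}

def list_struct_field(offsets, children, name):
    """Project list<struct> to list<field> by reusing the offsets and the child buffer."""
    return offsets, children[name]

def to_rows(offsets, values):
    """Materialize the logical rows (only for display; this step does copy)."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

field_offsets, field_values = list_struct_field(offsets, struct_children, "x")

# The projected column shares the original child buffer; nothing was copied.
assert field_values is struct_children["x"]
print(to_rows(field_offsets, field_values))  # [[1, 2], [], [3]]
```

The projection itself only picks a different child buffer and keeps the same offsets, so its cost does not depend on the number of rows.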
This PR adds a new function in the list expression namespace, namely struct_field. It allows reducing a list of structs to a single field without performing an expensive copy.

On a data frame with 1,000,000 rows, struct_field is ~40x faster than using eval. On a data frame with 10,000,000 rows, it is up to 100x faster.