Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Add .list.struct_field to obtain fields in lists of structs #21556

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

borchero
Copy link
Contributor

@borchero borchero commented Mar 2, 2025

Motivation

When using lists of structs, a common operation is to reduce the list to only a single field of the struct. Unfortunately, the only way to currently do this in polars is to use the eval method:

df.list.eval(pl.element().struct.field("..."))

This is suboptimal since it actually copies the data for that field even though the Arrow memory layout would not require this.

This PR adds a new function in the list expression namespace, namely struct_field. It allows to reduce a list of structs to a single field without performing an expensive copy:

df.list.struct_field("...")

On a data frame with 1,000,000 rows, struct_field is ~40x faster than using eval. On a data frame with 10,000,000 rows, it is even up to 100x faster.

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Mar 2, 2025
Copy link

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79.75%. Comparing base (2ae7287) to head (e38041a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/plans/expr_ir.rs 0.00% 3 Missing ⚠️
crates/polars-plan/src/dsl/function_expr/list.rs 90.47% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21556      +/-   ##
==========================================
- Coverage   79.75%   79.75%   -0.01%     
==========================================
  Files        1591     1591              
  Lines      229480   229521      +41     
  Branches     2625     2625              
==========================================
+ Hits       183018   183049      +31     
- Misses      45857    45867      +10     
  Partials      605      605              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46
Copy link
Member

This feels too specific to me. I think this should be a plugin.

@borchero
Copy link
Contributor Author

borchero commented Mar 2, 2025

I also briefly thought about this @ritchie46 but, to me, this feels rather "core" to working with lists & structs. IMO it should be in polars' realm to optimally deal with Arrow's memory layout for basic operations (I'd personally count accessing a struct field, even within a list, as such). Unless you have a strong opinion against this, I'd love to hear if others share your sentiment 👀


One alternative idea: I realized that it would also be possible to check for pl.element().struct.field("...") in the eval method by pattern matching on

Expr::Function {
    input,
    function: FunctionExpr::StructExpr(StructFunction::FieldByName(name)),
    options: _,
} if input.len() == 1 => {
    if let Expr::Column(col_name) = &input[0]
        && *col_name == PlSmallStr::EMPTY
    {
        ...
    }
}

This way, we could keep the public interface of the list namespace like it is and short-circuit in the eval implementation. What do you think about that? This would also allow all users which currently use this pattern to profit from the performance improvement without any need for changes.

@nameexhaustion
Copy link
Collaborator

I think this can be more generic. We can add a codepath to directly apply the list.eval expression to the underlying values array if we see that it is elementwise.

@ritchie46
Copy link
Member

I think this can be more generic. We can add a codepath to directly apply the list.eval expression to the underlying values array if we see that it is elementwise.

Yes, that sounds like something I'd much rather do. Improve the eval in a generic fashion.

@borchero
Copy link
Contributor Author

borchero commented Mar 5, 2025

Ok! I'll need to dig in there a little bit, I'll update this PR with changes/questions :)

@borchero borchero marked this pull request as draft March 16, 2025 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants