perf: Add `.list.struct_field` to obtain fields in lists of structs #21556

borchero · 2025-03-02T18:22:02Z

Motivation

When working with lists of structs, a common operation is to reduce each list to a single field of the struct. Unfortunately, the only way to do this in polars today is the eval method:

df.list.eval(pl.element().struct.field("..."))

This is suboptimal because it copies the data for that field, even though the Arrow memory layout does not require a copy.

This PR adds a new function in the list expression namespace, namely struct_field. It reduces a list of structs to a single field without performing an expensive copy:

df.list.struct_field("...")

On a data frame with 1,000,000 rows, struct_field is ~40x faster than using eval. On a data frame with 10,000,000 rows, it is even up to 100x faster.
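
To illustrate why no copy should be needed, here is a minimal pure-Python sketch of an Arrow-style layout. It is not the polars implementation: the `StructArray` and `ListArray` classes and the helper functions are hypothetical stand-ins, and validity bitmaps (nulls) are ignored. A list array is modelled as offsets plus a flat child array, and a struct array as one flat column per field, so selecting a field only needs to pair the existing offsets with the existing field column.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StructArray:
    # Hypothetical model: one flat column per field, all of equal length.
    fields: dict[str, list]


@dataclass
class ListArray:
    # Row i spans values[offsets[i]:offsets[i + 1]] in the flat child array.
    offsets: list[int]
    values: object  # a StructArray or a plain flat column


def struct_field(arr: ListArray, name: str) -> ListArray:
    # Reuse the offsets and the field's existing column; nothing is copied.
    return ListArray(arr.offsets, arr.values.fields[name])


def to_pylist(arr: ListArray) -> list[list]:
    # Materialise nested Python lists (only used here to inspect the result).
    return [arr.values[s:e] for s, e in zip(arr.offsets, arr.offsets[1:])]


# Rows: [{a: 1, b: "x"}, {a: 2, b: "y"}], [], [{a: 3, b: "z"}]
structs = StructArray({"a": [1, 2, 3], "b": ["x", "y", "z"]})
lists = ListArray([0, 2, 2, 3], structs)

only_a = struct_field(lists, "a")
print(to_pylist(only_a))                     # [[1, 2], [], [3]]
print(only_a.values is structs.fields["a"])  # True: same buffer, no copy
```

In this model the cost of `struct_field` is independent of the number of rows, which is consistent with the speedup growing with the size of the data frame.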

codecov · 2025-03-02T18:33:41Z

Codecov Report

Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79.75%. Comparing base (2ae7287) to head (e38041a).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-plan/src/plans/expr_ir.rs	0.00%	3 Missing ⚠️
crates/polars-plan/src/dsl/function_expr/list.rs	90.47%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #21556      +/-   ##
==========================================
- Coverage   79.75%   79.75%   -0.01%     
==========================================
  Files        1591     1591              
  Lines      229480   229521      +41     
  Branches     2625     2625              
==========================================
+ Hits       183018   183049      +31     
- Misses      45857    45867      +10     
  Partials      605      605


ritchie46 · 2025-03-02T18:49:57Z

This feels too specific to me. I think this should be a plugin.

borchero · 2025-03-02T19:29:49Z

I also briefly thought about this @ritchie46 but, to me, this feels rather "core" to working with lists & structs. IMO it should be in polars' realm to optimally deal with Arrow's memory layout for basic operations (I'd personally count accessing a struct field, even within a list, as such). Unless you have a strong opinion against this, I'd love to hear if others share your sentiment 👀

One alternative idea: I realized that it would also be possible to check for pl.element().struct.field("...") in the eval method by pattern matching on

Expr::Function {
    input,
    function: FunctionExpr::StructExpr(StructFunction::FieldByName(name)),
    options: _,
} if input.len() == 1 => {
    if let Expr::Column(col_name) = &input[0]
        && *col_name == PlSmallStr::EMPTY
    {
        ...
    }
}

This way, we could keep the public interface of the list namespace as it is and short-circuit in the eval implementation. What do you think about that? It would also let all users who currently use this pattern benefit from the performance improvement without changing their code.

nameexhaustion · 2025-03-03T10:00:05Z

I think this can be more generic. We can add a codepath to directly apply the list.eval expression to the underlying values array if we see that it is elementwise.
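
A rough sketch of that idea in plain Python, under simplifying assumptions: the list column is modelled as offsets plus one flat values array, the offsets start at 0 and cover every value, and nulls are ignored. The function names are hypothetical and do not correspond to polars internals. For an elementwise function, evaluating once over the flat values and reusing the offsets gives the same result as evaluating each sublist separately.

```python
from typing import Callable


def eval_per_sublist(offsets: list[int], values: list, f: Callable) -> list[list]:
    # Baseline: slice out every sublist and evaluate f on its elements.
    return [[f(v) for v in values[s:e]] for s, e in zip(offsets, offsets[1:])]


def eval_on_values(offsets: list[int], values: list, f: Callable):
    # Elementwise shortcut: apply f once over the flat values array.
    # The list structure is unchanged, so the offsets are reused as-is.
    return offsets, [f(v) for v in values]


def to_pylist(offsets: list[int], values: list) -> list[list]:
    # Rebuild nested lists from offsets and flat values for comparison.
    return [values[s:e] for s, e in zip(offsets, offsets[1:])]


offsets = [0, 2, 2, 5]      # rows: [1, 2], [], [3, 4, 5]
values = [1, 2, 3, 4, 5]


def times_ten(x: int) -> int:
    return x * 10


slow = eval_per_sublist(offsets, values, times_ten)
new_offsets, new_values = eval_on_values(offsets, values, times_ten)
fast = to_pylist(new_offsets, new_values)

print(slow == fast)  # True
```

The shortcut is only valid when the expression is elementwise. An aggregation or any operation that depends on sublist boundaries (for example a per-list sum or sort) would still need the per-sublist path.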

ritchie46 · 2025-03-05T14:40:11Z

> I think this can be more generic. We can add a codepath to directly apply the list.eval expression to the underlying values array if we see that it is elementwise.

Yes, that sounds like something I'd much rather do. Improve the eval in a generic fashion.

borchero · 2025-03-05T15:50:56Z

Ok! I'll need to dig in there a little bit, I'll update this PR with changes/questions :)

perf: Add .list.struct_field to obtain fields in lists of structs

3681028

borchero requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners March 2, 2025 18:22

github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Mar 2, 2025

borchero added 3 commits March 2, 2025 21:13

Merge remote-tracking branch 'origin/main' into struct-field

80ccb50

Rerun

d6fd30c

Fix ci

e38041a

borchero marked this pull request as draft March 16, 2025 00:45
