Allow the connector to delegate filter statistics estimation to the engine #844
Please let me know if this approach is valid - I'll be happy to change/adapt the code to support this feature.
// If the connector doesn't support statistics estimation for a pushed-down predicate, we use FilterStatsCalculator
// to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).
Map<ColumnHandle, Symbol> assignments = ImmutableBiMap.copyOf(node.getAssignments()).inverse();
Expression predicate = domainTranslator.toPredicate(unestimatedPredicate.simplify().transform(assignments::get));
@sopel39 I'd prefer a pull-based approach to stats derivation. For example, if this is just SELECT ... WHERE ..., we need no stats during planning.
Also, a pushed-down predicate may be refined several times before we ask for stats.
Having noted that, I haven't yet looked into which options are technically viable.
> Also, a pushed down predicate may be refined several times before we ask for stats.
That might happen regardless of whether there is an unestimatedPredicate. We always pull stats from the connector when needed.
We might have a similar issue with projection pushdown. An unestimatedProjection might be harder to implement, though.
I have another idea. Some connectors accept a pushed-down filter but don't remove it from the plan (I think JDBC connectors do this).
This approach also solves the stats problem: a connector can report stats for the whole table, and the planner will infer stats for the filter predicate, because it's still in the plan.
> Some connectors accept pushed down filter but don't remove it from the plan (i think JDBC connectors do this).
We can definitely use this approach, but wouldn't it have a small performance cost, since the scanned rows would have to pass through a filter operator that acts as a no-op?
The motivation behind this PR is to use an approach similar to the ConnectorMetadata#applyFilter API, which returns the "unapplied" predicates to the engine via ConstraintApplicationResult.
Since I didn't want to break the existing API, I've added the "unestimated" predicates to the existing TableStatistics object.
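The split described above can be illustrated with a minimal sketch. It assumes a predicate modeled as a map from column name to a textual constraint; `EstimationSplit`, `splitForEstimation` and `estimableColumns` are hypothetical names used only for illustration and are not part of the actual SPI or of this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical result type: the part of the predicate the connector accounted for
// in its statistics, and the part handed back to the engine as "unestimated".
final class EstimationSplit {
    final Map<String, String> estimated;
    final Map<String, String> unestimated;

    EstimationSplit(Map<String, String> estimated, Map<String, String> unestimated) {
        this.estimated = estimated;
        this.unestimated = unestimated;
    }
}

final class PredicateSplitSketch {
    // Partition per-column constraints by whether the connector can estimate them.
    static EstimationSplit splitForEstimation(Map<String, String> predicate, Set<String> estimableColumns) {
        Map<String, String> estimated = new LinkedHashMap<>();
        Map<String, String> unestimated = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : predicate.entrySet()) {
            if (estimableColumns.contains(entry.getKey())) {
                estimated.put(entry.getKey(), entry.getValue());
            } else {
                unestimated.put(entry.getKey(), entry.getValue());
            }
        }
        return new EstimationSplit(estimated, unestimated);
    }
}
```

The engine would then apply its own heuristics only to the `unestimated` part, much as it filters only the predicates a connector leaves unapplied.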
> total elapsed time also increases from 1s to 2.5s for this specific query.
Is the Varada data store able to execute a "wide OR" much faster than Presto? How do you do this?
Btw, what times do you see when you replace this "wide OR" with an equivalent IN expression?
@Praveen2112 is working on an optimization (related to #932) that may roll such an OR into an IN, which should be faster to execute.
> Is Varada data store able to execute "wide OR" much faster than Presto? How do you do this?
We have an optimized storage engine implementation that supports highly effective predicate pushdown for most filter expressions.
> Btw, what times do you see when you replace this "wide OR" with equivalent IN expression?
Good idea, I'll check and update with the results.
> that may potentially roll such OR into an IN, which should be faster to execute.
If the above predicate is pushed down to the TableScan, then it happens by default (provided the connector supports predicate pushdown). The optimization we are working on will apply in the case described in #932, or when the connector doesn't support predicate pushdown.
> Good idea, will check and update with the results.
If I use IN (instead of a wide OR), the query finishes after ~1.9 seconds and uses ~37 CPU-seconds (see here).
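For readers unfamiliar with the rewrite discussed here, the following is a minimal sketch of rolling a wide OR of equality comparisons on one column into a single IN list. `Equality` and `rollIntoIn` are hypothetical names; this is not the optimizer rule being developed for #932, only an illustration of the idea.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of one disjunct of the form `column = value`.
final class Equality {
    final String column;
    final long value;

    Equality(String column, long value) {
        this.column = column;
        this.value = value;
    }
}

final class OrToInSketch {
    // Returns "column IN (v1, v2, ...)" when every disjunct compares the same column;
    // returns empty when the disjuncts are empty or span several columns.
    static Optional<String> rollIntoIn(List<Equality> disjuncts) {
        if (disjuncts.isEmpty()) {
            return Optional.empty();
        }
        String column = disjuncts.get(0).column;
        Set<String> values = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        for (Equality disjunct : disjuncts) {
            if (!disjunct.column.equals(column)) {
                return Optional.empty();
            }
            values.add(Long.toString(disjunct.value));
        }
        return Optional.of(column + " IN (" + String.join(", ", values) + ")");
    }
}
```

Under these assumptions, `x = 1 OR x = 2 OR x = 1` becomes `x IN (1, 2)`, while a disjunction that mixes columns is left alone.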
Ping :)
I'm not sure what was decided here regarding #844 (comment)
…ngine If the connector don't support statistics estimation for pushed-down predicate, we use FilterStatsCalculator to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).
Ping :)
I think this is reasonable. It behaves as if the connector didn't actually handle that part of the predicate and it was left to the engine to do the actual filtering and estimate stats.
With #3697 this is going to be even more important.
#6998 suggests a better approach than this PR.
If the connector doesn't support statistics estimation for a pushed-down predicate, we use FilterStatsCalculator to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).
It would allow connectors that support generic predicate pushdown, but don't support efficient statistics estimation with a predicate, to return a better estimate of the filtered statistics.
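A rough sketch of the fallback described above, under simplifying assumptions: a single numeric column with uniformly distributed values and a pushed-down range predicate. `ScanStats` and `estimateRows` are hypothetical names; the actual change relies on FilterStatsCalculator, and this sketch does not reproduce its logic.

```java
import java.util.Optional;

// Hypothetical, simplified statistics for a scan over one numeric column.
final class ScanStats {
    final double rowCount;
    final double columnMin;
    final double columnMax;
    // Pushed-down range [low, high] that the connector did NOT account for in rowCount.
    final Optional<double[]> unestimatedRange;

    ScanStats(double rowCount, double columnMin, double columnMax, Optional<double[]> unestimatedRange) {
        this.rowCount = rowCount;
        this.columnMin = columnMin;
        this.columnMax = columnMax;
        this.unestimatedRange = unestimatedRange;
    }
}

final class FallbackEstimateSketch {
    // If the connector already estimated the predicate, trust its row count;
    // otherwise scale it by the fraction of the column's value range the predicate covers.
    static double estimateRows(ScanStats stats) {
        if (!stats.unestimatedRange.isPresent()) {
            return stats.rowCount;
        }
        double[] range = stats.unestimatedRange.get();
        double low = Math.max(range[0], stats.columnMin);
        double high = Math.min(range[1], stats.columnMax);
        if (high < low) {
            return 0.0; // predicate range lies entirely outside the column's values
        }
        double width = stats.columnMax - stats.columnMin;
        if (width <= 0.0) {
            return stats.rowCount; // single-valued column: no basis for scaling
        }
        return stats.rowCount * (high - low) / width;
    }
}
```

For example, with 1000 rows, a column range of 0 to 100, and an unestimated range of 25 to 75, this sketch returns 500 rows; without an unestimated range it returns the connector's 1000 unchanged.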