Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the connector to delegate filter statistics estimation to the engine #844

Closed
wants to merge 1 commit into from

Conversation

rzeyde-varada
Copy link
Contributor

If the connector don't support statistics estimation for pushed-down predicate, we use FilterStatsCalculator to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).

It would allow connectors that support generic predicate pushdown, but don't support efficient statistics estimation with a predicate - to return a better estimate of the filtered statistics.

@cla-bot cla-bot bot added the cla-signed label May 29, 2019
@rzeyde-varada
Copy link
Contributor Author

Please let me know if this approach is valid - I'll be happy to change/adapt the code to support this feature.

@rzeyde-varada rzeyde-varada changed the title VDB-711 Allow the connector to delegate filter statistics estimation to the engine Allow the connector to delegate filter statistics estimation to the engine May 29, 2019
// If the connector don't support statistics estimation for pushed-down predicate, we use FilterStatsCalculator
// to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).
Map<ColumnHandle, Symbol> assignments = ImmutableBiMap.copyOf(node.getAssignments()).inverse();
Expression predicate = domainTranslator.toPredicate(unestimatedPredicate.simplify().transform(assignments::get));
Copy link
Member

@sopel39 sopel39 May 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that accuracy of stats can be greatly degraded by this operation.

Maybe instead the engine would also provide it's current estimates for pushed down expressions (e.g: ColumnStatistics after predicate is applied on raw table stats).

WDYT: @findepi @martint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 I'd prefer pull-based approach to stats derivation. For example, if this is just SELECT ... WHERE ... we need no stats during planning.
Also, a pushed down predicate may be refined several times before we ask for stats.
Having noted that, I didn't look into what are the technically viable options yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a pushed down predicate may be refined several times before we ask for stats.

That might happen independently if there is unestimatedPredicate or not. We always pull stats from connector when needed.

We might have similar issue with projection pushdown. unestimatedProjection might be harder to implement though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another idea. Some connectors accept pushed down filter but don't remove it from the plan (i think JDBC connectors do this).
This approach also solves stats problem -- a connector can report stats for the whole table, and the planner will infer stats for the filter predicate, because its still in the plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some connectors accept pushed down filter but don't remove it from the plan (i think JDBC connectors do this).

We can definitely use this approach, but wouldn't it be have a small performance cost (due to the need to pass the scanned rows through the filter operator - that would act as a no-op)?

The motivation behind this PR is to use similar approach to ConnectorMetadata#applyFilter API - which returns the "unapplied" predicates to the engine via ConstraintApplicationResult.
Since I didn't want to break the existing API - I've added the "unestimated" predicates to the existing TableStatistics object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total elapsed time also increases from 1s to 2.5s for this specific query.

Is Varada data store able to execute "wide OR" much faster than Presto? How do you do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, what times do you see when you replace this "wide OR" with equivalent IN expression?
@Praveen2112 is working on an optimization (related to #932) that may potentially roll such OR into an IN, which should be faster to execute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Varada data store able to execute "wide OR" much faster than Presto? How do you do this?

We have an optimized storage engine implementation, supporting highly effective predicate pushdown for most of the filter expressions.

Btw, what times do you see when you replace this "wide OR" with equivalent IN expression?

Good idea, will check and update with the results.

Copy link
Member

@Praveen2112 Praveen2112 Jun 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that may potentially roll such OR into an IN, which should be faster to execute.

If the above predicate is pushed down to the TableScan then it happens by default. (provided the connector supports predicate push down). The optimization that we are working will optimize in the case described (#932) or if the connector doesn't support predicate push down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will check and update with the results.

If I use IN (instead of wide OR), the query finishes after ~1.9 seconds and uses ~37 CPU-seconds (see here).

@rzeyde-varada
Copy link
Contributor Author

Ping :)

@sopel39
Copy link
Member

sopel39 commented Jun 27, 2019

I'm not sure what was decided here regarding #844 (comment)
@findepi @electrum ?
For me we could keep the filter above table scan.

@rzeyde-varada rzeyde-varada force-pushed the estimate-filter branch 2 times, most recently from 0e4ea1a to acc6c57 Compare February 12, 2020 11:43
…ngine

If the connector don't support statistics estimation for pushed-down predicate, we use FilterStatsCalculator
to heuristically update the PlanNodeStatsEstimate (similarly to how it's done by FilterStatsRule).
@rzeyde-varada
Copy link
Contributor Author

Ping :)

@martint martint self-requested a review May 31, 2020 07:29
Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable. It behaves as if the connector didn't actually handle that part of the predicate and it was left to the the engine to do actual filtering and estimate stats.

@findepi
Copy link
Member

findepi commented Jun 6, 2020

With #3697 this is going to be even more important.

@rzeyde-varada
Copy link
Contributor Author

#6998 suggests a better approach than this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants