Conversation

tokoko
Contributor

@tokoko tokoko commented Mar 15, 2025

Adds relations field to ExtendedExpression message. This is necessary because Expressions may contain ReferenceRels that currently can't point to anything. I chose to add relations with a repeated ExtendedExpressionRel type rather than a repeated Rel because I'm anticipating the merge of #726 which will require this message to also contain an anchor.

@vbarua
Member

vbarua commented Jun 18, 2025

Adds relations field to ExtendedExpression message. This is necessary because Expressions may contain ReferenceRels

What's a use case where we would want to allow an ExtendedExpression message to reference a relation? My understanding is that ExtendedExpressions are primarily intended as a way to permit the encoding of expression-only messages.

If there's an ExtendedExpression that references a full relation, I would prefer to push users towards using a standard Plan at that point.

@drin
Member

drin commented Jun 19, 2025

Moving my comment from #726 to here since it made more sense.

I'm assuming the relations field refers to inputs of the ExpressionReference and are not meant to be pointers to other Rel messages in the Plan? I just want to clarify that including an embedded Rel message is not a good way to reference another message in protobuf and it's not clear to me what the intention is for the embedded Rel message.

I'm also wondering:

  • why define an ExtendedExpressionRel (that only holds a Rel) instead of making the relations field point to PlanRel (either via anchor or ordinal)?
  • why can't ReferenceRels in expressions "point to anything"?

@tokoko
Contributor Author

tokoko commented Jun 19, 2025

What's a use case where we would want to allow an ExtendedExpression message to reference a relation?

My immediate use case might be a bit unusual. I'm using ExtendedExpression messages as necessary intermediary steps in substrait-python builders while incrementally building up plans. see an example. In order for some sort of a subquery expression to refer to a CTE, you would need to pass that information through, hence the need for a list of relations in ExtendedExpressions.

My understanding is that ExtendedExpressions are primarily intended as a way to permit the encoding of expression-only messages.

My understanding is that they encode computation (similar to plans) where the final form of the compute is an expression rather than a relation, so that they can be (re)used in different contexts.

Take something like x IN (SELECT * FROM t) as an example. Even now, a valid ExtendedExpression can contain arbitrarily complex computation through the Subquery expression type. I don't see a valid reason why we would allow that but disallow referring to CTEs in exactly the same contexts. In the example above, t can be either a table or the name of a CTE, after all.

@tokoko
Contributor Author

tokoko commented Jun 19, 2025

I'm assuming the relations field refers to inputs of the ExpressionReference and are not meant to be pointers to other Rel messages in the Plan? I just want to clarify that including an embedded Rel message is not a good way to reference another message in protobuf and it's not clear to me what the intention is for the embedded Rel message.

I'm not sure I fully follow the question tbh. The intention is to enable the use of CTEs in the context of ExtendedExpression similar to how it can now be used in the context of a Plan.

why can't ReferenceRels in expressions "point to anything"?

ReferenceRels by definition right now refer to a Rel defined in the relations array of a top-level message. This currently only makes sense for a Plan message, because it has a relations array where Rels can be defined, unlike the ExtendedExpression message, which doesn't.
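
To make the resolution rule concrete, here is a minimal Python sketch. It uses plain dataclasses as stand-ins for the protobuf messages (these are not the real substrait-python classes), and the `resolve` helper is hypothetical. The only detail taken from the spec is that a ReferenceRel carries a `subtree_ordinal` that indexes into the top-level relations array.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReferenceRel:
    # Mirrors ReferenceRel.subtree_ordinal: an index into the top-level relations array.
    subtree_ordinal: int


@dataclass
class Plan:
    # Stand-in for Plan.relations; relation bodies are plain strings here.
    relations: List[str] = field(default_factory=list)


@dataclass
class ExtendedExpression:
    # There is no relations field today (None). This PR proposes adding one.
    relations: Optional[List[str]] = None


def resolve(ref: ReferenceRel, container) -> str:
    """Hypothetical helper: look up the relation a ReferenceRel points to."""
    relations = getattr(container, "relations", None)
    if not relations:
        raise LookupError("container has no relations array to resolve against")
    if not 0 <= ref.subtree_ordinal < len(relations):
        raise LookupError(f"subtree_ordinal {ref.subtree_ordinal} is out of range")
    return relations[ref.subtree_ordinal]


plan = Plan(relations=["cte_t", "root"])
print(resolve(ReferenceRel(0), plan))  # cte_t

try:
    resolve(ReferenceRel(0), ExtendedExpression())
except LookupError as err:
    print("unresolvable:", err)
```

With a relations list on the ExtendedExpression stand-in, the same lookup succeeds, which is all the proposed field is meant to enable.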

@drin
Member

drin commented Jun 19, 2025

see an example.

I think I can follow that the scalar_function builder returns a "resolver" and that the resolver is used to set an expression field. I do not see where the ExtendedExpression references a Rel. Also, in the example output, I don't see an extended_expression or an expression_reference.

The intention is to enable the use of CTEs in the context of ExtendedExpression similar to how it can now be used in the context of a Plan.

Naively, I think an ExtendedExpression can refer to a PlanRel directly. I haven't figured out why it embeds a Rel directly within itself.

Edit: I now see that on the substrait website ExtendedExpression is "...provided for expression-level protocols as an alternative to using a Plan."

@tokoko
Contributor Author

tokoko commented Jun 19, 2025

I think I can follow that the scalar_function builder returns a "resolver" and that the resolver is used to set an expression field. I do not see where the ExtendedExpression references a Rel. Also, in the example output, I don't see an extended_expression or an expression_reference.

@drin "resolver" functions are basically closures of type UnboundExtendedExpression that return ExtendedExpression objects. In this example they are meant to be used as an intermediate step only. The actual goal is to set the necessary Expression types in FilterRel, but builders pass around ExtendedExpressions instead of plain Expressions because that lets them communicate additional context as well (for example, the mapping between extension functions and their anchors). That's why they don't show up in the final output. The filter builder function extracts the inner expression and discards the actual ExtendedExpression object.
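
A rough sketch of that pattern, for readers who have not seen the builders. Everything here is a simplified stand-in: the dataclasses are not the real protobuf classes, the signature of `UnboundExtendedExpression` is an assumption, and `scalar_function` and `filter_rel` only imitate the shape of the substrait-python builders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExtendedExpression:
    # Stand-in: the inner expression plus the extra context builders pass along.
    expression: str
    function_anchors: Dict[str, int] = field(default_factory=dict)


@dataclass
class FilterRel:
    input: str
    condition: str


# The name comes from the comment above; this signature is an assumption.
UnboundExtendedExpression = Callable[[List[str]], ExtendedExpression]


def scalar_function(name: str, *columns: str) -> UnboundExtendedExpression:
    """Return a resolver: a closure that builds the expression once a schema is known."""
    def resolve(base_schema: List[str]) -> ExtendedExpression:
        missing = [c for c in columns if c not in base_schema]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return ExtendedExpression(
            expression=f"{name}({', '.join(columns)})",
            function_anchors={name: 1},  # illustrative anchor value
        )
    return resolve


def filter_rel(input_name: str, schema: List[str],
               condition: UnboundExtendedExpression) -> FilterRel:
    """Resolve the condition, keep only the inner expression, drop the wrapper.

    A real builder would also carry context such as function anchors into the
    plan before discarding the wrapper; that step is omitted in this sketch.
    """
    bound = condition(schema)
    return FilterRel(input=input_name, condition=bound.expression)


rel = filter_rel("t", ["a", "b"], scalar_function("gt", "a", "b"))
print(rel)
```

The resulting FilterRel holds only the plain expression, which is why no extended_expression or expression_reference appears in the final output.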

@vbarua
Member

vbarua commented Jul 11, 2025

Circling back to this as we talked about this a bit at the last Substrait Sync.

The intention is to enable the use of CTEs in the context of ExtendedExpression similar to how it can now be used in the context of a Plan.

ExtendedExpressions were intended as sugar that targets expression-only evaluation. Once you start introducing CTEs into the mix, the guidance at the moment would be to use a full Plan relation instead. If we added relation handling to ExtendedExpressions, there wouldn't be much of a difference between a Plan and an ExtendedExpression really.

In terms of your use case:

My immediate use case might be a bit unusual. I'm using ExtendedExpression messages as necessary intermediary steps in substrait-python builders while incrementally building up plans.

This is something that we would push back on. Specifically, making modifications to the serialization definitions to accommodate library internals. Most of the libraries end up implementing a native data layer, like the expressions and relations classes in substrait-java, and the interface and structs in substrait-go to make manipulations easier.

@tokoko
Contributor Author

tokoko commented Jul 11, 2025

Specifically, making modifications to the serialization definitions to accommodate library internals.

I don't disagree actually 😄 although to be fair I never claimed that my immediate use case justified the change, it was simply what prompted me to open a PR. (it's also why I didn't mention it in the PR description)

ExtendedExpressions were intended as sugar that targets expression-only evaluation

This sounds very ambiguous to me, what does "expression-only" mean in this context? Subqueries are expressions after all, aren't they? I just listened to the relevant part of the meeting recording and tbh it's still fuzzy to me where one would draw the line for the original intent of extended expressions. Are subqueries containing Rels okay? Or is it only subqueries containing ReferenceRels specifically that violate the intent? Whatever the answer, it feels like an arbitrary line in the sand that's bound to be violated one way or another anyway. I still don't get what the issue is with simply allowing all valid expressions.

If we added relation handling to ExtendedExpressions, there wouldn't be much of a difference between a Plan and an ExtendedExpression really.

I wholeheartedly disagree on this one. Of course I can't speak to the original intent, but the way I see it the primary distinction between a Plan and an ExtendedExpression is that the latter can hold the expressions w/o specifying the full context other than the base_schema.

  • Plan with a ProjectRel at the root is not a good replacement since ProjectRel needs to have an input relation defined while extended expressions have no concept of an input relation (just an input schema). I think the closest thing one can come up with to imitate extended expressions would be to have a Plan with a single ReadRel that has a VirtualTable read_type to define the expressions as the first Nested.Struct message in the expressions array... And even then you're still stuck with all the additional fields like filter and best_effort_filter that make no sense at all.
  • Another feature that's exclusive to extended expressions that can't be easily replicated elsewhere is ExpressionReference message that allows one to define either an Expression or an AggregateFunction. You can't really do this within a Plan message at the moment.

@vbarua
Member

vbarua commented Jul 11, 2025

although to be fair I never claimed that my immediate use case justified the change, it was simply what prompted me to open a PR.

Fair fair 😅

Of course I can't speak to the original intent

Thinking about it, I'm not super wedded to the original intent. Use cases evolve, and I'm happy for Substrait to change with them.

This sounds very ambiguous to me, what does "expression-only" mean in this context?

I agree that it's ambiguous, and we don't have a good line for this. The original intent may have been to not permit subqueries, but maybe that isn't a good restriction.

I am a little strapped for time right now, so I can't fully digest your argument, but I am open to changing my mind on this.

@jacques-n
Contributor

This kind of makes ExtendedExpression feel like it is colliding with Plan.

I wonder if there is a way instead to deprecate ExtendedExpression and introduce a way to just add Expression to Plan. One option would be PlanRel oneof adds Expression. Would love people's thoughts on that.

@tokoko
Contributor Author

tokoko commented Jul 16, 2025

I agree that it makes a lot of sense. I've previously tried to map out what it would look like. It's a bit more involved than simply adding an Expression to PlanRel, but definitely doable.

  1. one would need to add ExpressionReference rather than a simple Expression. ExpressionReference allows both AggregateFunction messages and simple Expression messages, plus you need a field for a field name sort of similar to how we use RelRoot instead of a Rel.
  2. unlike RelRoot, you'd also need to provide a base_schema.
  3. If the goal is to mimic current ExtendedExpression, you'd also need to allow referencing multiple expressions in a single plan. Some sort of a parent message type (ExpressionRoot) would be required to solve both of these issues.
  4. this is minor, but if feat: add anchor for non-ordinal plan reference #726 is merged, you'd also end up with an anchor for expression reference that is basically redundant. tbf, same is true for RelRoot case as well, but at least in the case of a RelRoot I suppose it might prove to be necessary when recursive CTE support is introduced.

In short, I think we will end up with something like this:

message ExpressionReference {
  oneof expr_type {
    Expression expression = 1;
    AggregateFunction measure = 2;
  }
  // Field names in depth-first order
  repeated string output_names = 3;
}

message ExpressionRoot {
  repeated ExpressionReference referred_expr = 1;
  NamedStruct base_schema = 2;
}

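message PlanRel {
  // Anchor from #726; uint32 is assumed here, matching other Substrait anchors
  uint32 planrel_anchor = 4;
  oneof rel_type {
    // Any relation (used for references and CTEs)
    Rel rel = 1;
    // The root of a relation tree
    RelRoot root = 2;
    // A set of expressions over a base schema (field name is illustrative)
    ExpressionRoot expression_root = 3;
  }
}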

There are probably better ways to structure a combined Plan (for example, extracting the relations into a separate array and leaving a single oneof that refers to either a RelRoot or an ExpressionRoot), but that would be extremely backwards-incompatible 😄
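
To show how a consumer might dispatch on the PlanRel layout proposed above, here is a small Python sketch. The dataclasses are simplified stand-ins for the proposed messages (strings replace Expression and NamedStruct), and `describe` is a hypothetical helper, not part of any Substrait library.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Rel:
    name: str


@dataclass
class RelRoot:
    input: Rel
    names: List[str]


@dataclass
class ExpressionReference:
    expression: str  # stand-in for the Expression / AggregateFunction oneof
    output_names: List[str]


@dataclass
class ExpressionRoot:
    referred_expr: List[ExpressionReference]
    base_schema: List[str]  # stand-in for NamedStruct


@dataclass
class PlanRel:
    # Models the proposed oneof: exactly one variant per PlanRel.
    rel_type: Union[Rel, RelRoot, ExpressionRoot]
    planrel_anchor: Optional[int] = None


def describe(plan_rel: PlanRel) -> str:
    """Hypothetical consumer: branch on which oneof variant is set."""
    kind = plan_rel.rel_type
    if isinstance(kind, ExpressionRoot):
        names = [n for ref in kind.referred_expr for n in ref.output_names]
        return f"expressions {names} over schema {kind.base_schema}"
    if isinstance(kind, RelRoot):
        return f"relation root producing {kind.names}"
    return f"referenceable relation {kind.name}"


relations = [
    PlanRel(Rel("cte_t"), planrel_anchor=1),
    PlanRel(ExpressionRoot(
        referred_expr=[ExpressionReference("x IN ref(1)", ["is_member"])],
        base_schema=["x"],
    )),
]
for plan_rel in relations:
    print(describe(plan_rel))
```

In this layout the CTE and the expression root live in the same relations list, so the expression can refer to the CTE without a separate top-level message.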


4 participants