Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the rest of Set Operators: INTERSECT, EXCEPT, etc #1082

Closed
4 tasks done
xudong963 opened this issue Oct 8, 2021 · 17 comments · Fixed by #1135, #1258, #1259 or #1261
Closed
4 tasks done

Implement the rest of Set Operators: INTERSECT, EXCEPT, etc #1082

xudong963 opened this issue Oct 8, 2021 · 17 comments · Fixed by #1135, #1258, #1259 or #1261
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@xudong963
Copy link
Member

xudong963 commented Oct 8, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

  • INTERSECT
  • INTERSECT ALL
  • EXCEPT
  • EXCEPT ALL

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@xudong963 xudong963 added the enhancement New feature or request label Oct 8, 2021
@xudong963
Copy link
Member Author

xudong963 commented Oct 8, 2021

Please assign it to me, thanks! @houqp

@houqp houqp changed the title Implement the rest Set Operators: INTERSECT & MINUS Implement the rest of Set Operators: INTERSECT & MINUS Oct 8, 2021
@xudong963 xudong963 changed the title Implement the rest of Set Operators: INTERSECT & MINUS Implement the rest of Set Operators Oct 8, 2021
@xudong963
Copy link
Member Author

xudong963 commented Oct 13, 2021

First of all, I'll implement INTERSECT
My direct idea is to rewrite it to left-semi Join

such as

SELECT a1 FROM t1 INTERSECT SELECT a2 FROM t2;

to

SELECT DISTINCT a1 FROM t1 LEFT SEMI JOIN t2 ON a1 is not distinct from a2

How do you think about it? @houqp @alamb @Dandandan

If it's ok, I think I should implement is distinct from firstly. https://www.postgresql.org/docs/14/functions-comparison.html

@alamb
Copy link
Contributor

alamb commented Oct 13, 2021

I am not sure is distinct from is needed

Trying it out with postgres:

alamb=# select * from foo;
 x 
---
 1
  
 2
(3 rows)

alamb=# select * from bar;
 x 
---
 2
  
(2 rows)

alamb=# select * from foo intersect select * from bar;
 x 
---
  
 2
(2 rows)

And if bar does not have any nulls:

alamb=# select * from foo;
 x 
---
 1
  
 2
(3 rows)

alamb=# select * from bar;
 x 
---
 2
(1 row)

alamb=# select * from foo intersect select * from bar;
 x 
---
 2
(1 row)

Which I believe is the same semantics as semi-join using an equality predicate (though now that I type this, perhaps since NULL = NULL is not true, then semi join with equality isn't correct. 🤔

@xudong963
Copy link
Member Author

xudong963 commented Oct 13, 2021

perhaps since NULL = NULL is not true, then semi join with equality isn't correct. 🤔

Yes.

postgres=# select * from bar;
 a | b
---+---
 1 | 2
   | 3
 3 | 4
(3 rows)

postgres=# select * from foo;
 a | b
---+---
 1 | 2
   | 3
(2 rows)


-- intersect

select a from foo intersect select a from bar;

/**
 a
---

 1
(2 rows)
**/

-- equivalent transformation ? (x)  since NULL = NULL is not true
select distinct a from foo where a in (select a from bar);

/**
 a
---
 1
(1 row)
**/

psql doesn't seem to have direct SEMI JOIN. It uses IN to do the same thing as SEMI JOIN. So how to use is not distinct from in select distinct a from foo where a in (select a from bar);?

@alamb
Copy link
Contributor

alamb commented Oct 13, 2021

Creating / using is distinct from makes sense to me 👍 ). Good thinking @xudong963

@xudong963
Copy link
Member Author

I'm still surveying set operators in cockroachdb.
I won't rush to write code until I figure out the best and most detailed solution.

@Dandandan
Copy link
Contributor

Dandandan commented Oct 13, 2021

psql doesn't seem to have direct SEMI JOIN. It uses IN to do the same thing as SEMI JOIN. So how to use is distinct from in select distinct a from foo where a in (select a from bar);?

We also don't have the semi / anti join SQL syntax, but we support semi and anti join in a LogicalPlan and support it in the hash join nodes / implementation. We don't need it at the syntax level to support operators like INTERSECT, as long as we can map the input query to a logical plan / physical plan.

I didn't have a look at using select distinct + semi join for INTERSECT but if it's equivalent that totally makes sense to me.
Complication with using things like is distinct from will not supported by a equi-join implementation (which only supports A=B type of expressions), so I'm not sure if this is an efficient / easy way to implement it.

I think supporting the is distinct from operator is useful in itself.

@alamb
Copy link
Contributor

alamb commented Oct 13, 2021

I remember other systems that basically had special equality checking in semi joins to handle the null stuff -- in fact when using semi join for x IN (select ..) type queries, the is distinct from operator might required to get correct results in the presence of nulls, but I can't remember

@Dandandan
Copy link
Contributor

It looks like PostgreSQL has some explicit operator (HashSetOp) for these kind of things:

> explain select * from demo intersect select * from demo;

HashSetOp Intersect  (cost=0.00..30.40 rows=160 width=458)
  ->  Append  (cost=0.00..28.00 rows=320 width=458)
        ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..13.20 rows=160 width=458)
              ->  Seq Scan on demo  (cost=0.00..11.60 rows=160 width=454)
        ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..13.20 rows=160 width=458)
              ->  Seq Scan on demo demo_1  (cost=0.00..11.60 rows=160 width=454)

@xudong963
Copy link
Member Author

It looks like PostgreSQL has some explicit operator (HashSetOp) for these kind of things:

Yes, I have seen the pg code which uses the HashSetOp for INTERSECT and EXCEPT

 * In SETOP_HASHED mode, the input is delivered in no particular order,
 * except that we know all the tuples from one input relation will come before
 * all the tuples of the other.  The planner guarantees that the first input
 * relation is the left-hand one for EXCEPT, and tries to make the smaller
 * input relation come first for INTERSECT.  We build a hash table in memory
 * with one entry for each group of identical tuples, and count the number of
 * tuples in the group from each relation.  After seeing all the input, we
 * scan the hashtable and generate the correct output using those counts.
 * We can avoid making hashtable entries for any tuples appearing only in the
 * second input relation, since they cannot result in any output.
 *
 * This node type is not used for UNION or UNION ALL, since those can be
 * implemented more cheaply (there's no need for the junk attribute to
 * identify the source relation).

@alamb
Copy link
Contributor

alamb commented Oct 14, 2021

I can see the potential benefit of implementing some special physical operator to implement INTERSECT / EXCEPT -- it might be potentially faster than the general purpose join operator

However, that being said, using a general purpose SEMI join as you have suggested @xudong963 would definitely be my preference to keep the DF code base simpler, unless there is some compelling performance measurement / need for a faster implementation

@xudong963
Copy link
Member Author

xudong963 commented Oct 15, 2021

However, that being said, using a general purpose SEMI join as you have suggested @xudong963 would definitely be my preference to keep the DF code base simpler, unless there is some compelling performance measurement / need for a faster implementation

@alamb Me too. I also saw the implementation of TIDB which just transfers it to SEMI Join.

func (b *PlanBuilder) buildIntersect(ctx context.Context, selects []ast.Node) (LogicalPlan, *ast.SetOprType, error) {
	var leftPlan LogicalPlan
	var err error
	var afterSetOperator *ast.SetOprType
	switch x := selects[0].(type) {
	case *ast.SelectStmt:
		afterSetOperator = x.AfterSetOperator
		leftPlan, err = b.buildSelect(ctx, x)
	case *ast.SetOprSelectList:
		afterSetOperator = x.AfterSetOperator
		leftPlan, err = b.buildSetOpr(ctx, &ast.SetOprStmt{SelectList: x})
	}
	if err != nil {
		return nil, nil, err
	}
	if len(selects) == 1 {
		return leftPlan, afterSetOperator, nil
	}

	columnNums := leftPlan.Schema().Len()
	for i := 1; i < len(selects); i++ {
		var rightPlan LogicalPlan
		switch x := selects[i].(type) {
		case *ast.SelectStmt:
			if *x.AfterSetOperator == ast.IntersectAll {
				// TODO: support intersect all
				return nil, nil, errors.Errorf("TiDB do not support intersect all")
			}
			rightPlan, err = b.buildSelect(ctx, x)
		case *ast.SetOprSelectList:
			if *x.AfterSetOperator == ast.IntersectAll {
				// TODO: support intersect all
				return nil, nil, errors.Errorf("TiDB do not support intersect all")
			}
			rightPlan, err = b.buildSetOpr(ctx, &ast.SetOprStmt{SelectList: x})
		}
		if err != nil {
			return nil, nil, err
		}
		if rightPlan.Schema().Len() != columnNums {
			return nil, nil, ErrWrongNumberOfColumnsInSelect.GenWithStackByArgs()
		}
		leftPlan, err = b.buildSemiJoinForSetOperator(leftPlan, rightPlan, SemiJoin)
		if err != nil {
			return nil, nil, err
		}
	}
	return leftPlan, afterSetOperator, nil
}

A nice weekend is coming, It's time for me to concentrate on Datafusion. PR about the INTERSECTION feature will appear soon. 😄

@alamb
Copy link
Contributor

alamb commented Oct 15, 2021

@Dandandan has made #1117 to add is distinct from

@alamb alamb changed the title Implement the rest of Set Operators Implement the rest of Set Operators: INTERSECT, EXCEPT, etc Oct 26, 2021
@xudong963
Copy link
Member Author

After #1135 merged, I believe the issue will be closed quickly, because on the basis of #1135, EXCEPT can be implemented by ANTI JOIN with null_equal_null.

@xudong963
Copy link
Member Author

The issue needs to reopen, there are other things to do, such as Expect support. @alamb

@alamb alamb reopened this Nov 6, 2021
@alamb
Copy link
Contributor

alamb commented Nov 6, 2021

Reopening as github API got a little too excited

@xudong963
Copy link
Member Author

xudong963 commented Nov 6, 2021

All related PRs have finished, after merging, the issue can be closed. Thanks again for your help! @alamb @Dandandan @houqp

@houqp houqp added this to the 6.0.0 milestone Nov 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment