
TPC-H test suite added #136

Merged 27 commits on Jan 29, 2021

Conversation

@octaviansima (Collaborator) commented Jan 26, 2021

This PR refactors the single existing TPC-H query to perform operations using standard SQL rather than directly with DataFrames. This was done to expedite the process of increasing TPC-H coverage, since all of the queries are already available in SQL. It also creates a new folder to hold all queries from https://github.com/apache/spark/tree/master/sql/core/src/test/resources/tpch
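
A minimal sketch of the SQL-driven flow this PR moves to, assuming the temp-view registration and the performQuery function named in the commit history (the resource path and exact signatures here are illustrative, not the repo's actual layout):

import org.apache.spark.sql.{DataFrame, SQLContext}
import scala.io.Source

// Each TPC-H table is assumed to be registered as a temp view (e.g. "lineitem").
// performQuery then just loads the stock q<N>.sql text and hands it to Spark SQL.
def performQuery(sqlContext: SQLContext, queryNumber: Int): DataFrame = {
  // OPAQUE_HOME comes from the commit messages; the subpath is hypothetical.
  val queryFile = s"${sys.env("OPAQUE_HOME")}/src/test/resources/tpch/q$queryNumber.sql"
  val source = Source.fromFile(queryFile)
  val queryText = try source.mkString finally source.close()
  sqlContext.sql(queryText)
}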

@wzheng (Collaborator) left a comment

Looks good, just a couple of comments!

Comment on lines 157 to 179
def this(
    sqlContext: SQLContext,
    size: String,
    numPartitions: Int,
) = {
  this(sqlContext, size, numPartitions,
    Map("part" -> TPCHDataFrames.part(sqlContext, Insecure, size, numPartitions),
      "supplier" -> TPCHDataFrames.supplier(sqlContext, Insecure, size, numPartitions),
      "lineitem" -> TPCHDataFrames.lineitem(sqlContext, Insecure, size, numPartitions),
      "partsupp" -> TPCHDataFrames.partsupp(sqlContext, Insecure, size, numPartitions),
      "orders" -> TPCHDataFrames.orders(sqlContext, Insecure, size, numPartitions),
      "nation" -> TPCHDataFrames.nation(sqlContext, Insecure, size, numPartitions)),
    Map("part" -> TPCHDataFrames.part(sqlContext, Encrypted, size, numPartitions),
      "supplier" -> TPCHDataFrames.supplier(sqlContext, Encrypted, size, numPartitions),
      "lineitem" -> TPCHDataFrames.lineitem(sqlContext, Encrypted, size, numPartitions),
      "partsupp" -> TPCHDataFrames.partsupp(sqlContext, Encrypted, size, numPartitions),
      "orders" -> TPCHDataFrames.orders(sqlContext, Encrypted, size, numPartitions),
      "nation" -> TPCHDataFrames.nation(sqlContext, Encrypted, size, numPartitions)))
  ensureCached()
}
wzheng (Collaborator):

The way that this secondary constructor is written is a bit confusing. I wonder if you can have an init() to set up this mapping and ensure that the DataFrames are cached when TPCHTests is called.

Comment on lines 157 to 179
wzheng (Collaborator):

This constructor setup is currently a bit confusing to read. Is it possible to put this in a separate setup function, which can then be called by TPCHTests before the tests are run?

override def numPartitions: Int = 1
override val spark = SparkSession.builder()
  .master("local[1]")
  .appName("QEDSuite")
wzheng (Collaborator):

Small nit: I think this appName can be changed to something like TPCHSuite. The same goes for OpaqueOperatorTests. This naming seems to have been carried over from the very first test suite we had (QEDSuite).

@wzheng (Collaborator) left a comment

Just a few more comments!

Comment on lines 31 to 34
override def beforeAll(): Unit = {
  super.beforeAll()
  tpch.ensureCached()
}
wzheng (Collaborator):

This is great, thanks! Can you also call generateMap in here? Or put both ensureCached and generateMap in an init function.

octaviansima (Collaborator, Author):

I can't, because generateMap returns the actual map structure that the constructor needs. One alternative would be to do something like val nameToDF = TPCHDataFrames.generateMap(sqlContext, Insecure, size, numPartitions) outside of the auxiliary this constructor. The other is to set nameToDF to an arbitrary initialized value, then call an init function that sets it to the correct value while also calling ensureCached.

wzheng (Collaborator):

The current code is just a bit confusing, since the default constructor is not meant to be used directly. I think val nameToDF = TPCHDataFrames.generateMap(sqlContext, Insecure, size, numPartitions) is good. Perhaps another way is to use a companion object and define apply methods.

octaviansima (Collaborator, Author):

I went with the apply method because that enables generateMap and ensureCached to be called in the same spot as a sort of init function. Thanks for the tip!
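
For reference, a minimal sketch of what that companion-object apply might look like, assuming the TPCH class, TPCHDataFrames.generateMap, the Insecure/Encrypted modes, and ensureCached from this PR (exact signatures may differ):

import org.apache.spark.sql.SQLContext

object TPCH {
  def apply(sqlContext: SQLContext, size: String, numPartitions: Int): TPCH = {
    val tpch = new TPCH(sqlContext, size, numPartitions,
      TPCHDataFrames.generateMap(sqlContext, Insecure, size, numPartitions),
      TPCHDataFrames.generateMap(sqlContext, Encrypted, size, numPartitions))
    tpch.ensureCached() // the "init" step: generate the maps and cache once, up front
    tpch
  }
}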

override def numPartitions: Int = 3
override val spark = SparkSession.builder()
  .master("local[1]")
  .appName("QEDSuite")
wzheng (Collaborator):

Can you change the app name here as well?

.config("spark.sql.shuffle.partitions", numPartitions)
.getOrCreate()

override def tpch = new TPCH(spark.sqlContext, size, numPartitions)
wzheng (Collaborator):

It seems like creating a new TPCH object for each test will create the DataFrames twice. Can you load the tables only once for both the single-partition and multi-partition tests?
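
One hedged way to avoid the double load (a sketch, not necessarily what the PR settled on): make tpch a lazy val, so each suite constructs its TPCH, and therefore loads the underlying DataFrames, exactly once rather than on every access:

// def builds a new TPCH (and reloads tables) per call; lazy val builds one.
override lazy val tpch: TPCH = TPCH(spark.sqlContext, size, numPartitions)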

@@ -911,7 +831,7 @@ trait OpaqueOperatorTests extends FunSuite with BeforeAndAfterAll { self =>

 }

-class OpaqueSinglePartitionSuite extends OpaqueOperatorTests {
+class OpaqueOperatorSinglePartitionSuite extends OpaqueOperatorTests {
   override val spark = SparkSession.builder()
     .master("local[1]")
     .appName("QEDSuite")
wzheng (Collaborator):

Can you modify the name here?

@@ -921,7 +841,7 @@ class OpaqueSinglePartitionSuite extends OpaqueOperatorTests {
   override def numPartitions: Int = 1
 }

-class OpaqueMultiplePartitionSuite extends OpaqueOperatorTests {
+class OpaqueOperatorMultiplePartitionSuite extends OpaqueOperatorTests {
   override val spark = SparkSession.builder()
     .master("local[1]")
     .appName("QEDSuite")
wzheng (Collaborator):

Modify the name here too?

@octaviansima changed the title from "TPC-H #9 Refactor" to "TPC-H test suite added" on Jan 28, 2021
@wzheng merged commit 0a20d71 into mc2-project:master on Jan 29, 2021
@wzheng (Collaborator) commented Jan 29, 2021

Thanks!

andrewlawhh added a commit that referenced this pull request Mar 15, 2021
andrewlawhh added a commit that referenced this pull request Mar 15, 2021
andrewlawhh added a commit that referenced this pull request Apr 2, 2021
andrewlawhh added a commit that referenced this pull request Apr 2, 2021
andrewlawhh added a commit that referenced this pull request Apr 2, 2021
* Support for multiple branched CaseWhen

* Interval (#116)

* add date_add, interval sql still running into issues

* Add Interval SQL support

* uncomment the other tests

* resolve comments

* change interval equality

Co-authored-by: Eric Feng <[email protected]>

* Remove partition ID argument from enclaves

* Fix comments

* updates

* Modifications to integrate crumb, log-mac, and all-outputs_mac, wip

* Store log mac after each output buffer, add all-outputs-mac to each encryptedblocks wip

* Add all_outputs_mac to all EncryptedBlocks once all log_macs have been generated

* Almost builds

* cpp builds

* Use ubyte for all_outputs_mac

* use Mac for all_outputs_mac

* Hopefully this works for flatbuffers all_outputs_mac mutation, cpp builds

* Scala builds now too, running into error with union

* Stuff builds, error with all outputs mac serialization. this commit uses all_outputs_mac as Mac table

* Fixed bug, basic encryption / show works

* All single partition tests pass, multiple partition passes until tpch-9

* All tests pass except tpch-9 and skew join

* comment tpch back in

* Check same number of ecalls per partition - exception for scanCollectLastPrimary(?)

* First attempt at constructing executed DAG

* Fix typos

* Rework graph

* Add log macs to graph nodes

* Construct expected DAG and refactor JobNode.
Refactor construction of executed DAG.

* Implement 'paths to sink' for a DAG

* add crumb for last ecall

* Fix NULL handling for aggregation (#130)

* Modify COUNT and SUM to correctly handle NULL values

* Change average to support NULL values

* Fix

* Changing operator matching from logical to physical (#129)

* WIP

* Fix

* Unapply change

* Aggregation rewrite (#132)

* updated build/sbt file (#135)

* Travis update (#137)

* update breeze (#138)

* TPC-H test suite added (#136)

* added tpch sql files

* functions updated to save temp view

* main function skeleton done

* load and clear done

* fix clear

* performQuery done

* import cleanup, use OPAQUE_HOME

* TPC-H 9 refactored to use SQL rather than DF operations

* removed : Unit, unused imports

* added TestUtils.scala

* moved all common initialization to TestUtils

* update name

* begin rewriting TPCH.scala to store persistent tables

* invalid table name error

* TPCH conversion to class started

* compiles

* added second case, cleared up names

* added TPC-H 6 to check that persistent state has no issues

* added functions for the last two tables

* addressed most logic changes

* DataFrame only loaded once

* apply method in companion object

* full test suite added

* added testFunc parameter to testAgainstSpark

* ignore #18

* Separate IN PR (#124)

* finishing the in expression. adding more tests and null support. need confirmation on null behavior and also I wonder why integer field is sufficient for string

* adding additional test

* adding additional test

* saving concat implementation and it's passing basic functionality tests

* adding type aware comparison and better error message for IN operator

* adding null checking for the concat operator and adding one additional test

* cleaning up IN&Concat PR

* deleting concat and prepping the in branch for in pr

* fixing null behavior

now it's only null when there's no match and there's null input

* Build failed

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>

* Merge new aggregate

* Uncomment log_mac_lst clear

* Clean up comments

* Separate Concat PR  (#125)

Implementation of the CONCAT expression.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>

* Clean up comments in other files

* Update pathsEqual to be less conservative

* Remove print statements from unit tests

* Removed calls to toSet in TPC-H tests (#140)

* removed calls to toSet

* added calls to toSet back where queries are unordered

* Documentation update (#148)

* Cluster Remote Attestation Fix (#146)

The existing code only had RA working when run locally. This PR adds a sleep for 5 seconds to make sure that all executors are spun up successfully before attestation begins.

Closes #147

* upgrade to 3.0.1 (#144)

* Update two TPC-H queries (#149)

Tests for TPC-H 12 and 19 pass.

* TPC-H 20 Fix (#142)

* string to stringtype error

* tpch 20 passes

* cleanup

* implemented changes

* decimal.tofloat

Co-authored-by: Wenting Zheng <[email protected]>

* Add expected operator DAG generation from executedPlan string

* Rebase

* Join update (#145)

* Merge join update

* Integrate new join

* Add expected operator for sortexec

* Merge comp-integrity with join update

* Remove some print statements

* Migrate from Travis CI to Github Actions (#156)

* Upgrade to OE 0.12 (#153)

* Update README.md

* Support for scalar subquery (#157)

This PR implements the scalar subquery expression, which is triggered whenever a subquery returns a scalar value. There were two main problems that needed to be solved.

First, support for matching the scalar subquery expression is necessary. Spark implements this by wrapping a SparkPlan within the expression and calling executeCollect; it then constructs a literal with that value. However, this is problematic for us because that value should not be decrypted by the driver and serialized into an expression, since it's an intermediate value.

Therefore, the second issue to be addressed here is supporting an encrypted literal. This is implemented in this PR by serializing an encrypted ciphertext into a base64-encoded string and wrapping a Decrypt expression on top of it. This expression is then evaluated in the enclave and returns a literal. Note that, in order to test our implementation, we also implement a Decrypt expression in Scala. However, this should never be evaluated on the driver side and serialized into a plaintext literal, because Decrypt is designated as a Nondeterministic expression and therefore will always be evaluated on the workers.
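
A hedged, self-contained illustration of the encrypted-literal mechanics described above (not the repo's code): the scalar result travels as a base64-encoded ciphertext, and only the side holding the key, in Opaque's case the enclave, can turn it back into a value. Plain AES stands in here for the enclave-backed cipher:

import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

val key = new SecretKeySpec("0123456789abcdef".getBytes("UTF-8"), "AES") // toy key
def encryptLiteral(plaintext: Array[Byte]): String = {
  val c = Cipher.getInstance("AES/ECB/PKCS5Padding")
  c.init(Cipher.ENCRYPT_MODE, key)
  Base64.getEncoder.encodeToString(c.doFinal(plaintext)) // ship this string in the plan
}
def decryptLiteral(b64: String): Array[Byte] = {
  val c = Cipher.getInstance("AES/ECB/PKCS5Padding")
  c.init(Cipher.DECRYPT_MODE, key)
  c.doFinal(Base64.getDecoder.decode(b64)) // done enclave-side in the real system
}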

* Add TPC-H Benchmarks (#139)

* logic decoupling in TPCH.scala for easier benchmarking

* added TPCHBenchmark.scala

* Benchmark.scala rewrite

* done adding all supported TPC-H query benchmarks

* changed commandline arguments that benchmark takes

* TPCHBenchmark takes in parameters

* fixed issue with spark conf

* size error handling, --help flag

* add Utils.force, break cluster mode

* comment out logistic regression benchmark

* ensureCached right before temp view created/replaced

* upgrade to 3.0.1

* upgrade to 3.0.1

* 10 scale factor

* persistData

* almost done refactor

* more cleanup

* compiles

* 9 passes

* cleanup

* collect instead of force, sf_none

* remove sf_none

* defaultParallelism

* no removing trailing/leading whitespace

* add sf_med

* hdfs works in local case

* cleanup, added new CLI argument

* added newly supported tpch queries

* function for running all supported tests

* Construct expected DAG from dataframe physical plan

* Refactor collect and add integrity checking helper function to OpaqueOperatorTest

* Float expressions (#160)

This PR adds float normalization expressions [implemented in Spark](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala#L170). TPC-H query 2 also passes.
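
For intuition, a sketch of what "float normalization" means here, mirroring the Spark rule linked above (illustrative, not the repo's code): collapse -0.0 into 0.0 and all NaN bit patterns into the canonical NaN, so grouping and comparisons treat them uniformly:

def normalizeDouble(d: Double): Double =
  if (d.isNaN) Double.NaN   // canonicalize every NaN bit pattern
  else if (d == -0.0d) 0.0d // -0.0 == 0.0 in IEEE terms, so +0.0 maps to itself
  else d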

* Broadcast Nested Loop Join - Left Anti and Left Semi  (#159)

This PR is the first of two parts towards making TPC-H 16 work: the other will be implementing `is_distinct` for aggregate operations.

`BroadcastNestedLoopJoin` is Spark's "catch all" for non-equi joins. It works by first picking a side to broadcast, then iterating through every possible row combination and checking the non-equi condition against the pair.
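
A minimal sketch of that semantics over plain collections (illustrative; the enclave implementation differs): broadcast one side, check every row pair against the condition, and for LeftSemi keep a left row if any right row matches, for LeftAnti if none do:

def bnljLeftSemiAnti[L, R](
    left: Seq[L], broadcastRight: Seq[R],
    condition: (L, R) => Boolean, anti: Boolean): Seq[L] =
  left.filter { l =>
    val matched = broadcastRight.exists(r => condition(l, r))
    if (anti) !matched else matched // semi keeps matches, anti keeps non-matches
  }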

* Move join condition handling for equi-joins into enclave code (#164)

* Add in TPC-H 21

* Add condition processing in enclave code

* Code clean up

* Enable query 18

* WIP

* Local tests pass

* Apply suggestions from code review

Co-authored-by: octaviansima <[email protected]>

* WIP

* Address comments

* q21.sql

Co-authored-by: octaviansima <[email protected]>

* Distinct aggregation support (#163)

* matching in strategies.scala

set up class thing

cleanup

added test cases for non-equi left anti join

rename to serializeEquiJoinExpression

added isEncrypted condition

set up keys

JoinExpr now has condition

rename

serialization does not throw compile error for BNLJ

split up

added condition in ExpressionEvaluation.h

zipPartitions

cpp put in place

typo

added func to header

two loops in place

update tests

condition

fixed scala loop

interchange rows

added tags

ensure cached

== match working

comparison decoupling in ExpressionEvaluation

save

compiles and condition works

is printing

fix swap outer/inner

o_i_match

show() has the same result

tests pass

test cleanup

added test cases for different condition

BuildLeft works

optional keys in scala

started C++

passes the operator tests

comments, cleanup

attempting to do it the ~right~ way

comments to distinguish between primary/secondary, operator tests pass

cleanup comments, about to begin implementation for distinct agg ops

is_distinct

added test case

serializing with isDistinct

is_distinct in ExpressionEvaluation.h

removed unused code from join implementation

remove RowWriter/Reader in condition evaluation (join)

easier test

serialization done

correct checking in Scala

set is set up

spaghetti but it finally works

function for clearing values

condition_eval instead of condition

goto

comment

remove explain from test, need to fix distinct aggregation for >1 partitions

started impl of multiple partitions fix

added rangepartitionexec that runs

partitioning cleanup

serialization properly

comments, generalization for > 1 distinct function

comments

about to refactor into logical.Aggregation

the new case has distinct in result expressions

need to match on distinct

removed new case (doesn't make difference?)

works

match

remove RangePartitionExec

inefficient implementation refined

complete instead of partial -> final

removed traces of join

cleanup

* added test case for one distinct one non, reverted comment

* removed C++ level implementation of is_distinct

* PartialMerge in operators.scala

* stage 1: grouping with distinct expressions

* stage 2: WIP

* saving, sorting by group expressions ++ name distinct expressions worked

* stage 1 & 2 printing the expected results

* removed extraneous call to sorted, #3 in place but not working

* stage 3 has the final, correct result: refactoring the Aggregate code to not cast aggregate expressions to Partial, PartialMerge, etc will be needed

* refactor done, C++ still printing the correct values

* need to formalize None case in EncryptedAggregateExec.output, but stage 4 passes

* distinct and indistinct passes (git add -u)

* general cleanup, None case looks nicer

* throw error with >1 distinct, add test case for global distinct

* no need for global aggregation case

* single partition passes all aggregate tests, multiple partition doesn't

* works with global sort first

* works with non-global sort first

* cleanup

* cleanup tests

* removed iostream, other nit

* added test case for 13

* None case in isPartial match done properly

* added test cases for sumDistinct

* case-specific namedDistinctExpressions working

* distinct sum is done

* removed comments

* got rid of mode argument

* tests include null values

* partition followed by local sort instead of first global sort
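
A hedged sketch of the staged rewrite those commits describe, in plain Spark terms (the PR implements the equivalent inside Opaque's encrypted operators): first group on (grouping keys ++ distinct expressions) to de-duplicate, then aggregate the de-duplicated rows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
import spark.implicits._
val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("g", "x")
val stage1 = df.groupBy($"g", $"x").count()        // stage 1: de-dupe on (g, x)
val stage2 = stage1.groupBy($"g").agg(count($"x")) // stage 2: COUNT(DISTINCT x) per g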

* Remove addExpectedOperator from JobVerificationEngine, add comments

* Implement expected DAG construction by doing graph manipulation on dataframe field instead of string parsing

* Fix merge errors in the test cases

Co-authored-by: Andrew Law <[email protected]>
Co-authored-by: Eric Feng <[email protected]>
Co-authored-by: Eric Feng <[email protected]>
Co-authored-by: Chester Leung <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>
Co-authored-by: octaviansima <[email protected]>
Co-authored-by: Chenyu Shi <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>