-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a MutableMap when accessing keySet and values in SparkMap #58
Conversation
df4a6e7
to
2395082
Compare
@@ -31,10 +31,14 @@ case class SparkMap(private var _mapData: MapData, | |||
} | |||
|
|||
override def keySet(): util.Set[StdData] = { | |||
if (_mutableMap == null) { | |||
_mutableMap = createMutableMap() | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a mutable map on the read path seems excessive. I see that _map.keyArray()
exposes getter methods. Can we build an iterator on top of this method ourselves? I think we would need to just maintain a current index inside the iterator and then increment it on next()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is definitely excessive in this scenario, but I don't think there will be keySet/values calls without some get(key) call which would create a mutable map anyways.
I can make the change to make another Iterator though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change to the iterator. It's a cleaner solution, thanks for suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Looks much better. Consider reducing duplicated code? Something like this?
override def keySet(): util.Set[StdData] = {
val keysIterator: Iterator[Any] = if (_mutableMap == null) {
new Iterator[Any] {
private val keyArray = _mapData.keyArray()
var offset : Int = 0
override def next(): Any = {
offset += 1
keyArray.get(offset -1, _keyType)
}
override def hasNext: Boolean = offset < SparkMap.this.size()
}
} else {
_mutableMap.keysIterator
}
new util.AbstractSet[StdData] {
override def iterator(): util.Iterator[StdData] = new util.Iterator[StdData] {
override def next(): StdData = SparkWrapper.createStdData(keysIterator.next(), _keyType)
override def hasNext: Boolean = keysIterator.hasNext
}
override def size(): Int = SparkMap.this.size()
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acked and updated.
case array: mutable.WrappedArray[Any] if expectedOutputData != null => | ||
val expectedArray = array.sortWith( | ||
(l, r) => r == null || (l != null && l.toString < r.toString)) | ||
val actualArray = result.get(0).asInstanceOf[mutable.WrappedArray[Any]].sortWith( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right fix. Arrays are ordered and so this check should be ordered. I understand that in this specific case, the output you are trying to test is that of the map_keys
, map_values
functions which are unordered. IMO, an ideal fix would be to pass the output of the function call through an array_sort
UDF in the test case for those specific UDFs.
If we create the iterator ourselves instead of using a mutable map as mentioned in my comment above, do we still need this in the scope of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should not be the fix, but I thought it might be okay to bandaid it since the unit tests don't seem to be evolving for now.
In theory, doing something like Map.get() then Map.keySet() would break the unit test ordering since a mutable map would be created which uses a sorted set as its key collection (at least from what it seems).
I think there should be some checkUnordered method in the StdTester, but it is not necessary for this RB.
If we create our own iterator without the mutable map, this change is not necessary.
2395082
to
71ac147
Compare
Create this change to fix UnsafeArrayData bug
71ac147
to
eb496db
Compare
I don't have authorization to merge this branch; could you merge it on my behalf Shardul? |
@raymondlam12 Thanks for the PR! A new version should be published in the next couple of hours. |
* Disable Jacoco for platform tests (#37) * 0.0.46 release (previous 0.0.45) + release notes updated [ci skip] * Fixed presto UDF patch broken link (#41) Co-authored-by: Sushant Raikar <[email protected]> * FileSystemUtils: remove an unreliable check for unit testing (#42) Why: we don't need it What changed: instead of checking configuration for hints on whether we are unit testing, we trust the URI's protocol Tests performed: ./gradlew build * 0.0.47 release (previous 0.0.46) + release notes updated [ci skip] * Presto: Pass custom configuration object when using FileSystemUtils (#43) * 0.0.48 release (previous 0.0.47) + release notes updated [ci skip] * Presto: Make ScalarFunctionImplementation state independent of StdUdfWrapper (#44) * 0.0.49 release (previous 0.0.48) + release notes updated [ci skip] * Upgrade to PrestoSQL 333 (#45) Some major changes: - `SqlFunction`, `SqlScalarFunction` and `ScalarFunctionImplementation` have evolved in trinodb/trino#1764 - `Metadata::getScalarFunctionImplementation` evolved in trinodb/trino#1039 - Type signature parser was moved to presto-main in trinodb/trino#1738 * 0.0.50 release (previous 0.0.49) + release notes updated [ci skip] * Add support for StdFloat, StdDouble, and StdBinary (#46) * Introduce StdFloat, StdDouble, and StdBinary interfaces * Add implementations of those interfaces in Avro, Hive, Presto, Spark, and Generic type systems * Add examples of transport UDFs on those new types, and add tests for those UDFs * Update documentation * 0.0.51 release (previous 0.0.50) + release notes updated [ci skip] * Allow users to override main and test source set names, output directories * 0.0.52 release (previous 0.0.51) + release notes updated [ci skip] * Empty commit to release new version * Empty commit to release new version [ci skip-compare-publications] * 0.0.53 release (previous 0.0.52) + release notes updated [ci skip] * Plugin: Publish Presto thin jar which allows consumers to control dependency graph (#49) * 0.0.54 release (previous 0.0.53) + release notes updated [ci skip] * Hive: Struct data should not be converted to object array during StdStruct creation (#50) * Remove slf4j-log4j12 from Transport dependency graph (#51) * Bump shipkit (#54) * 0.0.55 release (previous 0.0.54) + release notes updated [ci skip] * Fix test SQL generation for binary inputs (#55) * 0.0.56 release (previous 0.0.55) + release notes updated [ci skip] * Spark: Create index-based iterator for non-mutable map keySet and values access (#58) * 0.0.57 release (previous 0.0.56) + release notes updated [ci skip] * Avro: Support simple union schemas (#60) * 0.0.58 release (previous 0.0.57) + release notes updated [ci skip] * Support conversion of String type to Utf8 in AvroWrapper (#61) * 0.0.59 release (previous 0.0.58) + release notes updated [ci skip] * Add Avro ENUM read support and fix String bug (#62) Co-authored-by: Raymond Lam <[email protected]> * 0.0.60 release (previous 0.0.59) + release notes updated [ci skip] * Build: Fail if there are checkstyle violations (#64) - Change the Checkstyle severity level from warning to error - Eliminate all existing checkstyle violations Co-authored-by: Carl Steinbach <[email protected]> * 0.0.61 release (previous 0.0.60) + release notes updated [ci skip] * Add travis-build.sh for pre-commit testing from command line (#65) * Add travis-build.sh for pre-commit testing from command line * Fix name of build file in comment Co-authored-by: Carl Steinbach <[email protected]> Co-authored-by: Shardul Mahadik <[email protected]> * Upgrade to Gradle 6.7 (#67) * Support builds with platform specific JDK (#69) * Bump Avro dependency to 1.10.2 (from 1.7.7). (#71) There doesn't seem to be any impact to the code. gradle build passes. * Migrate from PrestoSQL to Trino (#68) * Automate artifact publication to Maven Central (#72) * Update ci.yml java version to 8 (#77) skip release * Fix org.pentaho:pentaho-aggdesigner-algorithm sunset problem (#78) * Remove travis build in favor of github actions (#87) * Add scala_2.11 and scala_2.12 support (#85) * Update ci.yml to also build the udf-examples folder (#90) Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]> * Fix running multiple builds in run step in workflow action (#92) Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]> * A solution to fix running multiple UDFs in Spark issue (#93) Co-authored-by: Kai Xu <[email protected]> Co-authored-by: Shardul Mahadik <[email protected]> Co-authored-by: shipkit-org <[email protected]> Co-authored-by: Sushant Raikar <[email protected]> Co-authored-by: Sushant Raikar <[email protected]> Co-authored-by: Suren Nihalani <[email protected]> Co-authored-by: Xingyuan Lin <[email protected]> Co-authored-by: Khai Tran <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: Raymond <[email protected]> Co-authored-by: curtiscwang <[email protected]> Co-authored-by: Raymond Lam <[email protected]> Co-authored-by: Carl Steinbach <[email protected]> Co-authored-by: Carl Steinbach <[email protected]> Co-authored-by: Akshay Rai <[email protected]> Co-authored-by: Sreeram Ramachandran <[email protected]> Co-authored-by: Raymond Zhang <[email protected]> Co-authored-by: KAI XU <[email protected]> Co-authored-by: Sushant Raikar <[email protected]> Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]> Co-authored-by: Kai Xu <[email protected]>
Currently, accessing the keySet via: _mapData.keyArray().array.iterator or values via: _mapData.valueArray().array.iterator fails as its underlying type in Spark is UnsafeArrayData.
UnsafeArrayData does not allow accessing the raw array: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L108 .
Creating the mutable map solves this problem.