
Use optional third argument as edge attribute. #901

Closed · wants to merge 8 commits

Conversation

@npanj commented May 28, 2014

This is probably the easiest way to include an edge attribute. Unless you guys have been thinking about some other way? I also wanted to add one more argument for a default edge-attribute value; I'll probably do that in another patch.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor) commented May 28, 2014

This is definitely a good change.

@rxin (Contributor) commented May 28, 2014

Jenkins, test this please.

@rxin (Contributor) commented May 28, 2014

Upon further look, it might be better to let users define how to parse the third argument, so we can support arbitrary data types. What do you think?

@AmplabJenkins

Merged build triggered, started, and finished. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15258/

@npanj (Author) commented May 28, 2014

Allowing arbitrary data types sounds like a good idea. I actually tried something like this:


```scala
def edgeListFile[@specialized(Long, Int, Double) ED: ClassTag](
    sc: SparkContext,
    path: String,
    canonicalOrientation: Boolean = false,
    minEdgePartitions: Int = 1): Graph[Int, ED] =
```

and

```scala
if (lineArray.length >= 3) lineArray(2).asInstanceOf[ED]
else 1.asInstanceOf[ED]
```


But I am running into this error:


```
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
```


I guess I will have to handle primitive types separately? (I am quite new to Scala.) Are there any known patterns for handling this kind of casting? If you can point me to a code snippet within the Spark codebase (or elsewhere), that would be great.

Also, do you follow any Scala code style guide? If so, I'd like to follow it for future patches.

@ankurdave (Contributor)

Unlike Python (but like Java), Scala doesn't use `asInstanceOf` for arbitrary type conversions. In this case, `"123".asInstanceOf[Int]` won't work, because `asInstanceOf` is a runtime cast and Scala doesn't know how to do the conversion. There are two ways to resolve this:

  1. Have edgeListFile take a function that does the conversion:

```scala
def edgeListFile[ED: ClassTag](..., parser: String => ED) = {
  ... parser(lineArray(2)) ...
}
// Can be called like this:
GraphLoader.edgeListFile[Int](..., _.toInt)
GraphLoader.edgeListFile[String](..., identity)
```

    This is simple, but it doesn't facilitate reuse of parsing functions: you always have to specify how to parse an Int, even though it's usually the same everywhere. (A runnable sketch of this approach appears after this list.)

  2. Define a library of standard parsers and use Scala implicits to select the right one automatically if it exists:

```scala
// `Parseable[T]` is a type class [1]. If it is defined for Foo (meaning
// there is an `implicit val` of type Parseable[Foo] in scope), it means you
// can parse any String into a Foo.
trait Parseable[T] { def parse(s: String): T }

// Here are some parsers. The user can define their own by creating
// additional implicit vals.
implicit val IntParser = new Parseable[Int] { def parse(s: String) = s.toInt }
implicit val StringParser = new Parseable[String] { def parse(s: String) = s }

// Instead of an explicit function parameter, we use a context bound, which
// lets us access the parser for ED using `implicitly[Parseable[ED]]`.
def edgeListFile[ED: ClassTag : Parseable](...) = {
  ... implicitly[Parseable[ED]].parse(lineArray(2)) ...
}
// Can be called without passing anything extra:
GraphLoader.edgeListFile[Int](...)
GraphLoader.edgeListFile[String](...)
```

[1] http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html#context_bounds
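
To make these concrete, here is a minimal, self-contained sketch of approach 1 outside of Spark. The helper name `parseEdgeLine`, the whitespace-separated "src dst [attr]" line format, and the explicit default-value parameter are assumptions for illustration, not part of the GraphLoader API.

```scala
// Hypothetical sketch of approach 1: the caller supplies the parsing function.
// Assumes whitespace-separated "src dst [attr]" lines and a caller-supplied
// default for lines with no third column.
def parseEdgeLine[ED](line: String, parser: String => ED, default: ED): (Long, Long, ED) = {
  val fields = line.split("\\s+")
  val attr = if (fields.length >= 3) parser(fields(2)) else default
  (fields(0).toLong, fields(1).toLong, attr)
}

parseEdgeLine("1 2 7", _.toInt, 1)                // (1L, 2L, 7)
parseEdgeLine("3 4", _.toInt, 1)                  // (3L, 4L, 1), default used
parseEdgeLine("1 2 likes", identity[String], "")  // (1L, 2L, "likes")
```

And for approach 2, the extension point mentioned in the comments (users defining their own parsers) would look something like this hypothetical `Parseable[Double]` instance, assuming the trait above is in scope:

```scala
// Hypothetical user-defined parser: with this implicit val in scope, the
// Parseable[ED] context bound resolves automatically, so a call like
// GraphLoader.edgeListFile[Double](...) would pick it up.
implicit val DoubleParser: Parseable[Double] = new Parseable[Double] {
  def parse(s: String): Double = s.toDouble
}
```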

@ankurdave (Contributor)

ok to test

@AmplabJenkins

Merged build triggered, started, and finished. Refer to these links for build results:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15320/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15329/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15337/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15340/

@npanj (Author) commented Jun 2, 2014

@ankurdave

  • Thanks for the explanation of Parseable and the use of implicits. (I will try to implement it.)
  • About the storage level of the vertex attributes: yes, I was trying to see if it could be changed to handle memory pressure, but that didn't work. From your patch it seems this is more complicated than I initially thought. Is there a reason that patch was not submitted? I also believe some parts of it would apply to SPARK-1991.
    (Actually, I didn't mean to include the serialization changes in this pull request, but I guess everything from the forked branch gets pulled into the pull request? I haven't submitted patches on GitHub before, so I wasn't familiar with that.)

@pwendell (Contributor) commented Sep 2, 2014

@ankurdave @rxin could you guys come to a decision on this, one way or the other? Also, @npanj, mind adding [GRAPHX] to the title here? Right now this is getting sorted with the Spark core patches, which might be why people haven't seen it.

@ankurdave (Contributor)

@pwendell @npanj @rxin This isn't mergeable at the moment because it doesn't parse the edge attributes correctly. Since it's difficult to do this parsing in general, there are two choices for now:

  1. Modify the PR to accept only Int edge attributes (i.e., remove ED). This would help users who just want Int edge attributes, but it introduces a new special-case public API that we would need to support in the future. (A sketch of what this might look like appears below.)
  2. Close the PR for now, and for users who want edge attributes, encourage them to implement their own custom version of GraphLoader.

I'd suggest closing this for now.
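
For reference, here is a minimal sketch of what choice 1 might parse per line, assuming the same whitespace-separated format GraphLoader reads today; the helper name `parseIntEdge` and the default of 1 are illustrative assumptions, not code from this PR.

```scala
// Hypothetical Int-only variant (choice 1): parse an optional Int third
// column, defaulting to 1 when it is absent.
def parseIntEdge(line: String): (Long, Long, Int) = {
  val fields = line.split("\\s+")
  val attr = if (fields.length >= 3) fields(2).toInt else 1
  (fields(0).toLong, fields(1).toLong, attr)
}
```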

@SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@ankurdave (Contributor)

@npanj Mind closing this?

@npanj closed this Oct 9, 2014

@npanj (Author) commented Oct 9, 2014

Sorry, I was out of the loop for a while... closing it.

wangyum added a commit that referenced this pull request May 26, 2023
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same parquet version

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

### What changes were proposed in this pull request?

Parquet 1.12.0 New Feature
- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

### Why are the changes needed?

- Bloom filters to improve filter performance
- ZSTD enhancement

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

Create/write a Parquet-encrypted table:
```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

create table parquet_encryption(a int, b int, c int)
using parquet
options (
'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
'parquet.encryption.footer.key' 'footerKey');
```

Read a Parquet-encrypted table:

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

select ... from parquet_encryption ...
```

Will raise another PR for the default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>