Update dependencies doc #487

Merged · 9 commits · Jan 7, 2025
Conversation

kathweinschenkprophecy
Collaborator

@kathweinschenkprophecy kathweinschenkprophecy commented Dec 20, 2024


| Scope | The dependency is enabled at the Project level or the Pipeline level. |
| Type | The dependency comes from the Package Hub, Scala (Maven), or Python (PyPI). |
| Name | This identifies the dependency. |
| Version/Package/Coordinates | For Package Hub dependencies, input the package version. For Scala, use the Maven coordinates in the `groupId:artifactId:version` format. For example, use `org.postgresql:postgresql:42.3.3`. For Python, use the package name and the version number. |
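As an illustration of the `groupId:artifactId:version` coordinate format only, here is a minimal Python sketch of validating a coordinate string (this helper is hypothetical and not part of Prophecy or PBT):

```python
def parse_maven_coordinate(coord: str) -> tuple[str, str, str]:
    """Split a Maven coordinate of the form groupId:artifactId:version."""
    parts = coord.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
    return (parts[0], parts[1], parts[2])

# The example coordinate from the table above:
print(parse_maven_coordinate("org.postgresql:postgresql:42.3.3"))
# → ('org.postgresql', 'postgresql', '42.3.3')
```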
Collaborator Author


I don't see the option to choose "Package Hub" when installing a Pipeline dependency. Can you provide a reason why?

Contributor


You may be using a project that has Prophecy-managed Git. I believe that may not allow you to use Package Hub. We can discuss more over chat.

Collaborator Author


I still don't see it on a project using external Git. Though it doesn't really matter for these docs at the moment... we can return to the question.

@kathweinschenkprophecy
Collaborator Author

To do:

  • Add screenshots
  • Clean up old screenshots
  • Validate Livy/EMR dependency limitation
  • Libs redirect

docs/Spark/extensibility/dependencies.md
Contributor

@neontty left a comment


Just realized that there is one more section we need to add:

Scala Dependencies in PySpark: how Python projects track Scala dependencies

We need to note how that gets tracked in pbt_project.yaml and add a hyperlink to the PBT page (needs to be written) which describes the option to build Python WHLs with a dummy pom.xml file... for now we can just hyperlink to the PBT main page.

I can provide more details on this tomorrow

@neontty
Contributor

neontty commented Jan 6, 2025

additional info for new section:

When deploying pipelines in the WHL format, we must consider dependencies in both Python and Scala. The Scala jars are used by Spark applications in the underlying JVM (even in PySpark applications). The WHL format inherently records Python dependencies; however, there is no industry standard for WHL files to specify non-Python dependencies (jar files). As a result, we have enhanced PBT to record the Scala dependency information and store it in the WHL file.

Recently we made an improvement to prophecy-build-tool that allows you to generate any Scala dependencies of PySpark pipelines as a pom.xml and a flat file called MAVEN_COORDINATES.

Just run pbt once on your project like so:

`pbt build-v2 --add-pom-xml-python --path .`

and it will generate those files and add them to the WHL under `{package_name}-1.0.data/data/`.
(Note: set the environment variable SPARK_VERSION to a valid Spark version that ends in a 0 (3.3.0, 3.4.0, or 3.5.0) corresponding to the Spark version you plan to use in the execution environment. Otherwise PBT has no context to generate the Spark-version part of the Maven coordinate, and the placeholder string {{REPLACE_ME}} will be used instead.)
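Since a WHL is just a zip archive, you can check whether the generated files landed in the wheel. A minimal sketch (this helper is hypothetical, not part of PBT; it only inspects entry names):

```python
import zipfile

def scala_dep_entries(whl_path: str) -> list[str]:
    """Return entries in the wheel that look like the PBT-generated
    Scala-dependency files (pom.xml and MAVEN_COORDINATES)."""
    with zipfile.ZipFile(whl_path) as whl:
        return [
            name for name in whl.namelist()
            if name.endswith(("pom.xml", "MAVEN_COORDINATES"))
        ]
```

For example, after running the build above you would expect entries like `{package_name}-1.0.data/data/pom.xml` in the returned list.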

@neontty
Contributor

neontty commented Jan 7, 2025

@kathweinschenkprophecy , I made a few changes; mostly pointing out that this is only necessary if the user did not create a Job in the Prophecy UI and is deploying the WHL file manually (not using pbt deploy or pbt deploy-v2) .

Please go ahead and correct anything if I made any bad suggestions.

@kathweinschenkprophecy kathweinschenkprophecy merged commit fd1ec7d into main Jan 7, 2025
1 check passed
@kathweinschenkprophecy kathweinschenkprophecy deleted the update-dependencies-info branch January 7, 2025 21:24