
Fix Flink Integration Tests #114

Merged 1 commit on Nov 6, 2023

Conversation

@ranchodeluxe (Collaborator) commented Oct 23, 2023

Addresses: #111

  • Adds a build matrix for python/beam/recipe versions, like the Dataflow integration tests
  • Moves to a MinIO service in the k3s cluster, because nothing could connect to the MinIO daemon on the host
  • Keeps the spirit of the existing tests, tweaks them, and gets them asserting correctly on MinIO output
  • Adds validation for Bake.container_image, plus a unit test

@ranchodeluxe ranchodeluxe changed the base branch from main to unpin-beam October 23, 2023 14:09
codecov bot commented Oct 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (23ec9a0) 96.01% vs. head (1d6198f) 96.06%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #114      +/-   ##
==========================================
+ Coverage   96.01%   96.06%   +0.05%     
==========================================
  Files          14       14              
  Lines         452      458       +6     
==========================================
+ Hits          434      440       +6     
  Misses         18       18              
Files Coverage Δ
pangeo_forge_runner/bakery/flink.py 93.75% <ø> (ø)
pangeo_forge_runner/commands/bake.py 90.82% <100.00%> (+0.53%) ⬆️


setup.py (outdated)
@@ -19,7 +19,7 @@
         "pangeo-forge-recipes>=0.9.2",
         "escapism",
         "traitlets",
-        "apache-beam[gcp]",
+        "apache-beam[gcp]==2.47.0",
Member:
You're probably already tracking this (and have made above edit for convenience), but ultimately I think we'll want to specify this version pin in either:

Collaborator Author:

yeah, for convenience right now

Member:

Cool cool, figured 😎

@ranchodeluxe (Collaborator Author) commented Oct 25, 2023

@cisaacstern: my hunch was that your suggestion above wouldn't work, based on what I've been seeing over the last week. And it did break the passing tests.

The apache-beam version designated by pangeo-forge-runner determines which job-server jar is downloaded and uploaded to the Flink server. Note below that it's uploading 2.51.0 even though I have my recipe pinned to apache-beam==2.47.0:

(screenshot: Flink logs showing the 2.51.0 job-server jar being uploaded)

So there's currently a gross tight coupling between producer and consumer beam versions that has to be coordinated in the following places:

  1. producer: pangeo-forge-runner setup.py
  2. consumer: the container_image

Let me recover the actual logs from the Flink run to determine what the issue is, but I've seen it complain about incompatible versions before, and that's my guess about what's happening here.

Maybe @yuvipanda's suggestion here is that we try to run Flink without uploading the job-server jar? Maybe I can investigate that.
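To make the coupling concrete: the client's installed apache-beam version selects which Flink job-server jar is fetched and uploaded. A sketch of that resolution, assuming Beam's published Maven layout for job-server artifacts (the function and its defaults are illustrative, not the actual flink.py code):

```python
import re


def job_server_jar_url(beam_version: str, flink_major_minor: str = "1.16") -> str:
    # Illustrative only: Beam publishes a per-version Flink job-server jar
    # to Maven Central. Because the *client's* installed apache-beam version
    # picks this jar, a container pinned to a different beam version ends up
    # talking to a mismatched job server.
    assert re.fullmatch(r"\d+\.\d+\.\d+", beam_version)
    major, minor = flink_major_minor.split(".")[:2]
    artifact = f"beam-runners-flink-{major}.{minor}-job-server"
    return (
        "https://repo.maven.apache.org/maven2/org/apache/beam/"
        f"{artifact}/{beam_version}/{artifact}-{beam_version}.jar"
    )
```

Under this assumption, a client with apache-beam 2.51.0 installed would upload the 2.51.0 jar regardless of the recipe's pin, which matches the log above.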

Member:

Thanks for all your awesome work here @ranchodeluxe.

For expedience to get this all merged, maybe we remove beam from the top-level dependencies and do:

setup(
    ...,
    extras_require={
        "dataflow": ["apache-beam[gcp]"],
        "flink": ["apache-beam==2.47.0"],
    },
)

?

Any thoughts on that approach @yuvipanda or @ranchodeluxe ?

@ranchodeluxe (Collaborator Author) commented Oct 25, 2023

For expedience to get this all merged, maybe we remove beam from the top-level dependencies and do:

Yes, @cisaacstern, this makes sense to me, but I'd do "flink": ["apache-beam>=2.47.0"] to convey that it should work from that version forward; then I can have the integration tests use a build matrix for those different versions of apache-beam.

Trying things out now for different versions of apache-beam 🤞

Member:

Note below it's uploading 2.51.0 even though I have my recipe pinned to apache-beam==2.47.0

Hmm, actually, now that I think about it, this is probably just because the integration test is not installing the recipe's requirements.txt in the producer/client/deployer environment. We have a long-standing issue related to that (#27), but currently it's just up to the deployer to make sure that happens on the client side. I've hacked in a solution for that in the GitHub Action here.

Member:

Or I should say, in the GCP Dataflow context, the beam version for the worker container is inferred from the client environment, but maybe that assumption doesn't hold for Flink.

@ranchodeluxe (Collaborator Author) commented Oct 26, 2023

the beam version for the worker container is inferred from the client

Yeah, this is what I'd like to happen. And it's good for me to review the differences between the runners.

Currently Bake.container_image defaults to "", which our docstrings say should force Beam to figure out which worker SDK image to use. This might work for GCP Dataflow, but for Flink we override our container definition (among other things) so the pod uses that specific image.

So using the default "" for Flink breaks, and we are forced to supply one (this is a bug in my queue of things to do that needs a validation fix).

is not installing the recipe's requirements.txt in the producer/client/deployer environment?

This is curious, and you are right: the apache-beam client/pipeline stages everything from the recipe's requirements.txt except the apache-beam packages, even if they exist in there 🤔

I assume it does this for a smart reason, because the job-server jar already knows what versions it's dealing with, but this is where my understanding of the architecture breaks down. I haven't run GCP Dataflow, but I assume it doesn't need to upload the job-server jar?

@ranchodeluxe ranchodeluxe added the test-flink label (Add this label to PRs to trigger the Flink integration test.) Oct 24, 2023
@ranchodeluxe ranchodeluxe changed the base branch from unpin-beam to main October 24, 2023 17:53
@ranchodeluxe ranchodeluxe changed the title WIP: fix integration test on unpin-beam branch WIP: fix flink integration test Oct 24, 2023
@ranchodeluxe ranchodeluxe force-pushed the gcorradini/unpin-beam-fix-integration-test branch 2 times, most recently from 024ef92 to 6c28988 Compare October 25, 2023 02:39
@ranchodeluxe (Collaborator Author):

We're close to being done here

(screenshot: test run in progress)

@cisaacstern (Member):

Wow wow wow

@ranchodeluxe (Collaborator Author) commented Oct 26, 2023

We're close to being done here

@cisaacstern: one of the last considerations that might need some of your input before you do a review is:

Should I be setting up tags for flink-specific versions (e.g. 10.3.0-flink) like you are doing for dataflow? I imagine yes?

@ranchodeluxe (Collaborator Author) commented Oct 26, 2023

Just put in some quick PRs for 0.10.x stuff, because we should do this like Dataflow (but I don't have access to add tags): https://github.com/pforgetest/gpcp-from-gcs-feedstock/pulls

@ranchodeluxe ranchodeluxe changed the title WIP: fix flink integration test Fix Flink Integration Tests Oct 26, 2023
@ranchodeluxe ranchodeluxe force-pushed the gcorradini/unpin-beam-fix-integration-test branch from b315111 to 17cb8a3 Compare October 27, 2023 00:00
@ranchodeluxe ranchodeluxe added the test-dataflow Add this label to PRs to trigger Dataflow integration test. label Oct 27, 2023
@ranchodeluxe (Collaborator Author) commented Oct 27, 2023

All the things are passing 💯

All that is left is adding tags to the main https://github.com/pforgetest/gpcp-from-gcs-feedstock/ repo for the integration tests and swapping out the --repo arg to point at it instead of my fork.

@ranchodeluxe ranchodeluxe force-pushed the gcorradini/unpin-beam-fix-integration-test branch from 28aa29d to ab4a95e Compare October 27, 2023 14:39
@cisaacstern (Member) left a comment

Oh goodness I wrote such a long review which somehow got lost before posting. How demoralizing! Anyway... here goes again... 😅

What a heroic effort, @ranchodeluxe, I am floored! This is a truly impressive and impactful contribution.

My main feedback relates to the question you asked

Should I be setting up tags for flink-specific versions (e.g. 10.3.0-flink) like you are doing for dataflow? I imagine yes?

Actually, I would suggest we not do this. Instead, let's just add s3fs to the requirements.txt included in the existing tags. The reason is twofold:

  1. Fewer test recipes to maintain
  2. We really want recipes to be runner-agnostic, and we are served in that aim if we use the exact same recipes to test each of the runners

The cost of achieving these goals is relatively small: namely, the requirements.txt will just have one extra dependency which is not used in certain deployment settings, but that's a small price to pay in pursuit of the above objectives IMO.

I've just promoted you to Owner on pforgetest so you can do that.

(Once we've got this PR merged, it would be interesting to take a step back and discuss the best testing strategy here, i.e. is this pattern of using pforgetest the best option, etc., but of course let's not let that sidetrack us from getting all this merged first!)

I don't have time for a full line-by-line review today, but can do that early next week. Just wanted to keep things rolling by starting with that feedback about testing tags.

@ranchodeluxe ranchodeluxe force-pushed the gcorradini/unpin-beam-fix-integration-test branch from 6f4cbcb to 920bfd2 Compare October 28, 2023 13:11
@ranchodeluxe (Collaborator Author) commented Oct 28, 2023

Should I be setting up tags for flink-specific versions (e.g. 10.3.0-flink) like you are doing for dataflow? I imagine yes?

Actually, I would suggest we not do this. Instead, let's just add s3fs to the requirements.txt included in the existing tags. The reason is twofold:

1. Fewer test recipes to maintain

2. We really want recipes to be runner-agnostic, and we are served in that aim if we use the exact same recipes to test each of the runners

Didn't see your comment before I did my last push. That all sounds fine to me. Let me remove the tags I just added and delete those branches; I'll clean things up later this weekend.

@ranchodeluxe ranchodeluxe force-pushed the gcorradini/unpin-beam-fix-integration-test branch from 920bfd2 to 1d6198f Compare October 28, 2023 17:44

Note that some runners (like the local one) may not support this!
""",
)

@validate("container_image")
def _validate_container_image(self, proposal):
Member:

Nice!
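The rule that validator enforces can be sketched standalone. The real change uses a traitlets @validate hook on the Bake command; this plain-Python version (function name, argument names, and error message are all illustrative) just shows the check being added:

```python
def validate_container_image(bakery_class: str, container_image: str) -> str:
    """Reject an empty container_image when targeting Flink.

    Illustrative sketch: Beam can infer a worker SDK image for Dataflow,
    but the Flink bakery overrides the pod's container definition, so the
    default "" breaks and must be rejected up front.
    """
    if bakery_class == "flink" and not container_image:
        raise ValueError(
            "container_image must be set explicitly when using the Flink bakery"
        )
    return container_image
```

Failing fast here turns a confusing mid-deployment Flink error into an immediate, explainable configuration error.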


- name: Set up min.io as a k3s service
  run: |
    MYACCESSKEY=$(openssl rand -hex 16)
Member:

I'm learning a lot reading this PR 😄

Comment on lines +105 to +109
def pytest_addoption(parser):
parser.addoption("--flinkversion", action="store", default="1.16")
parser.addoption("--pythonversion", action="store", default="3.9")
parser.addoption("--beamversion", action="store", default="2.47.0")

Member:

Another pattern that I haven't seen before, I like it!

@cisaacstern (Member) left a comment

This is awesome, @ranchodeluxe! Thank you so much.

So it looks like, in the end, the changes required to get Flink working were pretty minimal: just changing the Flink version to 1.16 and making sure the right container_image was passed?

Not to minimize how hard it was to find that out!

The testing here is a work of art, very thorough, and very readable!

Labels
test-dataflow Add this label to PRs to trigger the Dataflow integration test. test-flink Add this label to PRs to trigger the Flink integration test.
3 participants