Fix Flink Integration Tests #114
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #114      +/-   ##
==========================================
+ Coverage   96.01%   96.06%   +0.05%
==========================================
  Files          14       14
  Lines         452      458       +6
==========================================
+ Hits          434      440       +6
  Misses         18       18
```

☔ View full report in Codecov by Sentry.
setup.py (outdated)

```diff
@@ -19,7 +19,7 @@
     "pangeo-forge-recipes>=0.9.2",
     "escapism",
     "traitlets",
-    "apache-beam[gcp]",
+    "apache-beam[gcp]==2.47.0",
```
You're probably already tracking this (and have made the above edit for convenience), but ultimately I think we'll want to specify this version pin in either:
- https://github.com/ranchodeluxe/gpcp-from-gcs-feedstock/blob/main/feedstock/requirements.txt
- or possibly a Flink-specific set of optional dependencies?
yeah, for convenience right now
Cool cool, figured 😎
@cisaacstern: my hunch was that your suggestion above wouldn't work, based on what I've been seeing over the last week. And it broke the passing tests.

The `apache-beam` version designated by `pangeo-forge-runner` determines which job-server jar is downloaded and uploaded to the Flink server. Note below that it's uploading 2.51.0 even though I have my recipe pinned to `apache-beam==2.47.0`:
[Screenshot (2023-10-25): Flink run log showing the 2.51.0 job-server jar being uploaded]
So there's currently a gross tight coupling between producer and consumer beam versions that has to be coordinated in the following places:

- producer: `pangeo-forge-runner`'s setup.py
- consumer: the container_image

Let me recover the actual logs from the Flink run to determine what the issue is, but I've seen it complain about incompatible versions before, and my guess is that's what's happening here.

Maybe @yuvipanda's suggestion here is that we try to run Flink without uploading the job-server jar? Maybe I can investigate that.
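(Editorial sketch, to make the coupling concrete: Beam derives the Flink job-server jar name from the SDK version *installed in the client/deployer environment*, following its published Maven coordinates (`org.apache.beam:beam-runners-flink-<flink>-job-server`). The snippet below is illustrative, not pangeo-forge-runner's actual code.)

```python
# Illustrative only: the jar tracks the installed client SDK version, so a
# recipe-level pin to 2.47.0 has no effect if the deployer env has 2.51.0.
from apache_beam.version import __version__ as beam_version

flink_version = "1.16"  # the Flink cluster's major.minor version

# Assumed artifact naming, per Beam's published Maven coordinates:
jar_name = f"beam-runners-flink-{flink_version}-job-server-{beam_version}.jar"
print(jar_name)  # e.g. beam-runners-flink-1.16-job-server-2.51.0.jar
```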
Nevermind, this repository is archived: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/master/examples/beam/with_job_server
Thanks for all your awesome work here @ranchodeluxe.
For expedience to get this all merged, maybe we remove beam from the top-level dependencies and do:
```python
setup(
    ...,
    extras_require={
        "dataflow": ["apache-beam[gcp]"],
        "flink": ["apache-beam==2.47.0"],
    },
)
```

?

Any thoughts on that approach, @yuvipanda or @ranchodeluxe?
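(For illustration, assuming standard setuptools extras behavior: a deployer would then select the runner-specific pin at install time with `pip install pangeo-forge-runner[dataflow]` or `pip install pangeo-forge-runner[flink]`, so neither environment drags in the other's constraint.)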
> For expedience to get this all merged, maybe we remove beam from the top-level dependencies and do:

Yes, @cisaacstern, this makes sense to me at least, but I'd do `"flink": ["apache-beam>=2.47.0"]` to convey that it should work from that version forward, and then I can have the integration tests use a build matrix for those different versions of apache-beam.
Try things out now for different versions of apache-beam 🤞
> Note below it's uploading 2.51.0 even though I have my recipe pinned to apache-beam==2.47.0

Hmm, actually, now that I think about it, this is probably just because the integration test is not installing the recipe's requirements.txt in the producer/client/deployer environment? We have a long-standing issue related to that (#27), but currently it's just up to the deployer to make sure that happens on the client side. I've hacked in a solution for that in the GitHub Action here.
Or I should say, in the GCP Dataflow context, the beam version for the worker container is inferred from the client environment, but maybe that assumption doesn't hold for Flink.
> the beam version for the worker container is inferred from the client

Yeah, this is what I'd like to happen. And it's good for me to review how this differs between the runners.

Currently `Bake.container_image` defaults to `""`, which our docstrings say should force Beam to figure out which worker SDK image to use. This might work for GCP Dataflow, but for Flink we override our container definition (among other things) for the pod to use that specific image. So using the default `""` for Flink breaks, and we are forced to supply one (this is a bug I have listed that needs a validation fix in my queue of things to do).
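(Editorial sketch of what such a validation fix could look like with traitlets. This mirrors the `@validate("container_image")` hook visible in the diff later in this thread, but the surrounding names and the Flink check here are assumptions, not the merged code.)

```python
from traitlets import TraitError, Unicode, validate
from traitlets.config import Configurable

class Bake(Configurable):
    # "" means "let Beam infer the worker SDK image" -- fine for Dataflow,
    # but (per the discussion above) not workable for Flink.
    container_image = Unicode("", config=True)

    # Hypothetical flag: something must tell Bake it is deploying to Flink.
    flink_deployment = False

    @validate("container_image")
    def _validate_container_image(self, proposal):
        if self.flink_deployment and not proposal["value"]:
            raise TraitError(
                "container_image must be set explicitly when deploying to Flink"
            )
        return proposal["value"]
```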
> is not installing the recipe's requirements.txt in the producer/client/deployer environment?

This is curious, and you are right: the apache-beam client/pipeline stages everything from the recipe's requirements.txt except the `apache-beam` packages, even if they exist in there 🤔

I assume it's doing this for a smart reason, because the job-server jar already knows what versions it's dealing with, but this is where my understanding of the architecture breaks down. I haven't run GCP Dataflow, but I assume it doesn't need to upload the job-server jar?
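(For context, a sketch of how a recipe's requirements file is typically handed to the Beam client, using Beam's real `SetupOptions.requirements_file` pipeline option; the path is a placeholder, and the observation that `apache-beam` entries are skipped during staging comes from the discussion above, not from anything this snippet enforces.)

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Beam's stager downloads and stages the packages listed here for the workers.
# Per the observation above, apache-beam entries in this file are not staged;
# the SDK/job-server versions are fixed by the client environment instead.
options.view_as(SetupOptions).requirements_file = "feedstock/requirements.txt"
```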
Force-pushed the unpin-beam branch from 024ef92 to 6c28988 (compare).
Wow wow wow
@cisaacstern: one of the last considerations that might need some of your input before you do a review is: should I be setting up tags for flink-specific versions (e.g. 10.3.0-flink) like you are doing for dataflow? I imagine yes?
Just put in some quick PRs for …
Force-pushed from b315111 to 17cb8a3 (compare).
All the things are passing 💯 All that is left is adding tags to the main https://github.com/pforgetest/gpcp-from-gcs-feedstock/ for the integration tests and swapping out the …
Force-pushed from 28aa29d to ab4a95e (compare).
Oh goodness I wrote such a long review which somehow got lost before posting. How demoralizing! Anyway... here goes again... 😅
What a heroic effort, @ranchodeluxe, I am floored! This is a truly impressive and impactful contribution.
My main feedback relates to the question you asked:

> Should I be setting up tags for flink-specific versions (e.g. 10.3.0-flink) like you are doing for dataflow? I imagine yes?
Actually, I would suggest we not do this. Instead, let's just add s3fs to the requirements.txt included in the existing tags. The reason is twofold:

- Fewer test recipes to maintain
- We really want recipes to be runner-agnostic, and we are served in that aim if we use the exact same recipes to test each of the runners

The cost of achieving these goals is relatively small: the requirements.txt will just have one extra dependency which is not used in certain deployment settings, but that's a small price to pay in pursuit of the above objectives, IMO.
I've just promoted you to Owner on pforgetest so you can do that.
(Once we've got this PR merged, it would be interesting to take a step back and discuss the best testing strategy here, i.e. whether this pattern of using pforgetest is the best option, etc., but of course let's not let that sidetrack us from getting all this merged first!)
I don't have time for a full line-by-line review today, but can do that early next week. Just wanted to keep things rolling by starting with that feedback about testing tags.
Force-pushed from 6f4cbcb to 920bfd2 (compare).
Didn't see your comment before I did my last push. That all sounds fine by me. Let me remove the tags I just added and delete those branches. Will clean things up later this weekend.
Force-pushed from 920bfd2 to 1d6198f (compare).
```python
        Note that some runners (like the local one) may not support this!
        """,
    )

    @validate("container_image")
    def _validate_container_image(self, proposal):
```
Nice!
```yaml
- name: Set up min.io as a k3s service
  run: |
    MYACCESSKEY=$(openssl rand -hex 16)
```
I'm learning a lot reading this PR 😄
```python
def pytest_addoption(parser):
    parser.addoption("--flinkversion", action="store", default="1.16")
    parser.addoption("--pythonversion", action="store", default="3.9")
    parser.addoption("--beamversion", action="store", default="2.47.0")
```
Another pattern that I haven't seen before, I like it!
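(A small companion sketch, an editorial illustration rather than code from this PR, of how such options are typically consumed in tests; this is what makes the beam-version build matrix mentioned earlier possible.)

```python
import pytest

@pytest.fixture
def beam_version(request):
    # Reads the value passed on the CLI, e.g. `pytest --beamversion=2.51.0`,
    # falling back to the default registered in pytest_addoption.
    return request.config.getoption("--beamversion")

def test_reports_beam_version(beam_version):
    assert beam_version  # each matrix entry runs the same suite with its pin
```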
This is awesome, @ranchodeluxe! Thank you so much.
So it looks like, in the end, the changes required to get Flink working were pretty minimal... just changing the Flink version to 1.16 and making sure the right container_image was passed?
Not to minimize how hard it was to find that out!
The testing here is a work of art, very thorough, and very readable!
Addresses: #111

Bake.container_image and unit test