
Add NeptuneHook and operators #32738

Closed
wants to merge 9 commits

Conversation

pateash
Contributor

@pateash pateash commented Jul 21, 2023

closes: #28289


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@pateash pateash marked this pull request as draft July 21, 2023 08:04
@eladkal eladkal changed the title AWS Neptune: Provider Add NeptuneHook and operators Jul 21, 2023
@eladkal eladkal requested review from ferruzzi and vincbeck July 21, 2023 08:50
pateash and others added 9 commits July 23, 2023 20:20
* Add Executors discovery and documentation

The Executors can now be added via providers. This PR adds
mechanism of discovering the executors via Providers Manager,
exposing them via CLI and documenting in core-extensions.

* Update scripts/in_container/verify_providers.py
We would like to use the config.yml approach as our main source of truth
for airflow configuration. So far it has been split amongst multiple
files:

  * config.yml -> description
  * default_airflow.cfg -> JINJA template to keep configuration, examples
    and description (used to generate airflow.cfg when airflow starts)
  * default_test.cfg -> storing test-only configuration used in some
    test cases - used to generate unittest.cfg
  * scripts/in_container/airflow_ci.cfg where CI-specific configuration
    overwrote the unittest.cfg

This change consolidates it all into a unified approach, where all
configuration information is retrieved from .yml files stored in
config_templates. No more additional template files processed by
JINJA, no more special CI versions of it, no more unittestdb.cfg file
where such configuration would be generated, no more unittestdb to
be used separately for tests.

* The default_*.cfg files were not real configuration files, because
  they were really JINJA templates, which confused people when
  copying the files. This change leaves the file empty with the
  comment that instructs the user how they can get the default
  configuration.
* The default_airflow.cfg is gone and instead, we have a way to
  show defaults via `airflow config list --defaults` command
* Unittest config is generated on the fly using defaults stored
  in config_templates, constituting a single place where they need
  to be changed for the tests to use them
* internally, description of the configuration is stored in
  AirflowConfigurationParser and used by the parser to generate
  the default configuration when needed.
* we got rid of `{{{{` in templated config defaults by marking
  the templates with ``is_template`` and getting rid of processing
  those entries with regular formatting when generating the default
  values. This only concerns defaults from config.yml. Reading
  those configuration entries from file is unaffected.
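The quadruple-brace problem comes from Python's `str.format` escaping rules: a literal `{` must be written `{{`, so a default that should end up containing a Jinja expression such as `{{ ds }}` previously had to store each brace doubled. A minimal illustration (the template value here is made up, not an actual Airflow default):

```python
# Hypothetical templated default that should ultimately read
# "{{ ds }}/{try_number}.log". Because str.format treats "{{" as an
# escaped literal "{", the Jinja braces must be quadrupled to survive
# one formatting pass.
template = "{{{{ ds }}}}/{try_number}.log"
rendered = template.format(try_number=1)
print(rendered)  # {{ ds }}/1.log
```

Marking such entries with ``is_template`` skips that formatting pass, so the braces can be written once.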

This change aims to be 100% backwards compatible with the previous
implementation when it comes to functionality, even if internals
changed. It also does not add provider-specific changes that are
coming separately.

The only changes visible to the user are:

* generated airflow.cfg is slightly more
  readable and displays names of variables that can be used to override
  each configuration (which is very useful for copy&pasting)

* users are advised, instead of copying the default_airflow.cfg, to use
  `airflow config list --defaults` to generate a production config. This
  configuration has all the entries commented out, so that they can
  selectively uncomment and change the values they want. This is now
  promoted as "best practice" in the documentation.
…pache#32514)

* build(pre-commit): add list of supported deferrable operators to doc

* docs(providers): move providers list to apache-airflow-providers core-extension
* Allow configuration to be contributed by providers

The changes implemented:

* provider.yaml files for providers can optionally contribute extra
  configuration, the configuration is exposed via "get_provider_info"
  entrypoint, thus allowing Airflow to discover the configuration
  from both sources (in Breeze and local development) and from
  installed packages

* Provider configurations are lazily loaded - only for commands that
  actually need them

* Documentation for configuration contributed by providers is
  generated as part of Provider documentation. It is also discoverable
  by having a "core-extension" page displaying all community providers
  that contribute their own configuration.

* Celery configuration (and in the future Kubernetes configuration) is
  linked directly from the airflow documentation - the providers are
  preinstalled, which means that celery (and Kubernetes in the future)
  configuration is considered as important to be directly mentioned
  and linked from the core. Similarly Celery and Kubernetes executor
  documentation remains in the core documentation (still, configuration
  options are detailed only in the provider documentation and only
  linked from the core).

* configuration writing happens in "main" not in the configuration
  initialization and we will always execute provider configuration
  initialization. This will make sure that the generated configuration
  will contain configuration for the providers as well.

* Related documentation about custom and community providers has been
  updated and somewhat refactored - I realized that some of it was quite
  out-of-date and some of it was really "developer" not user docs.
  The docs are restructured a bit, cleaned, missing information is
  added and old/irrelevant parts removed.
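As a sketch of the discovery mechanism described above (all names and the exact dict layout here are hypothetical stand-ins, not the real provider.yaml schema), a provider's ``get_provider_info`` entrypoint might expose contributed configuration roughly like this:

```python
# Hypothetical sketch: a provider package returns its contributed
# configuration from the "get_provider_info" entrypoint as a dict
# mirroring the "config" section of its provider.yaml.
def get_provider_info():
    return {
        "package-name": "apache-airflow-providers-example",
        "config": {
            "example": {
                "description": "Settings contributed by the example provider.",
                "options": {
                    "endpoint_url": {
                        "description": "Service endpoint to talk to.",
                        "type": "string",
                        "default": "https://example.invalid",
                    },
                },
            },
        },
    }

# The Providers Manager would read this lazily, only for commands
# that actually need provider configuration.
info = get_provider_info()
print(sorted(info["config"]["example"]["options"]))  # ['endpoint_url']
```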

Co-authored-by: Jed Cunningham <[email protected]>

* Update airflow/configuration.py

Co-authored-by: Jed Cunningham <[email protected]>

---------

Co-authored-by: Jed Cunningham <[email protected]>
Comment on lines +465 to +466
Publishing the documentation
--------------------------

This is at least one of the reasons the build-docs is failing; the line needs to be the same length as the heading.

Suggested change
Publishing the documentation
--------------------------
Publishing the documentation
----------------------------

@@ -462,6 +462,51 @@ Those are all available flags of ``build-docs`` command:
:width: 100%

The changes in this file don't feel related to me. Am I missing something, or should this be a separate PR?


breeze release-management publish-docs

The publishing documentation consists steps:

Suggested change
The publishing documentation consists steps:
Publishing documentation consists of the following steps:

"""
Interact with AWS Neptune using proper client from the boto3 library.

Hook attribute `conn` has all methods that listed in documentation

Suggested change
Hook attribute `conn` has all methods that listed in documentation
Hook attribute `conn` has all methods listed in the documentation

Comment on lines +58 to +59
:rtype: str
:raises AirflowNotFoundException: If the DB cluster does not exist.

Suggested change
:rtype: str
:raises AirflowNotFoundException: If the DB cluster does not exist.

We don't list rtypes anymore and I haven't seen anywhere else that lists the :raises:, though I'm not necessarily against the idea on that one.

stop_db_response = self._stop_db()
if self.wait_for_completion:
self._wait_until_db_stopped()
return json.dumps(stop_db_response, default=str)

Here and above: Please verify this manually and let me know if this is working as you expect. I've seen json.dumps fail in the past when the json includes datetimes, and these responses do.
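The failure mode mentioned here is easy to reproduce with a minimal sketch (the response dict below is made up, not a real boto3 payload): `json.dumps` raises `TypeError` on `datetime` values unless a fallback such as `default=str` is supplied.

```python
import json
from datetime import datetime

# Made-up stand-in for a boto3 Neptune response containing a datetime.
response = {
    "DBClusterIdentifier": "demo",
    "ClusterCreateTime": datetime(2023, 7, 21, 8, 4),
}

try:
    json.dumps(response)
except TypeError as exc:
    print("plain dumps fails:", exc)

# default=str is applied to any value json.dumps cannot serialize,
# turning the datetime into its string form.
print(json.dumps(response, default=str))
```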

self.hook.wait_for_db_cluster_state(self.db_identifier, target_state="stopped")


__all__ = ["NeptuneStartDbOperator", "NeptuneStopDbOperator"]

Pretty sure you can drop this?



class NeptuneDbType(Enum):
"""Only available types for the AWS Neptune DB"""

D205 nitpick

Suggested change
"""Only available types for the AWS Neptune DB"""
"""Only available types for the AWS Neptune DB."""

Comment on lines +84 to +86
deferrable=True,
waiter_delay=30,
wait_for_completion=399,

This doesn't appear related to the rest of the PR.

Also: System tests don't currently support deferrable operators, and wait_for_completion is a bool.

catchup=False,
) as dag:
test_context = sys_test_context_task()


Unless there is a good reason otherwise, system tests should be self-contained, so the DB should be created and destroyed within the test. I know there aren't operators for that yet, but you can use the boto API in a @task to do it, as seen in other tests. Something like:

@task
def create_db():
    client = NeptuneHook().conn
    
    client.create_db_cluster(foo)
    client.create_db_instance(bar)

Then a teardown task at the end to undo it.
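Fleshing that suggestion out, a hedged sketch of matching setup and teardown bodies might look like the following. The identifiers, instance class, and the stub client are hypothetical; in the real system test the client would be `NeptuneHook().conn` and the two functions would be `@task`-decorated, with the teardown scheduled last.

```python
CLUSTER_ID = "example-neptune-cluster"    # hypothetical identifier
INSTANCE_ID = "example-neptune-instance"  # hypothetical identifier


class StubNeptuneClient:
    """Records boto3-style Neptune API calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def method(**kwargs):
            self.calls.append((name, kwargs))
        return method


def create_db(client):
    # Setup: create the cluster, then an instance inside it.
    client.create_db_cluster(DBClusterIdentifier=CLUSTER_ID, Engine="neptune")
    client.create_db_instance(
        DBInstanceIdentifier=INSTANCE_ID,
        DBInstanceClass="db.t3.medium",
        Engine="neptune",
        DBClusterIdentifier=CLUSTER_ID,
    )


def delete_db(client):
    # Teardown: destroy in reverse order, skipping final snapshots.
    client.delete_db_instance(DBInstanceIdentifier=INSTANCE_ID, SkipFinalSnapshot=True)
    client.delete_db_cluster(DBClusterIdentifier=CLUSTER_ID, SkipFinalSnapshot=True)


client = StubNeptuneClient()
create_db(client)
delete_db(client)
print([name for name, _ in client.calls])
# ['create_db_cluster', 'create_db_instance', 'delete_db_instance', 'delete_db_cluster']
```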

@ferruzzi
Contributor

Sorry, just noticed this was still a draft.

@github-actions

github-actions bot commented Sep 8, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 8, 2023
@github-actions github-actions bot closed this Sep 13, 2023
Labels
area:providers area:system-tests provider:amazon-aws AWS/Amazon - related issues stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add AWS Neptune hook and operators
6 participants