Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge from author #5

Merged
merged 483 commits into from
Oct 9, 2019
Merged

merge from author #5

merged 483 commits into from
Oct 9, 2019

Conversation

dengweisysu
Copy link
Owner

No description provided.

jkakavas and others added 30 commits September 25, 2019 14:30
Re-enable BWC tests now that #46534 has been backported to 7.x
Today we log and swallow exceptions during cluster state application, but such
an exception should not occur. This commit adds assertions of this fact, and
updates the Javadocs to explain it.

Relates #47038
Simplify `SnapshotResiliencyTests` to more closely
match the structure of `AbstractCoordinatorTestCase` and allow for
future drying up between the two classes:

* Make the test cluster nodes a nested-class in the test cluster itself
* Remove the needless custom network disruption implementation and
  simply track disconnected node ids like `AbstractCoordinatorTestCase`
  does
This change removes unused parameters and simplifies the logic.
We already prevent flushing in Engine if it's recovering. Hence, we can
remove the protection in IndexShard.
#46180 added support for the `[source,console]`
language for snippets which should be tested.

This removes support for the `// CONSOLE` magic comment,
which serve a similar purpose.

Snippets that include the `// CONSOLE` magic comment will return
an exception.
This PR removes a use-case of the ClusterFormationTasks and converts a
project that flew under the radar so far.
There's probably more clean-up possible here, but for now the goal is
to be able to remove that code after `RunTask` is also updated.
* [DOCS] Reformats ranking evaluation API.
Co-Authored-By: James Rodewig <[email protected]>
This PR changes the ingest executing to be non blocking
by adding an additional method to the Processor interface
that accepts a BiConsumer as handler and changing
IngestService#executeBulkRequest(...) to ingest document
in a non blocking fashion iff a processor executes
in a non blocking fashion.

This is the second PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.

This change originates from the enrich branch and was introduced there
in #43361.
* ILM: parse origination date from index name

Introduce the `index.lifecycle.parse_origination_date` setting that
indicates if the origination date should be parsed from the index name.
If set to true an index which doesn't match the expected format (namely
`indexName-{dateFormat}-optional_digits` will fail before being created.
The origination date will be parsed when initialising a lifecycle for an
index and it will be set as the `index.lifecycle.origination_date` for
that index.

A user set value for `index.lifecycle.origination_date` will always
override a possible parsable date from the index name.
Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.
This debug was never logged, so although #45652
is not yet fixed there is no point keeping it.

The strange IndexNotFoundException comes entirely
from the authz layer.

Relates #46739
Relates #45652
Currently the logic to check if a connection to a remote discovery node
exists and otherwise create a proxy connection is mixed with the
collect nodes, cluster connection lifecycle, and other
RemoteClusterConnection logic. This commit introduces a specialized
RemoteConnectionManager class which handles the open connections.
Additionally, it reworks the "round-robin" proxy logic to create the list
of potential connections at connection open/close time, opposed to each
time a connection is requested.
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
Enables support for Cartesian geometries shape type. We still need to
decide how to handle the distance function since it is currently using
the haversine distance formula and returns results in meters, which
doesn't make any sense for Cartesian geometries.

Closes #46412
Relates to #43644
* [ML][Inference] adding tree model

* renaming features for updated schema
* Wait for snapshot completion in SLM snapshot invocation

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.

Relates to #38461 and #43663
We can have a large number of shard copies in this test. For example,
the two recent failures have 24 and 27 copies respectively and all
replicas have to copy segment files as their stores are corrupted. Our
CI needs more than 30 seconds to start all these copies.

Note that in two recent failures, the cluster was green just after the
cluster health timed out.

Closes #41899
This commit removes a bunch of unused private fields and 
unused private methods from the code base.
…cified (#46897)

This commit adds the documentation to point the user that when one
creates API keys with no role descriptor specified then that API
key will have a point in time snapshot of user permissions.

Closes#46876
…es (#47019)

When we added support for wildcard application names, we started to build
the prefix query along with the term query but we used 'filter' clause
instead of 'should', so this would not fetch the correct application
privilege descriptor thereby failing the _has_privilege checks.
This commit changes the clause to use should and with minimum_should_match
as 1.
This commit adds a Java source formatter and checker into the build process.
This is not yet enabled for any sub-projects - to format and check a
sub-project, add its Gradle path into `build.gradle` and run:

    ./gradlew spotlessApply

to format, and:

    ./gradlew spotlessJavaCheck
    # or:
    ./gradlew precommit

to verify formatting.
jrodewig and others added 29 commits October 7, 2019 09:37
)

When deactivating a watch, there is a chance that it is fully deactivated
and reporting as not running but the history is not fully written yet.
There is not a tight coupling between the associated watcher history
index and the deactivation. This test assumes that once a watch is
deactivated that all history is fully written in a very short time period.
If the Watch is deactivated, but the history is slow to write it can result
in a failing test.

This change removes an assertion that assumes that the deactivation of a watch
ensured the all of the watch history was written. There is still a minor race
condition with respect to the remaining history assertions. However, if the
history is slow to be written, it will allow the test to still passing.

fixes #47503
move the main endpoint to /_transform/ from /_data_frame/transforms/ with providing backwards compatibility and deprecation warnings
…47611)

This has ELambda and ENewArrayFunctionRef add their generated synthetic
methods to the SClass node during the semantic pass and removes this 
data from the write pass. This is the first step to remove "Globals" (mutable 
state) from the write pass.
…7447)

* [ML][Inference] adjusting definition object schema and validation

* finalizing schema and fixing inference npe

* addressing PR comments
If a thread pool rejection exception happens, an alternative code
path is chosen to write history and delete the trigger. If an exception
happens during deletion of the trigger an exception may be thrown and not
caught.

This commit catches the exception and provides a meaning error message.

fixes #47008
This test is believed to be fixed by #43939

Closes #45585
These tests are believed to be fixed by #43939

closes #45582 and #43975
This test was less stable following a backport as the shards in these
indices did not always show up as allocated immediately after closing
them. This ensures those shards have stabilized before trying to roll
over.
This test is believed to be fixed by #43939

closes #43889
This test is believed to be fixed by #43939

closes #43988
* Explicit name for doc snippets

This change adds option to specify explicit name for doc snippet.
This name will be used instead of line number when creating
yml file in buildRestTests task.
Stable names should improve tracking changes through history and allow
Gradle to skip tests on non-code docs changes.

* Avoid duplication in names

* Changes id declaration, more examples

* Fix names in examples

* Unit test added

* Throw exception on duplicate names

* Moved UT to Java
The example use of a scoring script was incorrectly using a filter
script query, which has no scoring, and thus no _score variable
avialable. This commit converts the example doc to using the newer
script_score query.
This enhances the existing SLM test using users/roles/etc to also test
that SLM retention works when security is enabled.

Relates to #43663
Previously when retrieving an SLM policy it would always return a 200
with `{}` in the body, even if the policy did not exist. This changes
that behavior to throw an error (similar to our other APIs) if a
policy doesn't exist.

This also adds a basic CRUD yml test for the behavior.

Resolves #47664
Setting `cluster.routing.allocation.disk.include_relocations` to `false` is a
bad idea since it will lead to the kinds of overshoot that were otherwise fixed
in #46079. This commit deprecates this setting so it can be removed in the next
major release.
Importing dangling indices with aliases risks breaking functionalities
using those aliases. For instance, writing to an alias may break if
there is no is_write_index indication on the existing alias and the
dangling index import adds a second index to the alias. Or an
application could have an assumption about the alias only ever pointing
to one index and suddenly seeing the alias also linked to an old index
could break it. With this change we strip aliases of the index meta data
found before importing a dangling index.
Setting `cluster.routing.allocation.disk.include_relocations` to `false` is a
bad idea since it will lead to the kinds of overshoot that were otherwise fixed
in #46079. This setting was deprecated in #47443. This commit removes it.
When exceptions could be returned from another node, the exception
might be wrapped in a `RemoteTransportException`. In places where
we handled specific exceptions using `instanceof` we ought to unwrap
the cause first.

This commit attempts to fix this issue after searching code in the ML
plugin.
)

* Convert RunTask to use testclusers, remove ClusterFormationTasks

This PR adds a new RunTask and a way for it to start a
testclusters cluster out of band and block on it to replace
the old RunTask that used ClusterFormationTasks.

With this we can now remove ClusterFormationTasks.
* Add Snapshot Lifecycle Retention documentation

This commits adds API and general purpose documentation for SLM
retention.

Relates to #43663

* Fix docs tests

* Update default now that #47604 has been merged

* Update docs/reference/ilm/apis/slm-api.asciidoc

Co-Authored-By: Gordon Brown <[email protected]>

* Update docs/reference/ilm/apis/slm-api.asciidoc

Co-Authored-By: Gordon Brown <[email protected]>

* Update docs with feedback
Similar to #47507. We are throwing `SnapshotException` when
you (and SLM tests) would expect a `SnapshotMissingException`
for concurrent snapshot status and snapshot delete operations
with a very low probability.
Fixed the exception type and added a test for this scenario.
This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.
Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.

With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
If a cluster sending monitoring data is unhealthy and triggers an
alert, then stops sending data the following exception [1] can occur.

This exception stops the current Watch and the behavior is actually
correct in part due to the exception. Simply fixing the exception
introduces some incorrect behavior. Now that the Watch does not
error in the this case, it will result in an incorrectly "resolved"
alert.  The fix here is two parts a) fix the exception b) fix the
following incorrect behavior.

a) fixing the exception is as easy as checking the size of the
array before accessing it.

b) fixing the following incorrect behavior is a bit more intrusive

- Note - the UI depends on the success/met state for each condition
to determine an "OK" or "FIRING"

In this scenario, where an unhealthy cluster triggers an alert and
then goes silent, it should keep "FIRING" until it hears back that
the cluster is green. To keep the Watch "FIRING" either the index
action or the email action needs to fire. Since the Watch is neither
a "new" alert or a "resolved" alert, we do not want to keep sending
an email (that would be non-passive too). Without completely changing
the logic of how an alert is resolved allowing the index action to
take place would result in the alert being resolved. Since we can
not keep "FIRING" either the email or index action (since we don't
want to resolve the alert nor re-write the logic for alert resolution),
we will introduce a 3rd action. A logging action that WILL fire when
the cluster is unhealthy. Specifically will fire when there is an
unresolved alert and it can not find the cluster state.
This logging action is logged at debug, so it should be noticed much.
This logging action serves as an 'anchor' for the UI to keep the state
in an a "FIRING" status until the alert is resolved.

This presents a possible scenario where a cluster starts firing,
then goes completely silent forever, the Watch will be "FIRING"
forever. This is an edge case that already exists in some scenarios
and requires manual intervention to remove that Watch.

This changes changes to use a template-like method to populate the 
version_created for the default monitoring watches. The version is 
set to 7.5 since that is where this is first introduced.

Fixes #43184
* Separate SLM stop/start/status API from ILM

This separates a start/stop/status API for SLM from being tied to ILM's
operation mode. These APIs look like:

```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```

This allows administrators to have fine-grained control over preventing
periodic snapshots and deletions while performing cluster maintenance.

Relates to #43663

* Allow going from RUNNING to STOPPED

* Align with the OperationMode rules

* Fix slmStopping method

* Make OperationModeUpdateTask constructor private

* Wipe snapshots better in test
This commit adds a thread filter for gradle unit tests which omits
threads gradle may create that we have no control over shutting down.
The current example of this is for project.exec which gradle pools.

closes #47417
@dengweisysu dengweisysu merged commit 50a4f37 into dengweisysu:master Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.