
SOLR-14613: Autoscaling replacement using placement plugins #1845

Merged (6 commits) Sep 16, 2020

Conversation

@murblanc (Member) commented Sep 9, 2020:

The previous PR (#1684) was too large and too slow.

This new PR takes into account (most) comments made on the old PR.

This code is untested! It wasn't actually run in its current form. It is still a work in progress, but I want to make sure it is now acceptable and possibly start parallelizing work on it... (post merge?)

@noblepaul (Contributor) commented:

Looks good to me.

My only displeasure is the presence of SolrCollection, ReplicaType, etc., but I can live with that.

@thelabdude (Contributor) commented Sep 10, 2020:

Building on @noblepaul 's concern, I do think we need to reconcile the newly added interfaces in the org.apache.solr.cluster.api package (under solrj) with the very similar interfaces in the org.apache.solr.cluster.placement package added in this PR. I do realize the org.apache.solr.cluster.api package was introduced while the work on this PR was already underway. My intent here is to move forward with a plan we’re all comfortable with and not rehash the past.

To recap my understanding from the previous PR comments, the intent of having the Node, SolrCollection, Shard, Replica interfaces in org.apache.solr.cluster.placement is to avoid exposing SolrCloud internals to the plugin writer.

While I agree that is a laudable goal in general, especially for plugins, this particular framework actually needs access to internal information about the cluster. In other words, placing replicas is a very internal (and core) concern. The fact that we're exposing this as a pluggable implementation is really for operational convenience.

Moreover, I do believe implementing a placement strategy requires most of the metadata present in collections, shards, replicas, and nodes, so I don't know whether the cost of having two representations of the same domain objects in two different places is worth the benefit it provides. I think the community needs to decide whether this is how we want to move forward.

So referring back to Ilan's stated goals:

  1. placement plugin writing is easy,
  2. implementation for this placement API is efficient, and
  3. plugins do not break when Solr is refactored

I’d actually argue that goal #1 is subjective and hard to measure. For instance, do we consider the SamplePluginAffinityReplicaPlacement impl easy to write? It seems to require a fair amount of knowledge of how Solr’s various objects interact.

I believe @murblanc has done a great job with goal #2 in this PR.

For goal #3, my sense is that any refactoring probably cannot break the interfaces defined in org.apache.solr.cluster.api without fundamentally changing the architecture of SolrCloud, which would likely break assumptions made in the org.apache.solr.cluster.placement package as well.

Lastly, from our Slack conversation: I was only suggesting that instead of the plugin impl introducing AzWithNodes, we formalize it as a DataCenter in the API that does what AzWithNodes does in the code (a rough sketch follows). Not a big deal, though...
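For illustration only, the suggestion might look something like this; the DataCenter interface and both method names are hypothetical, not part of this PR:

```java
import java.util.Set;
import org.apache.solr.cluster.placement.Node;

// Hypothetical formalization of the plugin's AzWithNodes grouping as a
// first-class API concept. Neither this interface nor its method names
// exist in the PR; they only illustrate the suggestion above.
public interface DataCenter {
  String getAvailabilityZone(); // the AZ shared by the nodes below
  Set<Node> getNodes();         // nodes located in that availability zone
}
```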

@murblanc (Member, Author) commented:

Thanks @thelabdude for your long and useful comment. Let me try to give my take on this.

By saying in point 1 that writing plugins should be easy, I meant that boilerplate code should not get in the way or force more lines of code than really necessary. It's the ability to write things such as for (Replica replica : shard.replicas()), Collections.shuffle(nodes, new Random()), and replica.getShard().getState(). Such compact statements require no explanation and are natural for any Java programmer.

My thinking is that once the interfaces in org.apache.solr.cluster.placement are understood, given an understanding of the corresponding concepts in Solr, writing a placement plugin is relatively accessible, as the sketch below suggests. If you look at SamplePluginAffinityReplicaPlacement, most of the complexity is not Solr-related but lies in the business logic of the placement decisions (sorting, filtering, etc.).
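To make that concrete, here is a minimal sketch of walking these abstractions; beyond the statements quoted above, the shards(), getNode(), getShardName(), and getName() accessors are assumptions used for illustration:

```java
import org.apache.solr.cluster.placement.Node;
import org.apache.solr.cluster.placement.Replica;
import org.apache.solr.cluster.placement.Shard;
import org.apache.solr.cluster.placement.SolrCollection;

public class PlacementWalkSketch {
  // Prints which node hosts each replica of a collection. Accessor names
  // other than shard.replicas() are assumed for this sketch.
  static void printLayout(SolrCollection collection) {
    for (Shard shard : collection.shards()) {
      for (Replica replica : shard.replicas()) {
        Node node = replica.getNode();
        System.out.println(shard.getShardName() + " -> " + node.getName());
      }
    }
  }
}
```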

For point 2, I didn't want the API to get in the way of efficiency. The current implementation of the API is definitely not optimized (no multithreading, etc.) but this can be changed without impacting the API or existing plugins. I believe we reached a good place. I too prefer how the attribute fetching looks now (@noblepaul's contribution) over what I initially proposed.

Point 3 is very important. Any internal Solr interface is relatively easy to change: we own the code using it and can adapt it as the interface is modified. Once we start handing these interfaces to external code (external to the lucene-solr GitHub repo, really), changing (or not changing) them becomes a lot more complex and painful.

My assumption here is that placement code might be implemented by outside users to suit their specific needs, and that code might not be contributed back to the project (as opposed to the plugin I wrote, which will be the default one and a possible starting point for custom ones). Therefore, we want to be able to keep these interfaces unchanged even if the internal implementation changes. Of course, if internal concepts change, then the interfaces will likely have to change too. For example, if the notion of shard leader goes away (imagine...), then that part of the API (be it defined on the Replica or on the Shard) no longer makes sense.

Take as an example the ongoing discussions about configuration. The plugin writer should not have to change code based on how and where we decide placement plugin configuration should live.

Last, the cluster abstractions for the placement plugins do not necessarily represent the existing cluster! In the initial (current) proposal they do (see SimpleClusterAbstractionsImpl), but soon we'll want to provide a forward view of how the cluster will (likely) look after known past assignment decisions are applied (these are quite slow to happen due to the structure of the Overseer). We had a similar mechanism in Autoscaling in 8x (the notion of Session); here we can keep the plugin writer completely agnostic of that fact, and placement decisions will simply become better (especially under high load) once we change the internal implementation. This is to say that the interfaces should really focus on the concepts at play and not on the current internal implementation of those concepts, as the implementations of these interfaces will drift away from their internal counterparts quite soon.
If we want to recreate a simulation environment, focusing on concepts rather than implementation simplifies things a lot as well. I guess everybody agrees on these last points, though.

All this being said, it would be better to unify cluster abstractions (and possibly other abstractions) that are to be used by external code and have a single set of abstractions (interfaces). External uses of such interfaces include placement code (this PR), event processing (see SOLR-14749) and possibly other external code that needs to interact with the cluster. The interfaces defined here were used to write the plugin, and were changed in the process to simplify the plugin code. I believe if we make them evolve to adapt to event processing we'll have a pretty good coverage of potential uses.

@murblanc (Member, Author) commented:

I plan to commit this code soon, so please comment quickly if needed...

Note that this code is disabled by default until a user updates /clusterprops.json to use a placement plugin, so risk is low and limited to a few touch points: picking strategy in Assign.createAssignStrategy(), changes to CollectionHandlerApi and related classes/files for /clusterprops.json manipulation.
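For illustration, enabling a plugin via the new command might be a POST to the /api/cluster V2 endpoint with a body roughly like the following; the payload shape and the plugin's package are assumptions, only the command name and the sample class name appear in this thread:

```json
{
  "set-placement-plugin": {
    "class": "org.apache.solr.cluster.placement.plugins.SamplePluginAffinityReplicaPlacement"
  }
}
```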

*
* <p><b>WARNING:</b> this call will be extremely inefficient on large clusters. Usage is discouraged.
*/
Set<String> getAllCollectionNames();
Review comment (Contributor):

IIRC at some point we've considered using an Iterator here instead.
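A sketch of what such an Iterator-based alternative could look like; both method names here are hypothetical:

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical lazier alternatives to materializing the full Set of names
// on a large cluster; neither method name is from the PR.
interface CollectionNamesSketch {
  Iterator<String> collectionNamesIterator();      // pull-based traversal
  void forEachCollectionName(Consumer<String> fn); // push-based traversal
}
```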

/**
* Representation of a SolrCloud node or server in the SolrCloud cluster.
*/
public interface Node {
Review comment (Contributor):

So ... given that there's already a SolrNode interface in master, which already provides isolation from implementation details, shouldn't we use that here? The same applies to SolrCollection and ShardReplica.

* Returns the number of replicas to create, as returned by the corresponding method {@link #getCountNrtReplicas()},
* {@link #getCountTlogReplicas()} or {@link #getCountPullReplicas()}. Might delete the other three.
*/
int getCountReplicasToCreate(Replica.ReplicaType replicaType);
Review comment (Contributor):

I slightly prefer this method, as it allows us to modify available replica types without changing the interface.
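For example, a caller can stay agnostic of the concrete set of replica types; in this sketch only getCountReplicasToCreate and Replica.ReplicaType come from the PR, and the enclosing request type is a placeholder:

```java
import org.apache.solr.cluster.placement.Replica;

class ReplicaCountSketch {
  // Adding a new constant to Replica.ReplicaType changes neither this loop
  // nor the interface; PlacementRequestLike stands in for whichever request
  // interface actually declares the method.
  static int totalReplicasToCreate(PlacementRequestLike request) {
    int total = 0;
    for (Replica.ReplicaType type : Replica.ReplicaType.values()) {
      total += request.getCountReplicasToCreate(type);
    }
    return total;
  }

  // Placeholder for the enclosing request interface.
  interface PlacementRequestLike {
    int getCountReplicasToCreate(Replica.ReplicaType replicaType);
  }
}
```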

* Objects of this type are returned by the Solr framework to the plugin; they are not directly built by the plugin. When the
* plugin wants to add a replica, it goes through the appropriate method in {@link PlacementPlanFactory}.
*/
public interface Replica {
Review comment (Contributor):

This should be merged with the existing ShardReplica to avoid creating separate abstractions for each subsystem.

/**
* @return the name of the {@link Shard} for which the replica should be created
*/
String getShardName();
Review comment (Contributor):

Should we also have the collection name here for completeness?

/**
* Shard in a {@link SolrCollection}, i.e. a subset of the data indexed in that collection.
*/
public interface Shard {
Review comment (Contributor):

Similar to the other top-level abstractions, this interface should be merged with the existing Shard interface, after resolving the main differences (the use of iterators vs. SimpleMap, which getters we absolutely need in this interface, etc.).

/**
* Represents a Collection in SolrCloud (unrelated to {@link java.util.Collection}, which uses the nicer name).
*/
public interface SolrCollection {
Review comment (Contributor):

See my other comments about merging this with the existing SolrCollection.

@@ -141,6 +141,21 @@
}
}
}
}
},
"set-placement-plugin": {
Review comment (Contributor):

@noblepaul do we still need these awful json apispecs if we use the V2 API annotations?

@noblepaul (Contributor) commented Sep 15, 2020:

No, we do not need any more apispecs. I'm planning to eliminate the existing ones. @murblanc please remove this change. Annotations take care of this automatically.

@murblanc (Member, Author) replied:

@noblepaul or @sigram, can you please point me to existing code in which some of the commands on a path (here /api/cluster) use annotations and some use apispec?
I haven't found such a mix, and given that the existing /api/cluster commands (add-role, remove-role, set-property, set-obj-property) are defined in the apispec JSON file, that's where I've added the two new ones.

@murblanc (Member, Author) added:

Or, put differently (if there's no simple way to use annotations for the two new commands): when the existing four commands are migrated, migrating the two new ones along with them is not likely to make the task any harder.
Keeping the new definitions in apispecs would therefore make sense for now, for the sake of simplicity.

Reply (Contributor):

I'll provide a patch.

@noblepaul (Contributor) commented:

@murblanc

Please switch to annotations before you commit this.

@murblanc (Member, Author) commented:

I need more guidance. I've implemented the new commands the way the existing ones under the same path are implemented. I don't know how to "switch to annotations". Can you please point me to developer documentation that would help me here?

@noblepaul (Contributor) commented Sep 16, 2020:

> I need more guidance.

Add your APIs here

@murblanc (Member, Author) commented:

> I need more guidance.
>
> Add your APIs here

Thanks!
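For reference, the annotation-based V2 style being pointed to looks roughly like this; the class name, payload type, and permission constant are assumptions, only the command name comes from this PR:

```java
import java.util.Map;
import org.apache.solr.api.Command;
import org.apache.solr.api.EndPoint;
import org.apache.solr.api.PayloadObj;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.security.PermissionNameProvider;

// Rough sketch of an annotation-registered V2 API command; details beyond
// the command name are assumptions for illustration.
@EndPoint(method = SolrRequest.METHOD.POST,
          path = "/cluster",
          permission = PermissionNameProvider.Name.COLL_EDIT_PERM)
public class PlacementPluginAPI {
  @Command(name = "set-placement-plugin")
  public void setPlacementPlugin(PayloadObj<Map<String, Object>> payload) {
    // Persist the plugin configuration (e.g. to /clusterprops.json) here.
  }
}
```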

@murblanc murblanc merged commit c7d234c into apache:master Sep 16, 2020
@murblanc murblanc deleted the SOLR-14613 branch September 16, 2020 22:33