nlake44 edited this page Oct 9, 2014 · 8 revisions

This page explains how to locate bottlenecks and eliminate them.

Locate the bottleneck

Run "htop" or "top" on each node (aka machine, server, or VM). If any node's CPU is maxed out (its load average is higher than its number of cores), you've found your bottleneck. If the node has the "appengine" role, you may need to add a new node with that role. Likewise, if the node is a datastore node, you may need to add another datastore node.
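The check above (load average versus core count) can also be scripted. The following is a minimal sketch, not part of AppScale; the function names are hypothetical, and it assumes a Unix-like node where Python can read the load average.

```python
import os


def is_cpu_bound(load_average, core_count):
    """Return True when the load average exceeds the number of cores."""
    return load_average > core_count


def check_local_node():
    """Compare this node's 1-minute load average against its core count."""
    # os.getloadavg() is only available on Unix-like systems.
    one_minute_load, _, _ = os.getloadavg()
    # os.cpu_count() can return None; fall back to 1 core in that case.
    cores = os.cpu_count() or 1
    return is_cpu_bound(one_minute_load, cores)
```

Calling check_local_node() on each node gives a quick yes/no answer, but htop is still the better tool for seeing which process is consuming the CPU.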

Better utilization

AppScale tries to get the default configurations right, but sometimes they can be off because different apps have different usage footprints.

If you run htop on all your nodes and none of them is maxed out, requests may be getting queued up. The next step is to figure out where they are queuing, which you can see by enabling the HAProxy status page (described at the end of this page). Queuing generally happens either on the front end, before requests are served by the application servers, or on the backend, where there are not enough threads consuming datastore requests.

Application Server Scaling

On the front end, if autoscaling is being used, application servers should be scaled up automatically once the queue length on HAProxy reaches seven or more. AppScale may be unable to scale up because no resources are available on any "appengine" node. In this case, you'll have to add more nodes with this role.

If you use manual scaling, where the number of application servers per node is set in your AppScalefile, you'll have to increase this number and restart your AppScale deployment.

Optionally, you can increase the maximum number of connections per application server in your HAProxy configuration (the default is 7) and then reload HAProxy:

service haproxy reload

Backend Scaling

Each datastore node runs processes called "datastore_server.py". These processes take datastore requests from the application servers and map them to Cassandra operations. If HAProxy shows that requests are queuing up but the machine's CPU is not maxed out, you'll want to increase the number of datastore servers per node. Currently this is set to the number of CPUs on the node. To increase it, change the code (https://github.com/AppScale/appscale/blob/master/AppController/lib/datastore_server.rb#L81) to use two or three times the number of cores. This will reduce the queuing.

Cassandra Load Balancing

Implementing the GAE Datastore API requires being able to scan key ranges. This means storing data in contiguous sections, which can lead to hot spots. With 1x replication, one Cassandra node may be taking the load of an entire application. To rebalance the cluster so that the load is split across multiple nodes, do the following.

  1. On the database node run: /root/appscale/AppDB/cassandra/cassandra/bin/nodetool status

This will give you the percentage of load each node owns. If the load is even across nodes, you'll need to add new datastore nodes, which will bisect existing nodes to take load off them. If you have two nodes and 2x replication, you'll see that both nodes are at 100% load. If you see one node taking on all the load while the others are idle, continue with step 2.

  2. Get a sample of keys: /root/appscale/AppDB/cassandra/cassandra/bin/nodetool rangekeysample

This returns a sample of keys/tokens and shows how they are distributed in your Cassandra cluster. From the status you'll notice that these keys seem to mostly live on one node. Pick one from the middle of the sampled keys.

  3. Set a node to use this key: /root/appscale/AppDB/cassandra/cassandra/bin/nodetool -h <IP_of_node> move <long_hex_key_from_step_2>

Afterwards, run repair to clean up after the move: /root/appscale/AppDB/cassandra/cassandra/bin/nodetool repair

Your cluster is still operational but your performance may suffer during these operations.

This will move data around to match the new ring token you've set up. This can take a while if there is a lot of data to move. You can do the same thing for other nodes so that the keyspace is nicely uniform. If the layout is not uniform, you may need to go back to step 2 and repeat (the sample of keys may not have been equally spaced out).
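Picking keys "from the middle" of the sample, for one node or for several, can be sketched as below. This is only an illustration under stated assumptions: the helper name pick_split_tokens is hypothetical, the sampled keys are assumed to have already been extracted from the rangekeysample output into a list of even-length hex strings, and the ring is assumed to be ordered by raw byte value.

```python
def pick_split_tokens(sampled_keys, node_count):
    """Pick node_count - 1 evenly spaced keys from a sample as candidate tokens."""
    if node_count < 2:
        return []
    # Assumption: keys are hex strings and the ring orders them by raw bytes.
    ordered = sorted(set(sampled_keys), key=bytes.fromhex)
    if len(ordered) < node_count - 1:
        raise ValueError("not enough distinct sampled keys")
    # Take the keys sitting at the 1/n, 2/n, ... positions of the sorted sample.
    return [ordered[(i * len(ordered)) // node_count]
            for i in range(1, node_count)]
```

With node_count set to 2 this returns the single middle key, matching step 2. Each returned key would still be applied by hand with the nodetool move command from step 3, and the resulting layout checked again with nodetool status.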

Enabling HAProxy Stats

In your configuration file /etc/haproxy/gae_app_id.cfg add:

# Enable the web UI for haproxy statistics.
listen stats :1936
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /
    stats auth admin:passwordhere

Then open up that port in appscale/firewall.conf:

iptables -A INPUT -p tcp --dport 1936 -j ACCEPT

The port may take about 30 seconds to open up, as the firewall configuration is reloaded every 30 seconds.

Once you get to the HAProxy stats page you'll see a table of stats. To see the queue length, check the "backend" column.
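The same numbers can be read by a script, since HAProxy also serves the stats table as CSV when ";csv" is appended to the stats URI. The sketch below only parses CSV text that has already been fetched. The function name is hypothetical, the default threshold of 7 simply mirrors the autoscaling queue length mentioned earlier, and the proxy and server names in the usage example are made up.

```python
import csv
import io


def queued_requests(stats_csv, threshold=7):
    """Return {(proxy, server): qcur} for rows whose current queue >= threshold."""
    # HAProxy prefixes the CSV header line with "# "; strip it so that
    # DictReader sees plain column names such as pxname, svname and qcur.
    text = stats_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]
    queued = {}
    for row in csv.DictReader(io.StringIO(text)):
        qcur = row.get("qcur") or ""
        # Frontend rows leave qcur empty, so skip anything non-numeric.
        if qcur.isdigit() and int(qcur) >= threshold:
            queued[(row["pxname"], row["svname"])] = int(qcur)
    return queued
```

A short usage example with made-up data:

```python
sample = (
    "# pxname,svname,qcur,qmax,scur,\n"
    "gae_app,FRONTEND,,,3,\n"
    "gae_app,srv1,0,2,1,\n"
    "gae_app,BACKEND,8,12,8,\n"
)
print(queued_requests(sample))
```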
