Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding https checks for HAproxy and Galera #494

Merged
merged 14 commits into from
Oct 2, 2023
Merged

Conversation

wejdross
Copy link
Contributor

What this PR does / why we need it:

  • This change ensures that Haproxy sends traffic only to nodes with wsred_ready=1 metric set. It queries prometheus endpoint and parses config. Buffer size for haproxy was raised from 16kB to 300kB, otherwise Haproxy can't "see" metrics.

Checklist

  • Chart Version bumped
  • I have run make docs
  • Variables are documented in the values.yaml using the format required by Helm-Docs.
  • PR contains the label that identifies the chart, e.g. chart/<chart-name>
  • PR contains the label that identifies the type of change, which is one of
    [ bug, enhancement, documentation, change, breaking, dependency ]

@wejdross wejdross added enhancement New feature or request chart/haproxy labels Sep 11, 2023
@wejdross wejdross requested a review from a team September 11, 2023 07:01
@wejdross wejdross self-assigned this Sep 11, 2023
@wejdross wejdross requested a review from TheBigLee September 11, 2023 07:02
Copy link
Contributor

@TheBigLee TheBigLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that approach for a while now, and I don't think this is a good idea to use the metrics exporter for this:

  • The result from the exporter can get "huge" for really big instances with loads of databases and tables and the 300k in tune.bufsize might not be enough.
  • It puts a lot of strain on the haproxy if he has to parse the result every time for that mysql_global_status_wsrep_ready string. Especially for big clusters with loads of metrics
  • It puts unnecessary strain on the cluster as it queries the endpoint every second.
  • If the exporter fails (or the query is too slow or too big), the whole instance goes down.

@wejdross
Copy link
Contributor Author

so what do You suggest? should we write custom sidecar? how is that solution better?

@TheBigLee
Copy link
Contributor

so what do You suggest? should we write custom sidecar? how is that solution better?
Yes and using a script to determine if the galera node is up and running.

Also using mysql_global_status_wsrep_ready might not be enough to determine if a node is ready to accept connections. Afaik this value is true as well if the node is in the DONOR state. However, the node is blocked during that state and can't accept queries. Which again will result in downtimes as haproxy assumes the node is up.

@wejdross
Copy link
Contributor Author

I took info from docs:

Whether the server is ready to accept queries. If this status is OFF, almost all of the queries will fail with:
ERROR 1047 (08S01) Unknown Command

Ready to receive queries sounds like it should be ready to receive queries :D
But if You don't like this solution, what do You suggest? custom sidecar parsing metrics and exposing http endpoint with statuses 200/503?

the only reason why I like solution with sidecar is this buffer which is hard to estimate and 244kB is not that small amout. but saving memory in haproxy would cause that we need to run 3 sidecars on each database engine consuming resources as well.

It's Your call @TheBigLee I'm OK with both solutions

@wejdross wejdross requested review from zugao and Kidswiss September 12, 2023 13:42
@zugao
Copy link
Contributor

zugao commented Sep 13, 2023

Can this be some kind of an alternative solution ?

@wejdross
Copy link
Contributor Author

Can this be some kind of an alternative solution ?

unfortunately not, whole thing is to detect when galera is ready, not mariadb itself :/

@zugao
Copy link
Contributor

zugao commented Sep 13, 2023

Can this be some kind of an alternative solution ?

unfortunately not, whole thing is to detect when galera is ready, not mariadb itself :/

But MariaDB will not respond ok if galera cluster is not working on that node no?

@TheBigLee
Copy link
Contributor

Can this be some kind of an alternative solution ?

unfortunately not, whole thing is to detect when galera is ready, not mariadb itself :/

But MariaDB will not respond ok if galera cluster is not working on that node no?

It can actually. MariaDB can respond ok but not accept queries (eg. when in DONOR mode)
Furthermore, this is a haproxy enterprise feature.

@zugao
Copy link
Contributor

zugao commented Sep 13, 2023

Unfortunately I have limited knowledge with reverse proxies / load balancers such as HAProxy and Nginx. Nevertheless I took some time and read relevant docs and some blogs. There are 2 facts:

  1. HAProxy supports running a script to check on galera nodes.
  2. Increasing the tune.bufsize is strongly not recommended.

Now my 2 cents on these:

  1. Since this approach does not imply changing shady parameters I definitely would go with it, especially in our case where we might have huge databases. Now how we want to do it - via a simple script or full blown web server, we can discuss and decide as a team.
  2. From the docs it seems that changing the value of this parameter implies that other parameters have to be adjusted as well, otherwise nasty things may happen. My main concern is that later on we might have hidden issues because of this parameter which could take us an unbelievable amount of time to figure out the cause.

Conclusion:
I don't see this improvement as a serious reason to change a controversial parameter with 20x the default value, the risk is not justified. Either we do it properly or don't do it at all.

@Kidswiss we might also want to hear from you on this sensitive topic.

@Kidswiss
Copy link
Contributor

I agree with the consent here, given that the output of the exporter is so huge and we need to adjust not recommended settings, we should look for an alternative.

IMHO a script check would be the optimal solution here, the script could connect to the instance, query the wsrep status (or other relevant things, like donor status) and return back the readiness of a given node.

I would not go with webserver that exports that information again. Then we'll have yet another long running thing that could break. Also the indirect nature of that approach would probably introduce new edge cases.

@Kidswiss Kidswiss closed this Sep 14, 2023
@Kidswiss Kidswiss reopened this Sep 14, 2023
@wejdross
Copy link
Contributor Author

Short summary:

  • extended script logic to galera and galerak8s
  • added Dockerfile that builds haproxy image with mysql command required to our check
  • tested it locally with helm template and it looks good
  • tested logic in dev cluster by editing deployments, not by installing new ones

I'd be very happy to get some critic as I'm very inexperienced in helm charts development and I don't even know what should I look for.

@wejdross wejdross requested a review from TheBigLee September 18, 2023 07:44
Copy link
Contributor

@Kidswiss Kidswiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also no helm expert.

Just one nit, the rest looks good.

kind: ConfigMap
apiVersion: v1
data:
script.sh: "#!/bin/bash\n\n# Run the MySQL query and store the result in a Bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't helm read from a file and insert the string here?

Having the script separate in a readable form would greatly help the reviewability/maintainability of this script.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on that. I would try to do this via file if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a compromise I fixed formatting - wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate file would make it easier to check with shellcheck.

There's a way to read files in helm: https://helm.sh/docs/chart_template_guide/accessing_files/

Something like this should do the trick:

data:
  script.sh: |-
        {{ .Files.Get "script.sh" }}

kind: ConfigMap
apiVersion: v1
data:
script.sh: "#!/bin/bash\n\n# Run the MySQL query and store the result in a Bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on that. I would try to do this via file if possible.

appuio/haproxy/templates/configmap-galera.yaml Outdated Show resolved Hide resolved
appuio/haproxy/templates/configmap-galera-checkscript.yaml Outdated Show resolved Hide resolved
appuio/haproxy/templates/deployment.yaml Show resolved Hide resolved
appuio/haproxy/values.yaml Outdated Show resolved Hide resolved
@TheBigLee
Copy link
Contributor

The script doesn't seem to work in the current state:
Haproxy logs the following error:

Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: Enter password: ERROR 2002 (HY000): Can't connect to MySQL server on '11.33.187.2' (115)
[WARNING]  (1) : Process 21639 exited with code 1 (Exit)

@wejdross wejdross requested a review from TheBigLee September 25, 2023 08:50
@@ -1,6 +1,6 @@
# haproxy

![Version: 2.3.1](https://img.shields.io/badge/Version-2.3.1-informational?style=flat-square) ![AppVersion: 2.7.3](https://img.shields.io/badge/AppVersion-2.7.3-informational?style=flat-square)
![Version: 2.3.2](https://img.shields.io/badge/Version-2.3.2-informational?style=flat-square) ![AppVersion: 2.7.3](https://img.shields.io/badge/AppVersion-2.7.3-informational?style=flat-square)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would set the version of the helm chart to 2.4.0.

@wejdross wejdross requested review from Kidswiss and zugao September 26, 2023 11:58
appuio/haproxy/files/galera-check.sh Outdated Show resolved Hide resolved
fi

if [ $wsrep_ready != "ON" ]; then
echo Error: wsrep_ready is not ON, actual status is: $WSREP_READY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo Error: wsrep_ready is not ON, actual status is: $WSREP_READY
echo Error: wsrep_ready is not ON, actual status is: $wsrep_ready

@TheBigLee
Copy link
Contributor

Please run your script through shellcheck when the script is finished to catch some bash specific issues.

@TheBigLee TheBigLee self-requested a review September 29, 2023 12:27
@wejdross wejdross merged commit b7e0eae into master Oct 2, 2023
@wejdross wejdross deleted the haproxy/http_checks branch October 2, 2023 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chart/haproxy enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants