Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THREESCALE-8404: Add ACL and TLS support for Redis #350

Closed
wants to merge 63 commits into from

Conversation

jlledom
Copy link
Contributor

@jlledom jlledom commented Sep 28, 2023

This is a work in progress to allow Apisonator to connect to Redis using ACL and TLS credentials.

Jira Issue:

THREESCALE-8404

Summary of changes:

  • Update gems (89fe2ef)
    • To add support to TLS
    • Stop using redis 4.3.1 fork and use upstream redis 5.0.7
  • Adapt code to the new boolifyed Redis API (bde5ea1)
    • Some methods like :sadd returned a boolean in redis 4.3.1, but now they return what they get from the Redis server.
    • redis 5.0.7 includes boolified versions of those methods, like :sadd?, which behave like the old :sadd
    • Also, now the redis client only accepts string as parameters
  • Adapt the code to the new pipelined Redis API (0677638)
    • redis 4.6.0 introduced some changes in the pipelined method, (Check Changelog).
    • Now it doesn't run Redis commands from the original client when under a pipelined scope. Commands can only be sent from the pipelined context.
  • Update the async client interface to match the new redis API (08e5a3b)
    • We want the business logic code to be equally valid in both sync and async modes. But redis, and async-redis gems provide different interfaces. We have an async client that reproduces the redis interface and translates it to async-redis calls.
    • Our async client interface has been updated to include calls to new boolyfied methods and use the correct pipelined context.
    • EDIT: I updated the pipeline logic and thread-isolated the async-redis client (c94bd27)
  • Fix tests (be9c1e6)
    • Updated the tests to adapt to the changes
  • Add new accepted params and environment variables (b99e307)
    • Add parameters to configure SSL Context
    • Provide environment variables to set those params
    • Update documentation
  • EDIT: Add support for ACL and TLS in async mode (8d8c4af)
    • Instead of the redis gem, the async mode uses the async-redis gem. async-redis hasn't been updated, but I modified our client to add support for TLS and ACL.

Connecting to Redis using TLS and ACL is done. Running Apisonator in sync mode works fine and all tests pass in sync and async mode.

I'm blocked due to a deadlock in the async mode which couldn't fix yet. To reproduce it, just run the tests with make test. The test failing is spec/integration/worker_async_spec.rb

Pending tasks:

  • Fix the async deadlock.
  • Implement TLS in the async mode (check docs)
  • Update test/unit/storage_sync_test.rb to add tests for TLS connections.
  • Ensure all tests pass for sync and async modes.
  • Actually set the ENV variables and check it works, so far I only tried editing the .3scale_backend.conf file.
  • Add a new parameter and env. variable to set the ca_path ssl param
  • Investigate the consequences of having stopped using a redis and resque forks and started using the upstream gems.
  • Fix ERR AUTH error

About the forks mentioned above, the problem is we were using a forked version of the redis gem before: https://github.com/3scale/redis-rb

That fork included 3 PRs which are not upstream:

Since the fork is stuck in redis-rb 4.1.3 and we need >= 5. Those three PRs are lost in this PR, so we need to investigate what were they for and whether are they still required.

The same happens with resque. We were using a fork before: https://github.com/3scale/resque

The fork included one PR and a few commits not in upstream:

Configure server and client

This requires three steps:

  1. Generate TLS certificates and keys for Redis.
  2. Configure Redis to enable ACL and TLS, and use the generated keys.
  3. Configure Apisoator.
    3.1. TLS Client
    3.2. Mutual TLS Client

Generate keys and certificates:

The first thing we need is a certification authority. For that, we need to create its key and certificate.

  • key: openssl genrsa -out ca-root-key.pem 4096
  • cert: openssl req -new -x509 -days 365 -key ca-root-key.pem -out ca-root-cert.pem

Then we'll have to create a key and certificate for the server, and sign it with our CA. Ensure setting localhost as CN, if that's the domain you're going to use to connect to Redis:

  • key: openssl genrsa -out redis.key 4096
  • cert request: openssl req -new -key redis.key -out redis.csr
  • signed cert: openssl x509 -req -days 365 -in redis.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out redis.crt

Configure Redis:

First, we need a Redis configuration file which enables ACL and TLS, defines the users and sets the certificate the server will use and the CA the server will trust. This is the minimal redis.conf for that purpose:

port 6379
tls-port 26379
tls-cert-file /etc/redis.crt
tls-key-file /etc/redis.key
tls-ca-cert-file /etc/ca-root-cert.pem
tls-auth-clients optional
user default off
user porta on >sup3rS3cre1! ~* &* +@all
user apisonator on >secret#Passw0rd ~* &* +@all

Next step is to launch a container that has access to the certificates, keys and Redis config files created above. We create a volume for each file and modify the container start command to reference the configuration file we created. This is how I configured the redis pod in my docker-compose.yml. But the container can be launched by other ways:

  redis:
    image: redis:6.2-alpine
    container_name: redis-compose-ssl
    ports:
      - "26379:26379"
    volumes:
      - /home/jlledom/redis.conf:/etc/redis.conf:Z
      - /home/jlledom/redis.crt:/etc/redis.crt:z
      - /home/jlledom/redis.key:/etc/redis.key:z
      - /home/jlledom/ca-root-cert.pem:/etc/ca-root-cert.pem:z
    command: [redis-server, /etc/redis.conf]

The :z at the end of the volume definitions is required when using Selinux, for instance in Fedora.

Configure Apisonator:

Apisonator can be configured as a regular TLS client, when only the server needs a certificate, or a Mutual TLS Client, where both client and server must provide a certificate.

As TLS Client:

What I did is create a ~/.3scale-backend.conf, based on openshift/3scale_backend.conf and set these properties:

config.queues.master_name = "rediss://localhost:26379/5"
config.queues.username = "apisonator"
config.queues.password = "secret#Passw0rd"
config.queues.ssl_params = {
  ca_file: "/home/jlledom/ca-root-cert.pem"
}
config.redis.proxy = "rediss://localhost:26379/6"
config.redis.username = "apisonator"
config.redis.password = "secret#Passw0rd"
config.redis.ssl_params = {
  ca_file: "/home/jlledom/ca-root-cert.pem"
}

Note how the ssl_params sections are not always required. We are using them here because the server is using a certificate signed by an unknown authority we've just created, so we need to explicitly tell Porta to trust that authority. If the server were using a certificate signed by any of the well-known CAs, the ssl_params sections could be omitted.

Another possible situation is when the server is using a self-signed certificate. When this happens, there's no trusted CA to add to ssl_params, so the only way to go is to skip certificate validation. Unfortunately we don't support this yet because the redis-client gem still haven't added support for verify modes (see: redis-rb/redis-client#133) so we can't set SSL_VERIFY_NONE to the client. Anyway this is only useful for development purposes, we'll have to use our own CA for development for the moment.

As Mutual TLS Client:

Generate a key and a signed certificate for Apisonator:

  • key: openssl genrsa -out apisonator.key 4096
  • cert request: openssl req -new -key apisonator.key -out apisonator.csr
  • signed cert: openssl x509 -req -days 365 -in apisonator.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out apisonator.crt

Then update the config file to set the next properties:

config.queues.master_name = "rediss://localhost:26379/5"
config.queues.username = "apisonator"
config.queues.password = "secret#Passw0rd"
config.queues.ssl_params = {
  ca_file: "/home/jlledom/ca-root-cert.pem",
  cert: "/home/jlledom/apisonator.crt",
  key: "/home/jlledom/apisonator.key"
}
config.redis.proxy = "rediss://localhost:26379/6"
config.redis.username = "apisonator"
config.redis.password = "secret#Passw0rd"
config.redis.ssl_params = {
  ca_file: "/home/jlledom/ca-root-cert.pem",
  cert: "/home/jlledom/apisonator.crt",
  key: "/home/jlledom/apisonator.key"
}

As mentioned above, the ca_file fields are only required when the server isn't using a certificate signed by one of the well-known CAs.

It could be also configured using the new ENV variables (see docs/configuration.md)

docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
- Applies to: listener, worker, cron.
- Format: path to file as string.

### CONFIG_QUEUES_KEY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if all of these make more sense to have standard file locations. And users would be able to mount secrets at these locations with the file contents. Because these files will still need some way to be mounted into the containers anyway. So it would be double the work. Not sure, just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good to have the variables, since they can be very useful for development: the standard file locations you mention would probably be out of the project folder. What we could do is adding a default in the case they are not provided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I meant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a81a4e9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we have these variables ever empty? Just wondering how it will work. As if empty, I assume empty value may confuse us. Or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some changes to ensure empty variables doesn't reach Redis: c429e71.

Some redis methods that used to return boolean in the old version now
return whatever the redis server returns. To keep using the boolifyed
versions we need to call the same methods but with a `?` suffix.

Also, now the redis client only accepts strings as parameters
@jlledom jlledom force-pushed the THREESCALE-8404-redis-acl-tls branch from bc44398 to b99e307 Compare October 25, 2023 16:09
eguzki
eguzki previously requested changes Oct 31, 2023
Copy link
Member

@eguzki eguzki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review still in progress, first comments

Gemfile Outdated Show resolved Hide resolved
docs/configuration.md Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
@eguzki
Copy link
Member

eguzki commented Nov 14, 2023

For some reason the call to Storage.instance.get(stats_key) in https://github.com/jlledom/apisonator/blob/THREESCALE-8404-redis-acl-tls/spec/integration/worker_async_spec.rb#L63 never returns. It happens after a few calls to it. Usually after second or third call.

It needs to be fixed

Gemfile Outdated Show resolved Hide resolved
The block seemed to be caused by different threads sharing the redis
client.

async-redis calls `Async` to run the commands, and Async uses Fibers
internally, so sharing the redis client between threads led to sharing
fibers across threads, which is unsupported.
@@ -46,124 +48,13 @@ def initialize(opts)
port ||= DEFAULT_PORT

endpoint = Async::IO::Endpoint.tcp(host, port)
@redis_async = Async::Redis::Client.new(
@redis_async = Concurrent::ThreadLocalVar.new{ Async::Redis::Client.new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with such an approach if using proper async is not trivial. But it appears to me that async-redis was built to handle concurrency. So mayeb gove a little more explanation why do we need a separate client per thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

async-redis relies on Fibers to implement concurrency, but concurrency doesn't mean parallelism. What async-redis does, AFAIK, is to switch between fibers on IO operations, e.g. when one fiber is waiting for answer from Redis. But all Fibers run in the same Thread. In Our case, we are calling Storage.instance from different Threads, @redis_async was being shared between threads and that caused an error Fiber called across threads which blocked the test.

@eguzki Does this make sense to you?

@jlledom
Copy link
Contributor Author

jlledom commented Nov 23, 2023

For some reason the call to Storage.instance.get(stats_key) in https://github.com/jlledom/apisonator/blob/THREESCALE-8404-redis-acl-tls/spec/integration/worker_async_spec.rb#L63 never returns. It happens after a few calls to it. Usually after second or third call.

It needs to be fixed

@eguzki This fixed it: c94bd27

What do you think?

@eguzki
Copy link
Member

eguzki commented Nov 23, 2023

You are saying that multiple threads (were) are using the same async redis client. Which threads are those?

@jlledom
Copy link
Contributor Author

jlledom commented Nov 23, 2023

@eguzki

You are saying that multiple threads (were) are using the same async redis client. Which threads are those?

For the failing test, there are three threads accessing the storage in parallel:

  1. The test itself: https://github.com/3scale/apisonator/blob/master/spec/integration/worker_async_spec.rb#L63
  2. The Thread that runs the worker: https://github.com/3scale/apisonator/blob/master/spec/integration/worker_async_spec.rb#L52
  3. The thread consuming Resque jobs: https://github.com/3scale/apisonator/blob/master/lib/3scale/backend/worker_async.rb#L90

@eguzki
Copy link
Member

eguzki commented Nov 23, 2023

here are three threads accessing the storage in par

What changed in your PR to be broken? It makes all the sense that an async client is not shared between threads. Is it shared in (current) master branch? I understand that in the worker, one thread is reading the queue and another one is processing the tasks. It works with one async client shared among threads. If it is not supported, how is it working?

If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.

@jlledom
Copy link
Contributor Author

jlledom commented Nov 23, 2023

What changed in your PR to be broken?

I don't know.

Is it shared in (current) master branch?

Yes

If it is not supported, how is it working?

I don't know, maybe I'm wrong. Ideas?

If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.

The tests are the same as before, if they passed before, they should pass now to ensure the behavior is the same.

@jlledom
Copy link
Contributor Author

jlledom commented Nov 28, 2023

@akostadinov I created this script to prove async_redis doesn't allow sharing the connection between threads:

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem 'concurrent-ruby', '~> 1.1.6'
  gem 'hiredis', '~> 0.6.1'
  gem 'async-redis', '~> 0.7.0'
  gem 'redis', git: 'https://github.com/3scale/redis-rb', branch: 'apisonator'
end

require 'concurrent'
require 'async/io'
require 'async/redis/client'
require 'hiredis'
require 'redis'

def async_thread_isolate
  client = Concurrent::ThreadLocalVar.new do
    endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
    Async::Redis::Client.new(endpoint, limit: 5)
  end

  Async { client.value.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.value.get("random_key") }
  }}.each(&:join)

  puts result
end

def async_thread_share
  endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
  client = Async::Redis::Client.new(endpoint, limit: 5)

  Async { client.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.get("random_key") }
  }}.each(&:join)

  puts result
end

def sync_thread_share
  client = Redis.new({url: 'redis:localhost:22121'})

  client.set("random_key", "a_key")

  result = []
  5.times.map { Thread.new {
    result.push client.get("random_key")
  }}.each(&:join)

  puts result
end

method = ARGV.first || "async_thread_isolate"
send(method)

I run it from inside the apisonator container, like this (I saved the script as tmp/async_tests.rb under the apisonator source folder):

jlledom@fedora apisonator:master=]$ make dev

[ruby@apisonator-dev apisonator]$ bundle exec script/services start

ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb sync_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using bundler 2.3.5
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key

[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_isolate
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using bundler 2.3.5
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using hiredis 0.6.3
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-io 1.38.0 (was 1.34.3)
Using async-pool 0.4.0 (was 0.3.12)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key

[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using bundler 2.3.5
Using fiber-local 1.0.0
Using nio4r 2.6.1 (was 2.5.9)
Using timers 4.3.5
Using protocol-redis 0.6.1
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using fiber-annotation 0.2.0
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
  0.7s     warn: Async::Pool::Controller [oid=0xab4] [ec=0xac8] [pid=263] [2023-11-28 16:19:03 +0000]
               | Closing resource while still in use!
               | {"resource":"#<Async::Redis::Protocol::RESP2::Connection:0x00007fd8d0072ac8>","usage":1}
  0.7s     warn: Async::Task [oid=0xadc] [ec=0xaf0] [pid=263] [2023-11-28 16:19:03 +0000]
               | Task may have ended with unhandled exception.
               |   NoMethodError: undefined method `zero?' for nil:NilClass
               |   → /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:236 in `reuse'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:112 in `release'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `ensure in acquire'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `acquire'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-redis-0.7.0/lib/async/redis/client.rb:119 in `call'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/protocol-redis-0.6.1/lib/protocol/redis/methods/strings.rb:62 in `get'
               |     tmp/async_tests.rb:46 in `block (3 levels) in async_thread_share'
               |     /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-1.31.0/lib/async/task.rb:261 in `block in make_fiber'
a_key
a_key
a_key
a_key

$ruby tmp/async_tests.rb async_thread_share returns a different problem each time you run it.

I haven't found any mentions to thread-safety in the async-redis repo, but async-redis is built on top of async, which according to this issue doesn't allow sharing resources between threads.

It seems logical to me: an Async::Redis::Client instance will have it's own @pool (https://github.com/socketry/async-redis/blob/v0.8.0/lib/async/redis/client.rb#L38), which we are sharing between threads like in the issue I mentioned above.

I still don't have an answer to why does it work in production, where Async::Redis::Client is also shared among threads.

@akostadinov
Copy link
Contributor

I'm inclined to think that the reason not to fail in production is either that we don't really use threaded workers there, or load is not high enough or pure luck. I agree that async framework is not thread safe but with some changes to your test bed. Please bear with my long post.

At this part there is a small issue with the test bed.

  5.times.map { Thread.new {
    Async { result.push client.get("random_key") }
  }}.each(&:join)

  puts result

We wait for each thread but don't wait for each Async task. So Threads can die and the program may exit before all async tasks have completed. And we may (or may not) see the error message you posted above.

If we add a #wait call to the async task, then things always work. I created a gist so I'm able to link diffs and code lines. Here's the modification I'm talking about:
https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions?diff=unified&w=#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544fR48

But a concurrency of 5 is not very stressing. So I changed it a little to create 55 async tasks in 55 threads, then wait for all threads and then for all async tasks before printing the results. Like this:

  55.times.map { Thread.new {
    Async { result.push client.get("random_key") }
  }}.map(&:value).map(&:wait)

  puts result

See https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544f

And the result was quiet telling. First of all I saw in console several messages:

               | Task may have ended with unhandled exception.
               |   FiberError: fiber called across threads

And additionally the process has hanged indefinitely until I interrupted it with Ctrl+C. And btw if I run the same code with 5 threads/tasks, then it completes alright.

Needless to say, the isolated client per thread method worked without issue, see https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544f

Unfortunately I'm looking at the async framework for the first time but it appears to me that it is a known fact that these async tasks are not thread safe. And tasks and pools shouldn't be shared across threads. While looking at your PR I spotted an async project issue where the resolution was:

the pool created in one thread may have been memoized and used from another thread. I removed the memoization and it fixed the issue

This reads to me as this is a known design limitation. So overall the taken approach by you - create an async client per thread and avoid sneaky modifications of instance variables, are the right (if not the only) way to go forward. Thank you for doing a great detective job and for the nice test bed for thread safety.

An interesting question still is, why test in the master branch didn't fail. So I think it is worth spending some more time checking whether the updated code is fully sound. Or is it just a difference of the generated instructions? We can try increasing the concurrency of the original test in master and see if it's gonna fail at some point.

P.S. I'm running the script locally, just using Ruby 3.0.4, it is reproducible like that just fine at least for me.

@jlledom
Copy link
Contributor Author

jlledom commented Nov 29, 2023

@akostadinov Thanks for your research. In fact, calling .wait for tasks makes all tests always pass with 5 threads, but only the async_thread_shared test fail with 55 threads.

About why it doesn't fail in master, I have a couple of theories.

In worker_async_spec.rb (the test failing) we have the worker running in a separated thread, but that doesn't represent production. In production the worker runs in its own process. That process has two threads, one to read from the queue (code), another one to process the jobs (code). The thread processing the jobs connects to redis through the shared instance (code). But the thread reading from the queue access redis using the JobFetcher redis instance (code) which is initialized once during the startup (code), and it creates a new instance (code) instead of taking the shared one. Observe the comment over redis_client (code). So basically there are not shared client among threads in the worker, the fetcher has it's own instance and the processor it's taking the shared one and not sharing it with anybody else because there are no more threads.

About the listener process, I'm not sure how multithreaded is it really, I guess Puma creates a thread for every request, but as proven in the test script we wrote, it can cope fine with a certain amount of parallelism without failing, maybe we never reach that workload in production. On the other hand, there's that problem about all threads sending their commands to the pipeline whenever any thread opens any pipeline anywhere, and those commands are eventually sent to redis one after another from a single thread when the pipeline closes. That would contribute to reduce the number of situations where there are multiple threads actually accessing async_redis at the same time.

@jlledom jlledom requested a review from mayorova May 7, 2024 07:31
end

# Authenticated RESP2 if credentials are provided, RESP2 otherwise
def make_redis_protocol(opts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code just new custom development, or copy from upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's custom. The async-redis client is so simple we need to implement the most basic things like sending the credentials to the redis server when it requires authentication. We also have to handle the SSL connections ourselves and. All private methods in the Client class are there to deal with these things.

The code it's base in the examples from the repo:

https://github.com/socketry/async-redis/blob/main/examples/auth/protocol.rb
https://github.com/socketry/async-redis/tree/main/guides/getting-started

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that's sad that it doesn't even do the basic stuff...
Thanks for the info!

private_constant :METHODS_TO_BOOLIFY

METHODS_TO_BOOLIFY.each do |method|
command = method.to_s.delete('?')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the story behind all the methods that got an extra ? ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, probably because of

# 2) Methods that need to be "boolified". These are methods for which
# redis-rb returns a boolean, but redis just returns an integer.
# For example, Redis returns 0 or 1 for the EXISTS command, but
# redis-rb transforms that into a boolean.
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aisonator uses the API from the redis-rb gem, which we use in sync mode. In order to work in both sync and async modes, our custom async mode replicates the redis-rb API. The Methods class implement this.

The API changed in redis-rb 5, now they addded the methods :exists? :sadd? and :srem? https://github.com/redis/redis-rb/blob/master/CHANGELOG.md#500. We have to adapt our API to reflect that.

For the commands coming with an ending ? we need to remove it to get the actual command to send to Redis.

#
# These are the different cases:
# 1) Methods that can be called directly. For example SET:
# @redis_async.call('SET', some_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# @redis_async.call('SET', some_key)
# call('SET', some_key)

Not sure if that's correct, but assuming that there is no mention of @redis_async in this file, and the calls are "delegated" to @redis_async anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I copied that comment from the old file and didn't check it.

@@ -277,7 +277,7 @@ def persist_attribute(attribute, value, ignore_nils = false)
def persist_sets
storage.sadd storage_key_by_provider(:ids), id
storage.sadd encode_key("services_set"), id
storage.sadd encode_key("provider_keys_set"), provider_key
storage.sadd encode_key("provider_keys_set"), provider_key unless provider_key.nil?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can provider key be nil? Should we conditionally add or should we raise an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied at #384 (comment)

@@ -85,6 +85,8 @@ def try_pop_from_queue(max=nil)
def wait_pop_from_queue
queue, encoded = @redis.blpop(*@queues, timeout: @fetch_timeout)
[queue, [encoded]] if encoded
rescue RedisClient::ReadTimeoutError => _e
# Do nothing, we expect frequent timeouts from the sync client when the queue is empty
Copy link
Contributor

@akostadinov akostadinov May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, when @fetch_timeout is reached, the method will return nil, not raise. A read timeout on the other hand is a low-level TCP timeout that shouldn't happen regularly.

On the other hand, we don't need to be very much bothered by a connection timeout, if we can just reconnect. It's good to be resilient in case of network glitches.

I would suggest though to log the error, can be without a trace, just dome text like:

Que redis (ip/hostname of server if possible) connection timeout at #{FILE}:#{LINE}

In this way we can have an indication of network stability and see whether particular servers or locations are more problematic. This will help debugging and improving the cluster setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied at #384 (comment)

@jlledom
Copy link
Contributor Author

jlledom commented May 27, 2024

Closing this PR. I created #391 to make this easier to review:

@jlledom jlledom closed this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants