Internal authentication of the cluster failed via Pegasus 2.4 #2114

Open
ninsmiracle opened this issue Sep 12, 2024 · 3 comments

Comments

@ninsmiracle
Contributor

General Question

When I enable the Pegasus 2.4 access controller, I can reach the target cluster with pegasus-shell, but all RPCs between the nodes inside the cluster fail.

  1. I use pegasus_prc/[email protected] as my server principal.

Here are the principals in the keytab used by the target cluster (to check that the keytab file is consistent with the principal):

[work@xxxxxxxx pegasus]$ klist -k [email protected]
Keytab name: FILE:[email protected]
KVNO Principal
---- --------------------------------------------------------------------------
   1 pegasus_prc/[email protected]
   1 pegasus_prc/[email protected]
  2. Here is the config in the target cluster's config.ini:
[security]
  enable_acl = false
  super_users = u_guoningshen
  service_name = pegasus_prc
  service_fqdn = pegasus
  sasl_plugin_path = /usr/lib64/sasl2
  krb5_keytab = /home/work/app/pegasus/[email protected]
  krb5_config = /home/work/app/pegasus/krb5.conf
  krb5_principal = pegasus_prc/[email protected]
  mandatory_auth = false
  enable_auth = true
  3. Here is my pegasus-shell ini file, which I use to access the target cluster:
[apps..default]
run = true
count = 1

[apps.mimic]
type = dsn.app.mimic
arguments =
pools = THREAD_POOL_DEFAULT,THREAD_POOL_META_SERVER
run = true
count = 1

[core]
tool = nativerun
pause_on_start = false

logging_start_level = LOG_LEVEL_DEBUG
logging_factory_name = dsn::tools::simple_logger
logging_flush_on_exit = false

enable_default_app_mimic = true

data_dir = ./pegasus_shell.data

[tools.simple_logger]
short_header = false
fast_flush = true
max_number_of_log_files_on_disk = 10
stderr_start_level = LOG_LEVEL_FATAL

[tools.simulator]
random_seed = 0

[network]
io_service_worker_count = 4

[threadpool..default]
worker_count = 4
partitioned = false
worker_priority = THREAD_xPRIORITY_NORMAL

[threadpool.THREAD_POOL_DEFAULT]
name = default
worker_count = 20

[threadpool.THREAD_POOL_META_SERVER]
name = meta_server

[task..default]
is_trace = false
is_profile = false
allow_inline = false
rpc_call_header_format = NET_HDR_DSN
rpc_call_channel = RPC_CHANNEL_TCP
rpc_timeout_milliseconds = 10000


[pegasus.clusters]
c4tst-function2 = 10.xxx.xx.1:32601,10.xxx.xx.2:32601

[security]
enable_auth = true
krb5_keytab = /home/work/2.4.4_pegasus/pegasus/u_guoningshen.keytab
krb5_config = /etc/krb5.conf
krb5_principal = [email protected]
sasl_plugin_path = /home/work/2.4.4_pegasus/pegasus/thirdparty/output/lib/sasl2
service_fqdn = pegasus
service_name = pegasus_prc
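One quick sanity check on the two configs above is that both sides agree on `service_name` and `service_fqdn`. The sketch below is only an illustration: it assumes the usual SASL/GSSAPI convention that the client requests a ticket for `service_name/service_fqdn`, which this thread does not confirm for Pegasus. The embedded ini text is a trimmed copy of the two `[security]` sections, and the function names are hypothetical.

```python
import configparser

# Trimmed copies of the two [security] sections shown above.
SERVER_INI = """
[security]
enable_auth = true
mandatory_auth = false
service_name = pegasus_prc
service_fqdn = pegasus
"""

SHELL_INI = """
[security]
enable_auth = true
service_name = pegasus_prc
service_fqdn = pegasus
"""


def security_section(text):
    """Return the [security] section of an ini document as a plain dict."""
    # interpolation=None so that '%' characters in real configs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["security"])


def compare_service_identity(server, client):
    """List which of service_name/service_fqdn differ between the two sides."""
    keys = ("service_name", "service_fqdn")
    return [k for k in keys if server.get(k) != client.get(k)]


server = security_section(SERVER_INI)
shell = security_section(SHELL_INI)
mismatches = compare_service_identity(server, shell)

# Assumption: the SASL client asks for a ticket to service_name/service_fqdn.
target = f"{shell['service_name']}/{shell['service_fqdn']}"
print("mismatched keys:", mismatches)
print("service principal requested by the client (realm omitted):", target)
```

With the values from this issue the two sides agree, so a mismatch in these two keys is probably not what breaks the node-to-node negotiation.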
  4. What happened?
  • Connected to cluster via pegasus-shell
./run.sh shell -c ker.ini
  • u_guoningshen is a super user of the cluster, so I have full permissions.
The cluster name is: c4tst-function2
The cluster meta list is: 10.xxx.xx.1:32601,10.xxx.xx.2:32601
>>> ls
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
238     AVAILABLE  test      pegasus   4                3              true         2024-09-11_07:30:20  -          -            0           
239     AVAILABLE  gns       pegasus   4                3              true         2024-09-12_02:30:50  -          -            0           

[summary]
total_app_count  : 2

>>> drop gns
reserve_seconds = 0
drop app gns succeed

>>> ls
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
238     AVAILABLE  test      pegasus   4                3              true         2024-09-11_07:30:20  -          -            0           

[summary]
total_app_count  : 1

>>> 
  • But I cannot create a table, because every RPC sent from the master meta server to the other nodes fails negotiation with err = ERR_UNKNOWN, msg = ERR_UNKNOWN
>>> create gns_test
create app gns_test succeed, waiting for app ready
gns_test not ready yet, still waiting... (0/4)
gns_test not ready yet, still waiting... (0/4)
gns_test not ready yet, still waiting... (0/4)
gns_test not ready yet, still waiting... (0/4)
...
@ninsmiracle
Contributor Author

When I check the app status:

[replicas]
pidx  ballot  replica_count  primary  secondaries  
0     0       0/3            -        []           
1     0       0/3            -        []           
2     0       0/3            -        []           
3     0       0/3            -        [] 

We can see that not even the primary has been established. So I checked the replica server log for this replica:

D2024-09-11 15:32:08.284 (1726039928284328485 93977) replica.io-thrd.93977: server_negotiation.cpp:40:start(): SERVER_NEGOTIATION(CLIENT=10.xxx.xx.1:41689): start negotiation
D2024-09-11 15:32:08.284 (1726039928284356416 93977) replica.io-thrd.93977: network.cpp:696:on_server_session_accepted(): server session accepted, remote_client = 10.xxx.xx.1:41689, current_count = 1
D2024-09-11 15:32:08.284 (1726039928284364187 93977) replica.io-thrd.93977: network.cpp:701:on_server_session_accepted(): ip session inserted, remote_client = 10.xxx.xx.1:41689, current_count = 1
W2024-09-11 15:32:08.289 (1726039928289020684 93993) replica.default10.04006f170001004b: server_negotiation.cpp:137:do_challenge(): SERVER_NEGOTIATION(CLIENT=10.xxx.xx.1:41689): negotiation failed, with err = ERR_UNKNOWN, msg = ERR_UNKNOWN
D2024-09-11 15:32:08.289 (1726039928289039571 93993) replica.default10.04006f170001004b: network.cpp:738:on_server_session_disconnected(): session 10.xxx.xx1:41689 disconnected, the total client sessions count remains 0
D2024-09-11 15:32:08.289 (1726039928289046389 93993) replica.default10.04006f170001004b: network.cpp:744:on_server_session_disconnected(): client ip 10.xxx.xx.1:41689 has no more session to this server
E2024-09-11 15:32:08.289 (1726039928289069504 93975) replica.io-thrd.93975: asio_rpc_session.cpp:96:operator()(): asio read from 10.xxx.xx.1:41689 failed: Operation canceled

I think the key message is server_negotiation.cpp:137:do_challenge(): SERVER_NEGOTIATION(CLIENT=10.xxx.xx.1:41689): negotiation failed, with err = ERR_UNKNOWN, msg = ERR_UNKNOWN
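Since the only visible symptom is this one warning line, it can help to pull every negotiation failure out of a replica log and count them per client host. The following is a rough sketch, assuming the log line layout shown above; the regular expression and function name are illustrative and not part of Pegasus.

```python
import re
from collections import Counter

# Matches lines such as the W-level do_challenge() warning quoted above.
FAILURE = re.compile(
    r"^(?P<level>[DIWEF])(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?"
    r"SERVER_NEGOTIATION\(CLIENT=(?P<client>[^)]+)\): negotiation failed, "
    r"with err = (?P<err>\w+), msg = (?P<msg>.*)$"
)

# Two lines copied from the replica log in this issue.
SAMPLE = """
D2024-09-11 15:32:08.284 (1726039928284328485 93977) replica.io-thrd.93977: server_negotiation.cpp:40:start(): SERVER_NEGOTIATION(CLIENT=10.xxx.xx.1:41689): start negotiation
W2024-09-11 15:32:08.289 (1726039928289020684 93993) replica.default10.04006f170001004b: server_negotiation.cpp:137:do_challenge(): SERVER_NEGOTIATION(CLIENT=10.xxx.xx.1:41689): negotiation failed, with err = ERR_UNKNOWN, msg = ERR_UNKNOWN
"""


def negotiation_failures(lines):
    """Yield one dict per 'negotiation failed' line."""
    for line in lines:
        match = FAILURE.match(line.strip())
        if match:
            yield match.groupdict()


failures = list(negotiation_failures(SAMPLE.splitlines()))
# Group by client host (the part before the ephemeral port).
by_host = Counter(f["client"].rsplit(":", 1)[0] for f in failures)

for host, count in by_host.items():
    print(host, count)
```

Run against the full log files of every replica and meta server, a count like this would show whether the failures come only from the master meta server or from every peer.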

In my opinion, it failed at the SASL_INITIATE step of the SASL negotiation between the meta server and the replica server.

       client                              server
        | ---    SASL_LIST_MECHANISMS     --> |
        | <--  SASL_LIST_MECHANISMS_RESP  --- |
        | ---    SASL_SELECT_MECHANISMS   --> |
        | <-- SASL_SELECT_MECHANISMS_RESP --- |
        |                                     |
        | ---       SASL_INITIATE         --> |
        |                                     |
        | <--       SASL_CHALLENGE        --- |
        | ---     SASL_CHALLENGE_RESP     --> |
        |                                     |
        |               .....                 |
        |                                     |
        | <--       SASL_CHALLENGE        --- |
        | ---     SASL_CHALLENGE_RESP     --> |
        |                                     |
        |                                     |
        | <--         SASL_SUCC           --- |
        |                                     |
        |                                     |
        | ---         RPC_CALL           ---> |
        | <--         RPC_RESP           ---- |


@ninsmiracle
Contributor Author

I also checked the KDC log:

Sep 11 15:26:24 c4-hadoop-krb02.bj krb5kdc[59974](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: NEEDED_PREAUTH: pegasus_prc/[email protected] for krbtgt/[email protected], Additional pre-authentication required
Sep 11 15:26:24 c4-hadoop-krb02.bj krb5kdc[59974](info): TGS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: ISSUE: authtime 1726039584, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for pegasus_prc/[email protected]
Sep 11 15:26:38 c4-hadoop-krb02.bj krb5kdc[59974](info): TGS_REQ (4 etypes {18 17 16 23}) 10.132.5.3: ISSUE: authtime 1726039598, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for pegasus_prc/[email protected]
Sep 11 15:26:47 c4-hadoop-krb02.bj krb5kdc[59974](info): TGS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: ISSUE: authtime 1726039567, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for pegasus_prc/[email protected]
Sep 11 15:26:47 c4-hadoop-krb02.bj krb5kdc[59974](info): TGS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: ISSUE: authtime 1726039567, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for pegasus_prc/[email protected]
Sep 11 15:26:48 c4-hadoop-krb02.bj krb5kdc[59974](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: ISSUE: authtime 1726039608, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for krbtgt/[email protected]
Sep 11 15:27:48 c4-hadoop-krb02.bj krb5kdc[59974](info): TGS_REQ (6 etypes {18 17 16 23 25 26}) 10.XXX.XX.1: ISSUE: authtime 1726039608, etypes {rep=18 tkt=18 ses=18}, pegasus_prc/[email protected] for pegasus_prc/[email protected]

I can see AS_REQ and TGS_REQ entries with ISSUE here, so I think the client (here the meta server, which acts as the SASL client to pass the permission check on the replica servers) can get its TGT and service tickets from the KDC.
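The KDC lines can also be summarised mechanically, listing which client principal was issued a ticket for which service principal. The sketch below assumes the krb5kdc line layout quoted above; the sample lines use made-up placeholder principals (`svc/host@EXAMPLE.COM`) and a made-up host name, because the real ones are redacted in this thread.

```python
import re

KDC_LINE = re.compile(
    r"krb5kdc\[\d+\]\(info\): (?P<req>AS_REQ|TGS_REQ) \(.*?\) (?P<addr>\S+): "
    r"(?P<status>[A-Z_]+): (?:authtime \d+, etypes \{[^}]*\}, )?"
    r"(?P<client>\S+) for (?P<service>[^\s,]+)"
)

# Hypothetical sample lines in the same layout as the KDC log above.
SAMPLE = """\
Sep 11 15:26:24 kdc.example krb5kdc[59974](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.0.0.1: NEEDED_PREAUTH: svc/host@EXAMPLE.COM for krbtgt/EXAMPLE.COM@EXAMPLE.COM, Additional pre-authentication required
Sep 11 15:26:48 kdc.example krb5kdc[59974](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.0.0.1: ISSUE: authtime 1726039608, etypes {rep=18 tkt=18 ses=18}, svc/host@EXAMPLE.COM for krbtgt/EXAMPLE.COM@EXAMPLE.COM
Sep 11 15:27:48 kdc.example krb5kdc[59974](info): TGS_REQ (6 etypes {18 17 16 23 25 26}) 10.0.0.1: ISSUE: authtime 1726039608, etypes {rep=18 tkt=18 ses=18}, svc/host@EXAMPLE.COM for svc/host@EXAMPLE.COM
"""


def parse_kdc_log(text):
    """Return one dict per AS_REQ/TGS_REQ line found in the text."""
    return [m.groupdict() for m in map(KDC_LINE.search, text.splitlines()) if m]


records = parse_kdc_log(SAMPLE)
issued = [r for r in records if r["status"] == "ISSUE"]
# Service tickets where a principal asked for a ticket to itself.
self_tickets = [
    r for r in issued
    if r["req"] == "TGS_REQ" and r["client"] == r["service"]
]

for r in issued:
    print(r["req"], r["client"], "->", r["service"])
print("TGS_REQ where client == service:", len(self_tickets))
```

A summary like this only shows that the KDC issued the tickets; it says nothing about whether the accepting side could use them, so a clean KDC log does not rule out a failure on the replica server.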

I also checked the backup meta server, and it shows the same error:

server_negotiation.cpp:137:do_challenge(): SERVER_NEGOTIATION(CLIENT=10.XXX.XX.1:55545): negotiation failed, with err = ERR_UNKNOWN, msg = ERR_UNKNOWN

@ninsmiracle
Contributor Author

I also tried the following. When I changed the principal used by pegasus-shell to match the server's, pegasus-shell could not connect to the cluster at all: entering the ls command returns ERR_TIMEOUT. From what I found online, when SASL uses Kerberos, the two ends of a SASL connection cannot be configured with the same principal.

So I am confused. If that is the case, internal authentication will always fail, because all nodes in the cluster have to use the same principal: we cannot assign a separate principal to each node.

At the same time, I noticed that the code supports mandatory_auth = false, and from the code logic it seems that authentication between nodes can be skipped with this setting. But I don't know why it didn't work.
There is too little log information in various places, and the only error reported, ERR_UNKNOWN, is not very useful, so I am stuck...
