[Draft] Add ErrorHandling for LdkServer APIs #19

G8XSU · 2024-10-30T21:06:49Z

This is a draft PR; please ignore the missing docs, field naming, etc.
Only the last 2 commits are of importance.

Adds proto definitions for API errors.
- When the HttpStatusCode is not OK (200), the response content contains a serialized ErrorResponse.
Adds an error struct, LdkServerError.
- It is mainly used as the error struct returned from the internal layers of the ldk-server implementation.
- It will be converted to proto::error::ErrorResponse at the top-level service layer.
- It is intentionally kept flat instead of nesting fields inside enum variants, so that we can reuse something similar in ldk-server-client. A flat structure without nested enums/strings might be helpful in case we want to generate ldk-server-client bindings later.
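
For readers following along, the flat shape described above could look roughly like the sketch below. It is not the code from this PR: the variant names, the numeric codes, the `ErrorResponse` stand-in (a hand-written placeholder for the generated proto type), and the HTTP status mapping are all assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical flat error code: plain variants, no data nested inside them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LdkServerErrorCode {
    InvalidRequest,
    Auth,
    Lightning,
    InternalServer,
}

/// Flat error struct returned from the internal layers (illustrative fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LdkServerError {
    pub message: String,
    pub error_code: LdkServerErrorCode,
}

impl fmt::Display for LdkServerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}: {}", self.error_code, self.message)
    }
}

impl std::error::Error for LdkServerError {}

/// Hand-written stand-in for the generated proto ErrorResponse (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ErrorResponse {
    pub message: String,
    pub error_code: i32,
}

/// Conversion done once, at the top-level service layer.
impl From<LdkServerError> for ErrorResponse {
    fn from(err: LdkServerError) -> Self {
        // Numeric values are placeholders, not the values from error.proto.
        let error_code = match err.error_code {
            LdkServerErrorCode::InvalidRequest => 1,
            LdkServerErrorCode::Auth => 2,
            LdkServerErrorCode::Lightning => 3,
            LdkServerErrorCode::InternalServer => 4,
        };
        ErrorResponse { message: err.message, error_code }
    }
}

/// Assumed mapping from error code to a non-200 HTTP status.
pub fn http_status(code: LdkServerErrorCode) -> u16 {
    match code {
        LdkServerErrorCode::InvalidRequest => 400,
        LdkServerErrorCode::Auth => 401,
        LdkServerErrorCode::Lightning | LdkServerErrorCode::InternalServer => 500,
    }
}
```

Because every variant is a plain unit variant, the same struct could plausibly be mirrored in a client crate without any nested payloads to translate.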

jkczyz · 2024-10-30T22:00:53Z

protos/src/proto/error.proto

+    // Used when an internal server error occurred, client is probably at no fault and can safely retry
+    // this error with exponential backoff.
+    InternalServerError internal_server_error = 6;


Usually there is something like "deadline exceeded" if the client should retry, whereas an internal error means something is seriously wrong?

Yes, internal_server_error usually means there is something wrong with the server implementation:
an error that shouldn't have happened at all or is unexpected.
It could be a dependency outage or a service outage, and it is similar to a 5xx error, hence it needs to be retried.
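
As a rough illustration of the client-side behaviour discussed here, the sketch below retries only on an internal server error and doubles the delay each time. The `ApiError` enum, the function names, and the injected `sleep` callback are hypothetical; nothing here is taken from ldk-server-client.

```rust
use std::time::Duration;

/// Hypothetical client-side view of the error kinds (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApiError {
    InternalServerError,
    InvalidRequest,
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Runs `op` up to `max_attempts` times, retrying only on InternalServerError.
/// `sleep` is injected so callers can use a blocking sleep, and tests can record delays.
pub fn retry_with_backoff<T>(
    max_attempts: u32,
    base: Duration,
    max_delay: Duration,
    mut op: impl FnMut() -> Result<T, ApiError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, ApiError> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(ApiError::InternalServerError) if attempt + 1 < max_attempts => {
                sleep(backoff_delay(attempt, base, max_delay));
                attempt += 1;
            }
            // Non-retryable errors, or retries exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

A real client would likely pass `std::thread::sleep` (or an async equivalent) as `sleep` and might add jitter; both are left out to keep the sketch small.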

jkczyz · 2024-10-30T22:12:24Z

protos/src/proto/error.proto

+  OPERATION_FAILED = 1;
+
+  // There was a timeout during the requested operation.
+  OPERATION_TIMED_OUT = 2;


When would we see this over InternalServerError?

There were a bunch of "Timeout" errors in ldk-node, but I think many of them might not be applicable here.
I was going to merge them into OPERATION_TIMED_OUT.
But it is fair that we could merge them into OPERATION_FAILED instead, as both might need to be retried, and from the client's perspective the distinction might not be that important.

jkczyz · 2024-10-30T22:13:50Z

protos/src/proto/error.proto

+  OPERATION_TIMED_OUT = 2;
+
+  // Sending a payment has failed.
+  PAYMENT_SENDING_FAILED = 3;


Why not use OPERATION_FAILED?

Yes, normally we could just use that, but over time I assume we would want to expose more out of "https://github.com/lightningdevkit/rust-lightning/blob/main/lightning/src/ln/outbound_payment.rs#L446-L455"

But I think for now I could reuse operation_failed.

jkczyz · 2024-10-30T22:39:27Z

protos/src/proto/error.proto

+message LightningError {
+  LightningErrorCode lightning_error_code = 1;
+}


Seems like we should have this use a oneof containing an enum for each possible operation. Then each enum would be defined specifically for the operation. Otherwise, we are left with values for LightningErrorCode that aren't relevant for the given operation, which results in the user needing to know which values are possible for said operation.

Alternatively, we could make a dedicated ErrorResponse for each operation with the same structure, only using an operation-specific enum here. This would result in a lot of duplication, for the non-LightningError part, though.

oneof containing an enum for each possible operation. Then each enum would be defined specifically for the operation.

I tried a couple of combinations of oneofs and enums.
I haven't tried this one, but my general learning has been that oneofs introduce high complexity when combined with other nested forms.
If we do this, it will be struct->struct->oneof->enum, which might become unwieldy, but I can try.
Also, note that modelling these in protobuf is just part of the problem, we don't have a good way to represent these structures in rust enums. (which are binding safe.)
Another problem with oneofs is when a new field is introduced, they are treated as unset, which might be unexpected.

Otherwise, we are left with values for LightningErrorCode that aren't relevant for the given operation, which results in the user needing to know which values are possible for said operation.

I 100% agree with this concern, as I have been trying to address it with an error model, but there isn't one right solution here. Not only this, but a user has to handle/understand all or at least most error variants for error handling. This was the main motivation for going with sub_error_code or nesting within lightning_error_code, so that the user can choose to just treat lightning_error as another retryable error for most operations.

Another possibility is to introduce List<String> error_tags in ErrorResponse (similar to this),
which would contain tags specific to the error, according to the API being used. We could have enums such as
GenericTags, PaymentSendErrorTags, etc., and for every API we can document which tags it can use.
It might look over-complicated, but I think it might be simpler than having API-level structs, since most of the time the error is generic and not API-specific, and the client can consume only the tags it is interested in. (Also bindings-safe.)
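
To make the tag idea concrete, here is a minimal sketch of how a client might pick out only the tags it understands. The `ErrorResponse` shape, the `PaymentSendErrorTag` variants, and the tag strings are hypothetical; the thread does not define any of them.

```rust
/// Hand-written stand-in for an ErrorResponse carrying free-form tags (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ErrorResponse {
    pub message: String,
    pub error_tags: Vec<String>,
}

/// Hypothetical tags documented for a payment-sending API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PaymentSendErrorTag {
    RouteNotFound,
    DuplicatePayment,
}

impl PaymentSendErrorTag {
    /// Wire representation of the tag (illustrative strings).
    pub fn as_str(self) -> &'static str {
        match self {
            PaymentSendErrorTag::RouteNotFound => "ROUTE_NOT_FOUND",
            PaymentSendErrorTag::DuplicatePayment => "DUPLICATE_PAYMENT",
        }
    }

    /// Returns None for tags this enum does not know about.
    pub fn parse(tag: &str) -> Option<Self> {
        match tag {
            "ROUTE_NOT_FOUND" => Some(PaymentSendErrorTag::RouteNotFound),
            "DUPLICATE_PAYMENT" => Some(PaymentSendErrorTag::DuplicatePayment),
            _ => None,
        }
    }
}

/// Keeps only the tags the caller cares about; generic or unknown tags are skipped.
pub fn payment_send_tags(resp: &ErrorResponse) -> Vec<PaymentSendErrorTag> {
    resp.error_tags
        .iter()
        .filter_map(|tag| PaymentSendErrorTag::parse(tag))
        .collect()
}
```

Unknown tags are silently ignored in this sketch, which is what would let a server add new tags without breaking older clients.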

we could make a dedicated ErrorResponse for each operation

Yes, ideally we would have dedicated API-level error structs, but neither ldk nor ldk-node has API-level error structs, so if we do this, there is still a possibility that the server gets an error it didn't expect for a particular operation, and for that it will need "OTHER_ERROR" in each of those structs as a fallback.
Moreover, such API-level error structs are mostly not that different from each other. In reality, there are very few places where we need such granularity.

I tried a couple of combinations of oneofs and enums. I haven't tried this one, but my general learning has been that oneofs introduce high complexity when combined with other nested forms. If we do this, it will be struct->struct->oneof->enum, which might become unwieldy, but I can try.

Currently, it is:

message -> oneof -> message -> enum

which in rust is represented as:

struct -> enum -> struct -> enum

Though, can't we replace LightningError with LightningErrorCode in the oneof to get rid of the second struct? i.e., the enum representing the oneof would have one variant containing the LightningErrorCode enum, while the other variants would contain the marker structs.

message -> oneof -> message|enum

Or in rust:

struct -> enum -> struct|enum

And what I suggested was:

message -> oneof -> message -> oneof -> enum

Or in rust:

struct -> enum -> struct -> enum -> enum

Alternatively, we can combine both oneofs by removing LightningError entirely and adding a field to the oneof for each operation. This gives:

message -> oneof -> message|enum

Or in rust:

struct -> enum -> struct|enum
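
The combined shape (struct -> enum -> struct|enum) might look roughly like this in Rust. These are hand-written types that only approximate what a proto code generator would emit; the per-operation enums and their variants are hypothetical and do not come from this PR.

```rust
/// Marker struct for a error kind that carries no further detail.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InternalServerError {}

/// Hypothetical operation-specific error codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendPaymentErrorCode {
    RouteNotFound,
    DuplicatePayment,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpenChannelErrorCode {
    InsufficientFunds,
    PeerDisconnected,
}

/// Rust enum standing in for the single oneof: each variant holds either a
/// marker struct or an operation-specific enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Error {
    InternalServer(InternalServerError),
    SendPayment(SendPaymentErrorCode),
    OpenChannel(OpenChannelErrorCode),
}

/// Top-level message; `error` is None when the oneof is unset.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ErrorResponse {
    pub message: String,
    pub error: Option<Error>,
}

/// A caller only needs to match on the variants relevant to its operation.
pub fn is_route_failure(resp: &ErrorResponse) -> bool {
    matches!(
        resp.error,
        Some(Error::SendPayment(SendPaymentErrorCode::RouteNotFound))
    )
}
```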

Also, note that modelling these in protobuf is just part of the problem, we don't have a good way to represent these structures in rust enums. (which are binding safe.)

Could you explain what you mean by this? Some restriction on rust enums that make them bindings-safe?

Another problem with oneofs is when a new field is introduced, they are treated as unset, which might be unexpected.

Not sure what you mean here. Do you mean adding a new field to the oneof? Internally they are just optional fields where only one can be set and where the rust representation uses an enum, IIUC. Or do you mean the case where the server and client versions of the proto are not in sync?

I 100% agree with this concern, as I have been trying to address it with an error model, but there isn't one right solution here. Not only this, but a user has to handle/understand all or at least most error variants for error handling. This was the main motivation for going with sub_error_code or nesting within lightning_error_code, so that the user can choose to just treat lightning_error as another retryable error for most operations.

Should we simply have a retryable and non-retryable mapping then? The actual error would still be conveyed via the message field.

Another possibility is to introduce List<String> error_tags in ErrorResponse (similar to this), which would contain tags specific to the error, according to the API being used. We could have enums such as GenericTags, PaymentSendErrorTags, etc., and for every API we can document which tags it can use. It might look over-complicated, but I think it might be simpler than having API-level structs, since most of the time the error is generic and not API-specific, and the client can consume only the tags it is interested in. (Also bindings-safe.)

FWIW, this is a bit more type safe, but not sure if I'd consider it just yet.

https://github.com/googleapis/googleapis/blob/c7ce97ebdeb85009fed49b1256586dbd3867adc6/google/rpc/status.proto#L48

Yes, ideally we would have dedicated API-level error structs, but neither ldk nor ldk-node has API-level error structs, so if we do this, there is still a possibility that the server gets an error it didn't expect for a particular operation, and for that it will need "OTHER_ERROR" in each of those structs as a fallback. Moreover, such API-level error structs are mostly not that different from each other. In reality, there are very few places where we need such granularity.

Could you remind me why we decided to not expose ldk_node::NodeError? Just want to wrap my head around what level of granularity we should expose. Seems like we have a few options:

None (i.e., LightningError with no oneof)

Retryable / Non-retryable

Some mapping from ldk_node::NodeError to a smaller set (i.e., like this PR)

Subset per operation (i.e., what I originally suggested, but can be fragile as you pointed out)

All (i.e., create a proto enum matching ldk_node::NodeError)

In all cases, ldk_node::NodeError::to_string would populate message.
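
The "mapping to a smaller set" option, with the message populated from `to_string`, could be sketched as follows. `NodeError` here is a local stand-in enum, not the real `ldk_node::NodeError`; its variants and display strings are illustrative, and the grouping into codes is an assumption rather than what this PR implements.

```rust
use std::fmt;

/// Local stand-in for a handful of node-level errors (not the real ldk_node type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeError {
    NotRunning,
    ConnectionFailed,
    PaymentSendingFailed,
    WalletOperationTimeout,
}

impl fmt::Display for NodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative messages only.
        let text = match self {
            NodeError::NotRunning => "Node is not running.",
            NodeError::ConnectionFailed => "Network connection failed.",
            NodeError::PaymentSendingFailed => "Failed to send the payment.",
            NodeError::WalletOperationTimeout => "A wallet operation timed out.",
        };
        f.write_str(text)
    }
}

/// Smaller set exposed over the API, mirroring the codes quoted in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LightningErrorCode {
    OperationFailed,
    OperationTimedOut,
    PaymentSendingFailed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LightningError {
    pub message: String,
    pub code: LightningErrorCode,
}

impl From<NodeError> for LightningError {
    fn from(err: NodeError) -> Self {
        // Many node-level errors collapse into one coarse code.
        let code = match err {
            NodeError::WalletOperationTimeout => LightningErrorCode::OperationTimedOut,
            NodeError::PaymentSendingFailed => LightningErrorCode::PaymentSendingFailed,
            NodeError::NotRunning | NodeError::ConnectionFailed => {
                LightningErrorCode::OperationFailed
            }
        };
        // The detailed cause still reaches the client through the message.
        LightningError { message: err.to_string(), code }
    }
}
```

The exhaustive `match` means that adding a variant to the stand-in enum forces a decision about which coarse code it maps to.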

Could you explain what you mean by this? Some restriction on rust enums that make them bindings-safe?

Yes, UniFFI error enums cannot have nested fields.
A struct with message, enum, and error_code will work fine, but if we move more fields inside the enum, that isn't supported.
Modelling a oneof in Rust involves enums, and we can't have proper fields inside of enums.

I can discuss further details offline.
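
To illustrate the distinction drawn in this reply, the sketch below contrasts an enum whose variants carry fields with the flat "message, enum, error_code" struct, and flattens one into the other. All type and field names are hypothetical, and the code says nothing about what UniFFI itself accepts beyond what the comment above states.

```rust
/// Nested form: data lives inside the enum variants (the shape described as unsupported).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NestedError {
    InvalidRequest { field: String },
    Lightning { sub_code: u32, message: String },
}

/// Plain enum with unit variants only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    InvalidRequest,
    Lightning,
}

/// Flat form: a struct holding a message, a plain enum, and a numeric code.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlatError {
    pub message: String,
    pub kind: ErrorKind,
    pub error_code: u32,
}

impl From<NestedError> for FlatError {
    fn from(err: NestedError) -> Self {
        match err {
            // Variant payloads are folded into the message or the numeric code.
            NestedError::InvalidRequest { field } => FlatError {
                message: format!("invalid field: {}", field),
                kind: ErrorKind::InvalidRequest,
                error_code: 0,
            },
            NestedError::Lightning { sub_code, message } => FlatError {
                message,
                kind: ErrorKind::Lightning,
                error_code: sub_code,
            },
        }
    }
}
```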

G8XSU · 2024-11-07T16:26:10Z

Closing in favor of #20.
We can add the error_details field later.

G8XSU added 6 commits October 29, 2024 23:40

Move existing protos to api and types mod.

c5df6a4

Adjust ldk-node-server acc. to proto path changes.

b76917e

Adjust client acc. to proto path changes.

d862978

Adjust cli acc. to proto path changes.

c42ef6a

Add proto definitions for ErrorResponse.

9dd4f12

Add error struct for LdkServerError.

51ee669

G8XSU requested a review from jkczyz October 30, 2024 21:07

G8XSU mentioned this pull request Oct 30, 2024

Ldk-Server Milestone 2 #10

Open

13 tasks

jkczyz reviewed Oct 30, 2024


G8XSU mentioned this pull request Nov 5, 2024

Simpler Error Model for ldk-server #20

Merged

G8XSU closed this Nov 7, 2024
