Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record the exception raised while serializing gRPC metadata #4625

Closed
trustin opened this issue Jan 11, 2023 · 6 comments · Fixed by #4686
Closed

Record the exception raised while serializing gRPC metadata #4625

trustin opened this issue Jan 11, 2023 · 6 comments · Fixed by #4686
Labels
defect sprint Issues for OSS Sprint participants
Milestone

Comments

@trustin
Copy link
Member

trustin commented Jan 11, 2023

Armeria currently doesn't catch and record the exception raised by io.grpc.Metadata.serialize(). As a result, when gRPC fails to serialize some metadata, the request will time out or fail silently. The exception should instead be recorded into RequestLog.request/responseCause() so a user can analyze the root cause more easily.

Here's an example stack trace of such exceptions:

java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds for length 4
 at io.grpc.Metadata.serialize(Metadata.java:473)
 at io.grpc.InternalMetadata.serialize(InternalMetadata.java:78)
 at com.linecorp.armeria.internal.common.grpc.MetadataUtil.fillHeaders(MetadataUtil.java:68)
 at com.linecorp.armeria.server.grpc.ArmeriaServerCall.statusToTrailers(ArmeriaServerCall.java:663)
 at com.linecorp.armeria.server.grpc.ArmeriaServerCall.doClose(ArmeriaServerCall.java:422)
 at com.linecorp.armeria.server.grpc.ArmeriaServerCall.lambda$close0$6(ArmeriaServerCall.java:362)
 at com.linecorp.armeria.common.RequestContext.lambda$makeContextAware$3(RequestContext.java:547)
 at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
 at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
 at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Thread.java:829)
@trustin trustin added the defect label Jan 11, 2023
@trustin trustin added this to the 1.22.0 milestone Jan 11, 2023
@trustin
Copy link
Member Author

trustin commented Jan 11, 2023

Thanks for reporting the issue, @sfc-gh-nalekseev!

@ikhoon ikhoon modified the milestones: 1.22.0, 1.23.0 Feb 7, 2023
@ikhoon
Copy link
Contributor

ikhoon commented Feb 7, 2023

I didn't have enough time to fix this bug. Let me handle it in the next iteration. 🙇‍♂️

@trustin trustin added the sprint Issues for OSS Sprint participants label Feb 16, 2023
@aki-s
Copy link
Contributor

aki-s commented Feb 17, 2023

FYI:

I noticed that grpc/src/main/java/com/linecorp/armeria/server/grpc/ArmeriaServerCall.java is renamed to grpc/src/main/java/com/linecorp/armeria/server/grpc/AbstractServerCall.java
at this commit d246880
(
https://github.com/line/armeria/pull/4192/files#diff-77a9c8cba521c7ccb32f217b4a1c0f7f32450a51aeef23ee68b04447c0cecea0 )

As array size is preserved in the instance of io.grpc.Metadata beforehand, but
java.lang.ArrayIndexOutOfBoundsException is thrown.
This means the size has changed unexpectedly (usually implementation error at server-side or at client-side.


io/grpc/ClientCall.java

grpc/grpc-java@29d238a#diff-5ee94bb737d9714214695ecb38fb0307e2d997d90a05f2fed9082363e3086ceeR180

   * <p>Since {@link Metadata} is not thread-safe, the caller must not access (read or write) {@code
   * headers} after this point.

io/grpc/ServerCall.java

https://github.com/grpc/grpc-java/pull/8943/files#diff-6e854ec0ab72b9db4c205489e0ef27bb3b9f9348783792931e2fdf5fb79c8fe1R121

   * <p>Since {@link Metadata} is not thread-safe, the caller must not access (read or write) {@code
   * headers} after this point.

@jrhee17
Copy link
Contributor

jrhee17 commented Feb 20, 2023

Right 👍
As you've noticed, the stacktrace is for an older armeria version so the class names,line numbers may not match the most recent codebase.

Let me share a little background on this change:
ArmeriaServerCall has been divided into UnaryServerCall and StreamingServerCall.
Each armeria service may specify whether the request/response is unary or streaming. When a request/response is unary armeria can apply some special handling to make the code path more efficient.

  • UnaryServerCall: when both the request and response is unary
  • StreamingServerCall: when either one of request or response is streaming

I would guess that we need to verify both unary and streaming implementations to check if the problem still persists.

As array size is preserved in the instance of io.grpc.Metadata beforehand, but
java.lang.ArrayIndexOutOfBoundsException is thrown.
This means the size has changed unexpectedly (usually implementation error at server-side or at client-side.

I think Metadata can be specified by the user.
For instance, StatusRuntimeException can contain a Metadata. If a StatusRuntimeException containing an invalid Metadata is thrown/returned from a gRPC service I guess we might be able to reproduce the problem.

@aki-s
Copy link
Contributor

aki-s commented Feb 20, 2023

I noticed just wrapping with try catch would be easy to set RequestLog.request/responseCause(),
but there's risk that unnecessary implementation is included by this change.

I also noticed that finding root cause and reproducing the problem is a bit tough than I've expected..


Reason 1)

The change of

ArmeriaServerCall has been divided into UnaryServerCall and StreamingServerCall.

has happened on
Jul 4, 2022

It looks the error reported here could happen less than armeria-1.17.0.

I'll read the code of armeria-1.16.1, and try to understand why the bug could happen.


Reason 2)

Because io.grpc.Metadata usually manages private property size properly by itself
(construction part https://github.com/grpc/grpc-java/blob/release-1.52/api/src/main/java/io/grpc/Metadata.java#L120-L212 )
(serialization part https://github.com/grpc/grpc-java/blob/release-1.52/api/src/main/java/io/grpc/Metadata.java#L451-L479 )

It would be the timing issue to reproduce the error.
To surely reproduce the error, we might need to use reflection util, or need to mock implementation...

@jrhee17
Copy link
Contributor

jrhee17 commented Feb 22, 2023

Thanks!

I think we can create an invalid metadata using an unsafe API provided by gRPC.
e.g.

InternalMetadata.newMetadata(2, new byte[] { 1 }, new byte[] { 1 });

Using this, I think we can throw a StatusRuntimeException including the Metadata from a gRPC service:

new TestServiceImplBase() {
  public void someMethod(MyRequest request, StreamObserver<MyResponse> responseObserver) {
      responseObserver.onError(/* StatusRuntimeException with invalid metadata */)
  }
}

aki-s added a commit to aki-s/armeria that referenced this issue Mar 3, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 3, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 3, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 6, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 6, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 6, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 8, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 8, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 8, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 8, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 10, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 10, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
@ikhoon ikhoon modified the milestones: 1.23.0, 1.24.0 Mar 15, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Mar 17, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
aki-s added a commit to aki-s/armeria that referenced this issue May 31, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
@ikhoon ikhoon removed this from the 1.24.0 milestone Jun 2, 2023
@ikhoon ikhoon added this to the 1.25.0 milestone Jun 2, 2023
aki-s added a commit to aki-s/armeria that referenced this issue Jul 26, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
aki-s added a commit to aki-s/armeria that referenced this issue Aug 11, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
aki-s added a commit to aki-s/armeria that referenced this issue Aug 16, 2023
Partial fix for line#4625 .
This fix sets 'RequestLog.request/responseCause()' only the case when Metadata was invalid.
If we catch Exception at com.linecorp.armeria.server.grpc.StreamingServerCall#doClose
and set them as new metadata, then they could be transferred to client side before timeout.
ikhoon pushed a commit that referenced this issue Aug 17, 2023
Motivation:

I found an exception is thrown when "Armeria tries to set Metadata, but the Metadata is corrupted". 

Modifications:

- Log an exception thrown while Armeria tries to set Metadata.
- Returns the default `Metadata` with `INTERNAL` code instead as a fallback `Metadata`.

Result:

- Fixed `GrpcService` to return an INTERNAL error code if an error occurs while serializing gRPC metadata.
- Closes #4625
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect sprint Issues for OSS Sprint participants
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants