Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive data consumption #1909

Closed
coratron opened this issue May 13, 2024 · 15 comments
Closed

Excessive data consumption #1909

coratron opened this issue May 13, 2024 · 15 comments

Comments

@coratron
Copy link

coratron commented May 13, 2024

I am using the following:

A. Before migrating the project to this new stack I had all the comms (unencrypted) running on the Modem by using AT commands. Data consumption for the operation: 1kB

B. After migrating to the embedded SDK (using PKCS#11 for managing certificates, etc). Data consumption for the operation: 19kB. I am not talking about the initial 30kB for the TLS handshake etc, but just the ongoing MQTT transmission.

C. From B. but changing the following
#define TRANSPORT_SEND_RECV_TIMEOUT_MS 5000U
to an absurd value of:
#define TRANSPORT_SEND_RECV_TIMEOUT_MS 50000U
The data consumption for the operation : 4kB

What is going on here?

The ideal goal is to achieve the minimum amount of data consumption as possible as we are using SIM cards that are billed by the MB. I would expect a higher consumption (mostly due to the handshake), but not the ongoing 4x compared to what we used to have before migrating the project to use this sdk and certificates.

Any ideas?
The only thing I can see on the terminal at the moment is a periodic:

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.

I am wondering if this is causing some extra data consumption in the background.

@YOSI-yoshidayuji
Copy link

If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed:
FreeRTOS/coreMQTT#235

@coratron
Copy link
Author

coratron commented May 13, 2024

If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed: FreeRTOS/coreMQTT#235

thanks for your prompt reply @YOSI-yoshidayuji , I was precisely looking at the keep alive conditions and the other defines that can override the values. Then following your link did find out that indeed this is missing in that LTS release which most certainly would have an impact:

    if( status == MQTTSuccess )
    {
        pContext->lastPacketRxTime = pContext->getTime();
    }

I have added that manually for now and will do some tests, but may have to look at using the master branch. Unfortunately, nobody at ESP is maintaining the esp-aws-iot repo... The issues keep piling up hence why I sought help here.

@coratron
Copy link
Author

No changes unfortunately, still 4kB.

The following was appearing every 50s. Not a coincidence.

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.

That TRANSPORT_SEND_RECV_TIMEOUT_MS is what gets passed to
static MbedtlsPkcs11Status_t configureMbedtls( MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context, const char * pHostName, const MbedtlsPkcs11Credentials_t * pMbedtlsPkcs11Credentials, uint32_t recvTimeoutMs )

I have increased the value to 120s now and the message on the terminal indeed appears roughly every 120s. The data consumption is further reduced to 3kB.

Whatever is happening in the background in terms of transactions after this odd timeout occurs is what is driving this unwanted data consumption. Seems to be TLS related.

@aggarg
Copy link
Member

aggarg commented May 13, 2024

Can you track where this log is coming from:

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.

The reason I ask it might help us in finding out if some operation is being retried and resulting in more data consumption.

@coratron
Copy link
Author

coratron commented May 13, 2024

@aggarg

It comes from mbedtls_pkcs11_posix.c:

int32_t Mbedtls_Pkcs11_Recv( NetworkContext_t * pNetworkContext,
                             void * pBuffer,
                             size_t bytesToRecv )
{
    MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context = NULL;
    int32_t tlsStatus = 0;

    assert( ( pNetworkContext != NULL ) && ( pNetworkContext->pParams != NULL ) );

    pMbedtlsPkcs11Context = pNetworkContext->pParams;
    tlsStatus = ( int32_t ) mbedtls_ssl_read( &( pMbedtlsPkcs11Context->context ),
                                              pBuffer,
                                              bytesToRecv );

    if( ( tlsStatus == MBEDTLS_ERR_SSL_TIMEOUT ) ||
        ( tlsStatus == MBEDTLS_ERR_SSL_WANT_READ ) ||
        ( tlsStatus == MBEDTLS_ERR_SSL_WANT_WRITE ) )
    {
        LogDebug( ( "Failed to read data. However, a read can be retried on this error. "
                    "mbedTLSError= %s : %s.",
                    mbedtlsHighLevelCodeOrDefault( tlsStatus ),
                    mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );

        /* Mark these set of errors as a timeout. The libraries may retry read
         * on these errors. */
        tlsStatus = 0;
    }
    else if( tlsStatus < 0 )
    {
        LogError( ( "Failed to read data: mbedTLSError= %s : %s.",
                    mbedtlsHighLevelCodeOrDefault( tlsStatus ),
                    mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );
    }
    else
    {
        /* Empty else marker. */
    }

    return tlsStatus;
}

which in turn is the function that I assigned to transport.recv

  /* Fill in TransportInterface send and receive function pointers.
   * For this demo, TCP sockets are used to send and receive data
   * from the network. pNetworkContext is an SSL context for OpenSSL.*/
  transport.pNetworkContext = pNetworkContext;
  transport.send = Mbedtls_Pkcs11_Send;
  transport.recv = Mbedtls_Pkcs11_Recv;
  transport.writev = NULL; 

in mqtt_operations.c, just like it is done in the fleet provisioning example:

https://github.com/aws/aws-iot-device-sdk-embedded-C/blob/main/demos/fleet_provisioning/fleet_provisioning_with_csr/mqtt_operations.c

@aggarg
Copy link
Member

aggarg commented May 13, 2024

Okay - do we know what read operation fails. In other words, do we have a call stack when the read fails? Is there any possibility of capturing network traffic? Also, are you using QoS1? If yes, can you try QoS0?

@coratron
Copy link
Author

This is over CAT-M1, I do not have the tools to capture this traffic unfortunately. I may have to switch over to WiFi if it comes to that but metering would be done differently (unable to use the original reference for comparison).

In any case, the increased data consumption appears to be the symptom of this timeout error.

Here is the call stack.

0x400e2ac5: Mbedtls_Pkcs11_Recv at mbedtls_pkcs11_posix.c:965 (discriminator 34)
0x400e8d8a: receiveSingleIteration coreMQTT/coreMQTT/source/core_mqtt.c:1658
0x400e92d0: MQTT_ProcessLoop at coreMQTT/coreMQTT/source/core_mqtt.c:3071
0x400dde31: processMqtt() see below...
0x400d9432: networkingTask(void*) 
void processMqtt(void)
{
    MQTTStatus_t mqttStatus = MQTTBadParameter;

    /* Acquire the mqtt mutex lock. */
    if(pthread_mutex_lock(&mqttMutex) == 0)
    {
        /* Loop to receive packet from transport interface. */
        mqttStatus = MQTT_ProcessLoop(&mqttContext);

        pthread_mutex_unlock(&mqttMutex);
    }
    if(mqttStatus != MQTTSuccess)
    {
        LogError(("Failed to process MQTT loop with error = %u.", mqttStatus));
    }
    else
    {
        LogInfo(("Processed MQTT loop successfully."));
    }
}

I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask
What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.

I was using QoS1 but I have tried QoS0 and it produces the same results.

@aggarg
Copy link
Member

aggarg commented May 15, 2024

I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask
What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.

Does it mean that there is no data to read and this receive is expected to timeout? We need to still find out why the amount of data consumption is more. Can you log the number of bytes sent and bytes received in the transport send and receive functions and compare them?

@coratron
Copy link
Author

coratron commented May 15, 2024

@aggarg Thanks for bearing with me.

I would assume that this timeout indeed should not be a problem if there is no data to receive. The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here. I finally managed to enable the mbedTLS debug logging - (platform specific vs library...).

Cornering the problem a little further. Here are the new observations.

I get 5 messages like this (one message for every processMqtt() call as discussed previously).

I (418920) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (418920) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (418924) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (418933) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (418943) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)

D (418951) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.


Backtrace: ... (same as before)

I (418985) mbedTLS: |2| 0x3ffb49e4: => read
I (418985) mbedTLS: |2| 0x3ffb49e4: => read record
I (418991) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (418992) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5

And then the 6th looks like this:

I (534185) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (534185) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (534192) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (534200) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (534208) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)

D (534215) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.

Backtrace: ... (same as before)

I (534240) mbedTLS: |2| 0x3ffb49e4: => write
I (534245) mbedTLS: |2| 0x3ffb49e4: => write record
I (534250) mbedTLS: |2| 0x3ffb49e4: => encrypt buf
I (534256) mbedTLS: |4| 0x3ffb49e4: dumping 'before encrypt: output payload' (2 bytes)
I (534265) mbedTLS: |4| 0x3ffb49e4: 0000:  c0 00                                            ..
I (534274) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (internal)' (12 bytes)
I (534282) mbedTLS: |4| 0x3ffb49e4: 0000:  ed 28 2b 61 00 00 00 00 00 00 00 2c              .(+a.......,
I (534292) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (transmitted)' (8 bytes)
I (534300) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 2c                          .......,
I (534313) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (534321) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 2c 17 03 03 00 02           .......,.....
I (534330) mbedTLS: |3| 0x3ffb49e4: before encrypt: msglen = 2, including 0 bytes of padding
I (534339) mbedTLS: |4| 0x3ffb49e4: dumping 'after encrypt: tag' (16 bytes)
I (534353) mbedTLS: |4| 0x3ffb49e4: 0000:  c4 51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43  .Q4#I..--......C
I (534361) mbedTLS: |2| 0x3ffb49e4: <= encrypt buf
I (534363) mbedTLS: |3| 0x3ffb49e4: output record: msgtype = 23, version = [3:3], msglen = 26
I (534372) mbedTLS: |4| 0x3ffb49e4: dumping 'output record sent to network' (31 bytes)
I (534381) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a 00 00 00 00 00 00 00 2c f5 8a c4  ............,...
I (534394) mbedTLS: |4| 0x3ffb49e4: 0010:  51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43     Q4#I..--......C
I (534402) mbedTLS: |2| 0x3ffb49e4: => flush output
I (534410) mbedTLS: |2| 0x3ffb49e4: message length: 31, out_left: 31
I (534418) mbedTLS: |2| 0x3ffb49e4: ssl->f_send() returned 31 (-0xffffffe1)
I (534423) mbedTLS: |2| 0x3ffb49e4: <= flush output
I (534434) mbedTLS: |2| 0x3ffb49e4: <= write record
I (534435) mbedTLS: |2| 0x3ffb49e4: <= write
I (534442) mbedTLS: |2| 0x3ffb49e4: => read
I (534450) mbedTLS: |2| 0x3ffb49e4: => read record
I (534452) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (534458) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
I (535401) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535404) mbedTLS: |4| 0x3ffb49e4: dumping 'input record header' (5 bytes)
I (535411) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a                                   .....
I (535421) mbedTLS: |3| 0x3ffb49e4: input record: msgtype = 23, version = [0x303], msglen = 26
I (535434) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (535442) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535443) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535450) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 26 (-0xffffffe6)
I (535458) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535463) mbedTLS: |4| 0x3ffb49e4: dumping 'input record from network' (31 bytes)
I (535474) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a 00 00 00 00 00 00 00 13 07 c7 bc  ................
I (535482) mbedTLS: |4| 0x3ffb49e4: 0010:  c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8     .B.=...<..*.0..
I (535493) mbedTLS: |2| 0x3ffb49e4: => decrypt buf
I (535499) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (535507) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 13 17 03 03 00 02           .............
I (535522) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used' (12 bytes)
I (535525) mbedTLS: |4| 0x3ffb49e4: 0000:  ca 77 ab 33 00 00 00 00 00 00 00 13              .w.3........
I (535535) mbedTLS: |4| 0x3ffb49e4: dumping 'TAG used' (16 bytes)
I (535542) mbedTLS: |4| 0x3ffb49e4: 0000:  bc c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8  ..B.=...<..*.0..
I (535553) mbedTLS: |2| 0x3ffb49e4: <= decrypt buf
I (535563) mbedTLS: |4| 0x3ffb49e4: dumping 'input payload after decrypt' (2 bytes)
I (535571) mbedTLS: |4| 0x3ffb49e4: 0000:  d0 00                                            ..
I (535579) mbedTLS: |2| 0x3ffb49e4: <= read record
I (535582) mbedTLS: |2| 0x3ffb49e4: <= read
I (536393) mbedTLS: |2| 0x3ffb49e4: => read
I (536394) mbedTLS: |2| 0x3ffb49e4: => read record
I (536395) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (536401) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5

Then it starts all over.

@aggarg
Copy link
Member

aggarg commented May 15, 2024

The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here.

How are you measuring this data consumption?

@coratron
Copy link
Author

Directly from the carrier (that's how they bill), I am using their API as well to get higher resolution - but can only see data in chunks of about 15 min.

@coratron
Copy link
Author

@aggarg just found this 6th packet is happening for a ping request by enabling coreMQTT logs as well.
D (302432) coreMQTT: MQTT PINGREQ packet size is 2.

It seems I am back to @YOSI-yoshidayuji 's suggestion. I will have to check if that patch has been fully implemented in the version I am using.

@coratron
Copy link
Author

coratron commented May 15, 2024

Ok, I think we can put this to rest. I have removed PINGREQ for now. Data consumption is 3kB per operation which is substantially higher than what I expected but will have to settle for that. I can't see what else but TLS is responsible for the added data.

Thanks for the help @aggarg and @YOSI-yoshidayuji

@aggarg
Copy link
Member

aggarg commented May 15, 2024

I can't see what else but TLS is responsible for the added data.

I did not realize that you were not using encryption before. If you want to further dive deep, we would need some way to capture data. One way can be to setup MQTT broker on a PC and run capture on that.

Thank you for sharing your solution!

@coratron
Copy link
Author

Thanks @aggarg

No problem, this will work for now. The next big thing will be to see if the folks at ESP will release a better wrapper for this SDK. Just for perspective, my binary has gone from 800kB to 1.5MB in size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants