Azure Data Explorer - Ingestion Failures - BadRequest_NoRecordsOrWrongFormat #26651

Closed
akakarikos opened this issue Sep 12, 2023 · 11 comments
Labels
bug Something isn't working exporter/azuredataexplorer needs triage New item requiring triage

Comments

@akakarikos

Component(s)

exporter/azuredataexplorer

What happened?

Description

When exporting with the Azure Data Explorer exporter, there are several ingestion failures when ingesting into the OMNITELLogs and OMNITELTraces tables. One example of the error is shown below:

"OperationId": 3bbeb9f5-41bd-4611-b2e2-43f650e4166b,
"Database": adx-prod-omni-telemetry,
"Table": OMNITELTraces,
"FailedOn": 2023-09-12T09:57:07.3958316Z,
"IngestionSourcePath": https://282strldadxadpprdwe00101.blob.core.windows.net/20230912-ingestdata-e5c334ee145d4b4-0/adx-prod-omni-telemetry_OMNITELTraces_2023-09-12%2009:56:45.25714285%20+0000%20UTC%20m=+2228478.125662742_25eb7f10-60c3-436c-bc55-ff498b909fa3.json,
"Details": BadRequest_NoRecordsOrWrongFormat: The input stream produced 0 bytes. This usually means that the input JSON stream was ill formed.,
"FailureKind": Permanent,
"RootActivityId": a747c063-18d3-4ed6-aba4-0a3272bb14c6,
"OperationKind": DataIngestPull,
"OriginatesFromUpdatePolicy": false,
"ErrorCode": BadRequest_NoRecordsOrWrongFormat,
"Principal": dstsapp=KustoIngestionProd,
"ShouldRetry": false,
"User": ,
"IngestionProperties": [Format=Json/mandatory, IngestionMapping=[{"ColumnName":"TraceID","ColumnType":"string","Properties":{"Field":"TraceID"}},{"ColumnName":"SpanID","ColumnType":"string","Properties":{"Field":"SpanID"}},{"ColumnName":"ParentID","ColumnType":"string","Properties":{"Field":"ParentID"}},{"ColumnName":"SpanName","ColumnType":"string","Properties":{"Field":"SpanName"}},{"ColumnName":"SpanStatus","ColumnType":"string","Properties":{"Field":"SpanStatus"}},{"ColumnName":"SpanKind","ColumnType":"string","Properties":{"Field":"SpanKind"}},{"ColumnName":"StartTime","ColumnType":"datetime","Properties":{"Field":"StartTime"}},{"ColumnName":"EndTime","ColumnType":"datetime","Properties":{"Field":"EndTime"}},{"ColumnName":"ResourceAttributes","ColumnType":"dynamic","Properties":{"Field":"ResourceAttributes"}},{"ColumnName":"TraceAttributes","ColumnType":"dynamic","Properties":{"Field":"TraceAttributes"}},{"ColumnName":"Events","ColumnType":"dynamic","Properties":{"Field":"Events"}},{"ColumnName":"Links","ColumnType":"dynamic","Properties":{"Field":"Links"}}], ValidationPolicy=[Options=ValidateCsvInputConstantColumns, Implications=BestEffort, IsDetailedErrorReportingEnabled=False], Tags=[ToStringEmpty], IngestIfNotExists=[ToStringEmpty], ZipPattern=[null]],
"NumberOfSources": 1

Steps to Reproduce

No specific steps. Just set up an application with the OTel Collector configured with the ADX exporter.

Expected Result

Logs and traces are successfully exported and ingested by Azure Data Explorer.

Actual Result

Azure Data Explorer emits multiple ingestion failures.

Collector version

0.80.0

Environment information

Environment

Docker image deployed in AKS.
OS: linux amd64

OpenTelemetry Collector configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |-
    receivers:
      zipkin:
        endpoint: localhost:9411
      otlp:
        protocols:
          grpc:
            endpoint: localhost:4317
          http:
            endpoint: localhost:4318
            
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 30

      batch:
        send_batch_size: 50
    
    exporters:
      logging:
        verbosity: detailed
    
      azuremonitor:
        instrumentation_key: 
        spaneventsenabled: true

      azuredataexplorer:
        cluster_uri: 
        application_id: 
        application_key: 
        tenant_id:
        db_name: 
        metrics_table_name: 
        logs_table_name: 
        traces_table_name:
    
    service:
      pipelines:
        traces:
          receivers: [zipkin,otlp]
          processors: [memory_limiter,batch]
          exporters: [azuredataexplorer,azuremonitor,logging]
        logs:
          receivers: [otlp]
          processors: [memory_limiter,batch]
          exporters: [azuredataexplorer,azuremonitor,logging]

Log output

No response

Additional context

Is there a way to select the format of the ingestion between JSON and MultiJSON?

akakarikos added the bug and needs triage labels on Sep 12, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@ag-ramachandran
Contributor

Hello @akakarikos, the trace data is serialized as multi-JSON because there can be batching. There is no user option to choose between JSON and MultiJSON.

Does this lead to data loss when sampled?

The likely root cause of these ingestion failures is that empty JSON files are being sent for ingestion. If there is no data loss, this can be ignored.
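
To check whether the failures are all of this empty-payload kind, a query along these lines should work (a sketch only, using the table names and columns visible in the failure record above):

.show ingestion failures
| where Table in ("OMNITELLogs", "OMNITELTraces")
| where ErrorCode == "BadRequest_NoRecordsOrWrongFormat"
| top 10 by FailedOn desc
| project FailedOn, Table, Details, FailureKind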

@akakarikos
Author

Hi @ag-ramachandran, and thanks for your answer. Unfortunately, this leads to data loss. The tables for traces and logs don't get updated with the new logs/traces coming in.

Have you come across such a situation before?

@akakarikos
Author

akakarikos commented Sep 12, 2023

@ag-ramachandran I just checked the console logs of the OTel Collector and came across the message below. It refers to the Azure Monitor exporter, but I am wondering whether it is related to the ADX ingestion failures. Even after that error, though, the console continued to output logs and traces.

warn [email protected]/log_to_envelope.go:60 MessageData.Properties has value with length exceeding max of 8192: DataValue {"kind": "exporter", "data_type": "logs", "name": "azuremonitor"}

@ag-ramachandran
Contributor

@ag-ramachandran I just checked the console logs of the OTel Collector and came across the message below. It refers to the Azure Monitor exporter, but I am wondering whether it is related to the ADX ingestion failures. Even after that error, though, the console continued to output logs and traces.

warn [email protected]/log_to_envelope.go:60 MessageData.Properties has value with length exceeding max of 8192: DataValue {"kind": "exporter", "data_type": "logs", "name": "azuremonitor"}

Hi @akakarikos

Azure Monitor has a separate exporter that is not related to the ADX exporter. Azure Monitor is a SaaS observability solution and uses ADX internally for storage.

Regarding data loss, it would be interesting to verify this. The error occurred because an ingestion was attempted and the file sent for ingestion was 0 bytes, so I don't think this is a data loss scenario.
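
One quick way to verify (a sketch, assuming the OMNITELTraces table name from this thread) is to count how many rows have landed recently; a non-zero count would mean data is still flowing despite the failed 0-byte ingestions:

OMNITELTraces
| where ingestion_time() > ago(1h)
| count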

@ag-ramachandran
Contributor

ag-ramachandran commented Sep 12, 2023

Hi @ag-ramachandran, and thanks for your answer. Unfortunately, this leads to data loss. The tables for traces and logs don't get updated with the new logs/traces coming in.

Have you come across such a situation before?

You should be able to see the last failure by running

.show ingestion failures

on your cluster; take the timestamp of the last failure. Once you have this, run

TracesTable | extend ig = ingestion_time() | where ig >= datetime(**last-failure-ts**) | take 10

If this returns a 0-row result set, it means all records are failing from that point on.
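
Put together, the check looks roughly like this (a sketch; these are two separate statements, and the datetime value is illustrative, to be replaced with your own FailedOn timestamp):

// 1. Find the most recent failure per table.
.show ingestion failures
| summarize LastFailure = max(FailedOn) by Table

// 2. Look for rows ingested after that point (an empty result means everything since then failed).
OMNITELTraces
| extend ig = ingestion_time()
| where ig >= datetime(2023-09-12T09:57:07Z)
| take 10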

Have you come across such a situation before?
No

@akakarikos
Author

akakarikos commented Sep 12, 2023

Thanks @ag-ramachandran. Yes, I'm getting a 0 result set, so all subsequent calls are failing. Any ideas on how we could troubleshoot further?

Regarding data loss, it would be interesting to verify this. The error occurred because an ingestion was attempted and the file sent for ingestion was 0 bytes, so I don't think this is a data loss scenario.

We have stopped getting logs and traces for specific microservices, so I think there is data loss or I am missing something.

@ag-ramachandran
Contributor

ag-ramachandran commented Sep 14, 2023

Hello @akakarikos, sorry for the delay; I was away yesterday.

a) For troubleshooting, there are a few things we can look at. Are we using queued mode or managed mode in the exporter config? You can execute these commands to see when the last failure occurred:

[screenshots: Kusto commands for inspecting the latest ingestion failures]
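
The exact commands in those screenshots aren't recoverable from this thread, but something along these lines (a sketch based on the standard .show ingestion failures output) lists the most recent failures and their timestamps:

.show ingestion failures
| where FailedOn > ago(1d)
| project FailedOn, Table, OperationKind, ErrorCode, Details
| order by FailedOn desc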

b) Now, if we assume that all of these ingestion failures point to the 0-byte file ingestion, the next step would be to see why a 0-byte file is being exported. This is more likely something that has changed in the OTel configuration than something on Kusto. Do we see any logs on the exporter side in OTel that report any failures?

c) Are any ingestion commands coming in at all? You can change the database name, run the query below, and see when the latest ingestion happened:

[screenshot: Kusto query listing recent ingestion commands]
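
The query in that screenshot isn't recoverable either, but a sketch along these lines (the database name is taken from this thread, and the columns assume the standard .show commands output) shows the latest ingestion commands that reached the cluster. Note that .show commands typically requires admin or monitor permissions on the database:

.show commands
| where Database == "adx-prod-omni-telemetry"
| where Text startswith ".ingest"
| top 10 by StartedOn desc
| project StartedOn, State, Duration, Text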

d) It is strange that ingestion seems not to be happening. I was just tracking server-side telemetry, and there was an ingestion at 2023-09-13T23:59:53.3262871Z:

$$USAGE[Gateway]: Api='Admin', User='Sanitized!3E16611D', Application='Kusto.WinSvc.DM.Svc', PrincipalType='dSTS Application', PrincipalIdentity='KustoIngestionProd', DatabaseName='adx-prod-omni-telemetry', ClientAddress='::ffff:20.238.186.214:2069', Host='[none]', KustoClient='Kusto.Dotnet.Client:{2023.09.11.0646-2335-4237d37859-WeeklyProdPubExt}|Runtime:{.NETv4.7.2/CLRv4.0.30319.42000/4.7.4051.0_built_by:_NET472REL1LAST_B}', ClientRequestProperties=({"SecurityTokenPresent":false,"AuthorizationScheme":null,"RequestHostName":"https://***.kusto.windows.net:443/","LocalClusterName":"https://***/","OriginClusterName":"https://***.kusto.windows.net/","Options":{"request_impersonation_disabled":true,"request_callout_disabled":true,"api_version":"v1","version":"2019-02-13","request_app_name":"Kusto.WinSvc.DM.Svc"}}), Text=.ingest-from-storage into OMNITELTraces with (creationTime='0001-01-01T00:00:00Z',format='json',ignoreFirstRecord='false')<|'******'

The status of this ingestion was success, so it is surprising that you mention no traces are coming in:

DatabaseName='adx-prod-omni-telemetry', ClientRequestId='DM.IngestionExecutor;e4f9dfb6-3531-4946-9f7a-0076a1569a64;1', RootActivityId='bcc8c35c-b4ec-478f-b55a-294cab5465fb', Duration='00:00:00.2499961', State='Completed',

@akakarikos
Author

Hello @ag-ramachandran! Big thanks for your support and detailed guidance.

So, what I did as a last option yesterday was to kill/restart the specific pod that wasn't emitting any telemetry to ADX. After doing that, it started working again. It's really strange, as I couldn't find the root cause of why it suddenly stopped emitting data.

I think we can close this for now, and if it occurs again, I will touch base with my findings.

Out of curiosity, how do you track the server-side telemetry that you mentioned?

@ag-ramachandran
Contributor

Hello @akakarikos, Good Evening. Thanks for the confirmation.

Out of curiosity, how do you track the server-side telemetry that you mentioned?

Being a PaaS service, we track server telemetry to maintain server health; this is built into Kusto. Given a specific operation ID, we can check whether something erroneous is happening on the server and track it. Note that these statistics cover operations with respect to provisioned service health (not at a per-user-operation level).

I think we can close this for now, and if it occurs again, I will touch base with my findings.

Sure, we can close the issue. You can reach me at ramacg (at) ms (dot) com in case you need any further help with troubleshooting!

@akakarikos
Author

Hey @ag-ramachandran! That sounds great! Thanks for your explanation and support on this matter.
