[BUG] java.io.OptionalDataException error in OpenSearch 2.0 rc #1961
Comments
@HyunSeaHoon Thanks for reporting the issue. Can you please provide steps to reproduce the issue? |
@adnapibar The report above says, "The occurrence of errors is irregular, and errors sometimes disappear after a certain period of time has passed." I am afraid we have to work backwards from the error to understand how it could happen. |
@dblock Yes agree! But I think we need more details that can help us to reproduce this error. Also, does this error happen with other API calls during the time? I see one discussion on es 7.9 reporting similar issue (not with logstash) https://discuss.elastic.co/t/temp-downtimes-optionaldataexception-500-on-all-apis-aws/257571 which may be completely unrelated. |
Thank you for your interest. Our service collects logs with logstash, stores them with ElasticSearch, and retrieves logs with a NodeJS-based web server. We are changing from using ElasticSearch (with or without the Opendistro plugin) service to OpenSearch service. logstash uses OSS version 7.16.2 and the output plugin uses logstash-output-opensearch (1.2.0). The web server uses the javascript client elastic/elasticsearch. As far as I know, the OptionalDataException error has been confirmed so far in the OpenSearch V2.0Rc environment. Based on the symptoms occurring in the test environment, it seems to occur as follows:
I wonder how many cases of this symptom are there, and I wonder if it is an opensearch bug, or a problem with the version of the library we use or a problem with data structure. So far, it seems difficult to predict the cause of the occurrence. Obviously, I have not experienced it with any version other than OpenSearch 2.0Rc1. |
It happened with the official 2.0.0 release too. I migrated my cluster at 8h30; the problem started at 9h50 and took 1h to stop. It recovered from the error on its own (no intervention), and I don't know if it is going to come back. |
The same symptoms appear in the full version of Opensearch 2.0. The environment file used for testing is attached. After checking the following, an error also occurred in that case.
After all the services are up (docker-compose up -d), an OptionalDataException is thrown at some point after a certain amount of time has passed. (Occurs within 1 hour, occurs after 10 hours, etc.) The peculiar thing is that when monitoring with only the service running, OptionalDataException occurs in Logstash, but the log is not visible in OpenSearch. At this time, when accessing the OpenSearchDashboard UI, an error 500 occurs, and an OptionalDataException occurs in the OpenSearchDashboard log and OpenSearch log. |
Hi @HyunSeaHoon Thank you so much for providing such detailed steps to reproduce, appreciate your time. |
Hello @tlfeng . The architecture we adopted is as follows: When the problem happens, the logs from Logstash to the OpenSearch client nodes stop flowing. I assumed the problem was in Logstash and tried to restart it. After the Logstash restart, logs were still not flowing. Only when I restarted the OpenSearch client nodes did the logs start to flow again. It looks like a problem on the OpenSearch side. |
Hi! Any update on this one? |
The same happens here. Multiple times a day Logstash fails to send logs to the OpenSearch cluster. It started happening after the upgrade to V2. |
I tried to downgrade, but because of the Lucene upgrade to v9 the cluster failed to read the cluster state and index metadata. |
The only workaround that helped me was to schedule an hourly restart of the client nodes. This reduced the problem a lot, but of course it is an ugly workaround. The performance improvements in v2 were important to me, and that's why I performed the upgrade. |
Some information, based on what we've experienced: We recently updated OpenSearch to version 2.0.0. The problem seems to happen periodically (at least in our case), every 60 minutes. After some further investigation, we found that the problem seems to be associated with the OpenSearch Security Plugin cache, and the frequency of occurrence seems to be directly tied to the TTL defined for the cache (currently defined in I don't have any more practical evidence or logs about it, but, as a temporary solution, we've increased the cache's TTL and started to flush the cache once a day. Additional info: we use the OpenSearch JavaScript client (v2.0.0) to save the documents in OpenSearch. |
Yes, as a temporary workaround we run 'DELETE _plugins/_security/api/cache'. This solves the problem until the next cache expiry. Alternatively, use the "Purge cache" button in Dashboards > Security tab. |
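For readers who want to script the cache purge described above, here is a minimal sketch in Java using only the JDK's `java.net.http` API. The endpoint path is the one quoted in the comment; the class name, host, and `admin`/`admin` credentials are placeholders chosen for illustration, and the sketch only builds the request rather than claiming how any particular deployment should send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SecurityCacheFlush {
    // Path taken from the workaround quoted in the thread.
    static final String CACHE_PATH = "/_plugins/_security/api/cache";

    // Builds a DELETE request with HTTP Basic authentication.
    // baseUrl, user and password are assumptions supplied by the caller.
    static HttpRequest buildFlushRequest(String baseUrl, String user, String password) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(base + CACHE_PATH))
                .header("Authorization", "Basic " + token)
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildFlushRequest("https://localhost:9200", "admin", "admin");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (not done here):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

A cluster with a self-signed certificate would additionally need an `SSLContext` that trusts it before the request can be sent.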
Experienced the exact same error on a OS 2.0 cluster without Logstash. It occurred after indexing close to 1 billion documents with elasticsearch python client (7.10.0). Same exact python code didn't show the error on 1.3 OS and also the error doesn't happen consistently on OS 2.0 either. Experienced the bug for the first time after calling |
I will be looking into this. |
Upgraded to version 2.0.1 on Friday and turned off all scheduled cache purge. There were no issues for some time but on Sunday got same error again. So issue still exists but appears less frequently. |
Hi @JustinasKO . I also tried 2.0.1 and confirmed that it does not fix this issue. |
Same here, latest fix didn't seem to have any impact. Still get a node randomly failing which causes this error. In that case I have to restart the node as a tmp fix. I have installed it via helm chart if it means anything |
While making changes in the security information (user, roles and so on) via OS Dashboards, the problem appeared again. |
I am also using the Bulk API with the elasticsearch-net client. I followed the advice above and changed the This decreased the frequency of errors drastically. I then implemented logic to detect the error and call the flush cache API in my app before retrying. It's super hacky, but it's getting me by for now.

if (item.Error.Reason == "java.io.OptionalDataException")
{
    try
    {
        // Flush the security plugin's auth cache before retrying the bulk request.
        var flushCacheResponse = await _simpleCacheClient.FlushAuthCacheAsync();
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Error flushing auth cache.");
    }
}

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;

public class SimpleCacheClient : ISimpleCacheClient
{
    private readonly HttpClient _http;
    private readonly ILogger<SimpleCacheClient> _logger;
    private readonly AppOptions _options;

    public SimpleCacheClient(
        HttpClient httpClient,
        IOptions<AppOptions> options,
        ILogger<SimpleCacheClient> logger
    )
    {
        _http = httpClient;
        _logger = logger;
        _options = options.Value;
    }

    public async Task<FlushCacheResponse> FlushAuthCacheAsync()
    {
        _logger.LogDebug("Begin {method}", nameof(FlushAuthCacheAsync));
        // Collect per-node failures; created up front so AggregateException never receives null.
        var innerExceptions = new List<Exception>();
        foreach (var item in _options.ElasticClusterNodes)
        {
            try
            {
                var url = $"{item}/_opendistro/_security/api/cache";
                // Log before sending so the attempt is recorded even if the request throws.
                _logger.LogDebug("Attempting to flush cache at url {url}", url);
                var response = await _http.DeleteAsync(url);
                if (response.IsSuccessStatusCode)
                {
                    using var stream = await response.Content.ReadAsStreamAsync();
                    return JsonSerializer.Deserialize<FlushCacheResponse>(stream);
                }
                throw new HttpRequestException($"Error sending request: {response.ReasonPhrase}");
            }
            catch (Exception ex)
            {
                innerExceptions.Add(ex);
            }
            // Brief pause before trying the next node.
            await Task.Delay(1000);
        }
        throw new AggregateException("Unable to flush the cache from any of the available OpenSearch hosts.", innerExceptions);
    }
} |
Right now, everyone who is using OpenSearch 2.x (we are running OpenSearch 2.0.1) with the Security Plugin and Logstash 7.16.2 with the OpenSearch output plugin is probably losing data. And because it happens sporadically, it could go completely unnoticed. |
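One way to make such failures visible is to inspect every Bulk API response rather than only the HTTP status, because the Bulk API sets a top-level `errors` flag when at least one item failed. The sketch below is a rough illustration in plain Java: the class name and the sample response are hypothetical, the reason string is the one matched in the C# snippet earlier in the thread, and the regular expressions are a crude stand-in for a real JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BulkFailureCheck {
    // Top-level flag the Bulk API sets when at least one item failed.
    private static final Pattern ERRORS_TRUE =
            Pattern.compile("\"errors\"\\s*:\\s*true");
    // Item-level reason string reported by commenters in this thread.
    private static final Pattern ODE_REASON =
            Pattern.compile("\"reason\"\\s*:\\s*\"java\\.io\\.OptionalDataException\"");

    static boolean hasItemErrors(String bulkResponseJson) {
        return ERRORS_TRUE.matcher(bulkResponseJson).find();
    }

    // Counts items whose error reason is java.io.OptionalDataException.
    static int countOptionalDataFailures(String bulkResponseJson) {
        Matcher matcher = ODE_REASON.matcher(bulkResponseJson);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed bulk response used only for illustration.
        String sample = "{\"took\":3,\"errors\":true,\"items\":["
                + "{\"index\":{\"status\":500,\"error\":{\"reason\":\"java.io.OptionalDataException\"}}},"
                + "{\"index\":{\"status\":201}}]}";
        if (hasItemErrors(sample)) {
            System.err.println(countOptionalDataFailures(sample)
                    + " item(s) failed with OptionalDataException");
        }
    }
}
```

In production code a JSON library would be the safer choice, since a regex cannot tell a top-level `errors` field from the same text inside a document body.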
I can confirm that I can reproduce the issue with a few requests that either fix the issue or trigger it:
Environment: 2.0.1 with Logstash OSS 7.16.2. My problem has appeared since upgrading from 1.3.2 to 2.0.0. PS: if I back up the config, it is detected as version 1 and not 2; I don't know why. |
So I'm beginning to think that this may be caused by an incompatibility with the Elasticsearch version of the clients. I implemented my own bulk index client and haven't experienced these issues. Originally I was using elasticsearch-net client version 7.12. I'm not using Logstash but I imagine it is using the Elasticsearch versions of the client libraries internally and is running into the same issues. I think we have just hit a point where the API has diverged enough that the Elasticsearch clients are no longer compatible with > 2.x OpenSearch. |
I can confirm, it solved my issue with: |
@msoler8785 I do remember seeing this change in OpenSearch 2. |
I ran the exact same Python script (repeatedly calling the Bulk API) once with elasticsearch-py (7.10.0) and once with the opensearch-py client (2.0.0). On the first attempt (elasticsearch client) to index 1b docs (~111gb), data went missing and I got the same exceptions as mentioned above, with only 1b docs (~105gb) indexed. Using the opensearch-py client, the exception didn't occur. Since this exception has so far seemed somewhat random, I can't say that the opensearch client won't hit the same issue with more data, but it seems likely to be related to an incompatibility with the elasticsearch client. |
We are using 2.0.0 together with the Java client 2.0.0 (opensearch-rest-client and opensearch-rest-high-level-client), and we have been experiencing the same issues as others on this thread, so I am not sure it is a client incompatibility. As with others, it happens intermittently. We are running a 3-node cluster, and the problem may occur only once in a few days or within hours. |
Indeed, I solved the issue as well, when I set this configuration! |
Has anyone with this problem updated to 2.1.0 and verified if it still happens? Adding ttl_minutes does not solve the problem, it just alleviates it. |
We are still having this problem with 2.1.0. |
I'm still trying to reproduce the bug locally, and hope it can help me locate the problem.
public static void main(String[] args) throws IOException, InterruptedException, NoSuchAlgorithmException, KeyStoreException, KeyManagementException {
String timeStamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY,
new UsernamePasswordCredentials("admin", "admin"));
SSLContext sslContext = new SSLContextBuilder()
.loadTrustMaterial(null, (certificate, authType) -> true).build();
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "https"))
.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
@Override
public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
.setSSLContext(sslContext)
.setSSLHostnameVerifier(new NoopHostnameVerifier());
}
});
RestHighLevelClient client = new RestHighLevelClient(builder);
for(int i=1110; i<19000; i++) {
BulkRequest request = new BulkRequest();
request.add(new IndexRequest(String.valueOf(i)).id("1")
.source(XContentType.JSON,"date", timeStamp));
request.add(new DeleteRequest(String.valueOf(i), "1"));
request.add(new IndexRequest(String.valueOf(i)).id("2")
.source(XContentType.JSON,"date", timeStamp));
request.add(new UpdateRequest(String.valueOf(i), "2")
.doc(XContentType.JSON,"field2", "value2"));
BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);
System.out.println(i + " " + timeStamp);
System.out.println(bulkResponse.status());
TimeUnit.SECONDS.sleep(1);
}
client.close();
}
In pom.xml:
<dependencies>
<dependency>
<groupId>org.opensearch.client</groupId>
<artifactId>opensearch-rest-high-level-client</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.18.0</version>
</dependency>
</dependencies> |
I can perhaps explain our very easy setup which raised the bug yesterday, but not anymore (for now ~11 hours since I purged the security plugin cache):
This raised the mentioned OptionalDataException exactly 6 hours after recreating the Docker container for a duration of exactly 1 hour. About 2 hours later, I purged the security plugin cache, and the problem didn't happen again for now ~11 hours. |
@tlfeng When the issue occurs for us, it is not related to a Bulk API request. We have applied the mitigations
We still get the issue, although much less frequently: typically once a day. I have added a much larger portion of the stack trace in case it helps. (**** = removed text) |
I faced the same issue after upgrading from 1.3, so I found a way to reproduce this bug locally. I used a VM (4 CPUs, HDD)
4 sh script and json payload can be found in my repo https://github.com/denisvll/os_test |
The bug described in this issue is also happening to me. My OpenSearch infra context is:
My workaround solution was to create a script that checks the log ingestion status. If the ingestion log has no activity in the last 1 minute, it is executed
The following query checks whether log ingestion has stopped:
GET my-index*/
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"ingest_time": {
"from": "now-1m",
"to": "now"
}
}
}
]
}
}
} |
This is also happening with me. |
According to log traces, it all ends in Base64Helper in security plugin
and all workarounds and steps to reproduce are security plugin related. I tried to reproduce same issue without security plugin and it didn't happen. |
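As background on the exception itself: `java.io.OptionalDataException` is thrown by `ObjectInputStream` when an object read is attempted but the next element in the stream is primitive data (or the end of custom-written data). The following self-contained sketch provokes it using only the JDK. It illustrates what the exception means in general and is not a claim about what actually goes wrong inside the security plugin's `Base64Helper`; the class and method names are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OptionalDataException;

final class OptionalDataDemo {
    // Returns the number of unread primitive bytes reported by the exception,
    // or -1 if no OptionalDataException was thrown.
    static int provoke() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeInt(42);           // primitive data written first
                out.writeObject("payload"); // object written second
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                in.readObject(); // reader asks for an object while an int is still pending
                return -1;
            }
        } catch (OptionalDataException e) {
            return e.length; // count of primitive bytes still available in the stream
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Unread primitive bytes: " + provoke());
    }
}
```

In other words, the exception signals that the reader and the writer disagree about the layout of the serialized data, which is consistent with the thread's observation that the stack traces end in serialization code.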
@cwperks Would you mind taking the first look at this issue? |
Any update on this one? |
@amalgamm thanks for asking, this should be closed and the fix will be released in the upcoming OpenSearch |
Describe the bug
I am collecting data into OpenSearch with Logstash.
OpenSearch throws java.io.OptionalDataException error during service operation.
This did not happen in the OpenSearch 1.3 operating environment.
The occurrence of errors is irregular, and errors sometimes disappear after a certain period of time has passed.
Data is collected into different indexes depending on the data type, and collection into some indexes keeps working while the error occurs.
The failures are not limited to one specific index; they affect a variety of indexes.
To Reproduce
Expected behavior
Plugins
Screenshots
Host/Environment (please complete the following information):
Additional context
[OpenSearch Log]