What is the bug?
When grabbing a binary value out of OpenSearch via opensearch-hadoop and writing the data back to OpenSearch, it gets base64 encoded again.
How can one reproduce the bug?
Simply write some binary data to OpenSearch, read it into Spark using opensearch-hadoop, and write it back out again. The binary data will now be base64 encoded twice.
import base64
import numpy as np

es_options = {
"pushdown": "true",
"opensearch.nodes": CLUSTER_ENDPOINT,
"opensearch.port": "443",
"opensearch.nodes.resolve.hostname": "false",
"opensearch.nodes.wan.only": "true",
"opensearch.net.ssl" : "true",
"opensearch.aws.sigv4.enabled": "true",
"opensearch.aws.sigv4.region": REGION,
"opensearch.batch.size.entries": "0",
"opensearch.batch.size.bytes": "2m",
"opensearch.batch.write.retry.count": "5",
"opensearch.http.timeout": "2m",
"opensearch.http.retries": "5",
"opensearch.read.field.as.array.include": "approved_for,speakables,sent_bound",
}
# read data from ES
source = spark.read.format("opensearch").options(**es_options).load(SOURCE_INDEX)
smalldf = source.limit(1)

# base64 decode binary data only once to read it
embedding = np.frombuffer(base64.b64decode(smalldf.head(1)[0].embedding), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))
# write row back to ES as is
(
    smalldf
    .write.format("opensearch")
    .options(**es_options)
    .option("es.write.operation", "index")
    .save(DESTINATION_INDEX)
)
# read new index back into spark
dest = spark.read.format("opensearch").options(**es_options).load(DESTINATION_INDEX)

# now we need to run `base64.b64decode` twice to get the original embedding out!
embedding = np.frombuffer(base64.b64decode(base64.b64decode(dest.head(1)[0].embedding)), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))
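The symptom can be reproduced without a cluster. This is a minimal sketch (the `embedding` array is a hypothetical stand-in for the real document field): OpenSearch stores a binary field as a base64 string, so a read returns one layer of encoding; on write-back the connector appears to encode that string again instead of passing it through.

```python
import base64
import numpy as np

# Hypothetical embedding standing in for the real document field.
embedding = np.arange(10, dtype=np.float16)
raw = embedding.tobytes()

# What the read path returns: one layer of base64 encoding.
stored_once = base64.b64encode(raw)

# The bug: write-back encodes the already-encoded string a second time.
stored_twice = base64.b64encode(stored_once)

# A single decode no longer recovers the original bytes...
assert base64.b64decode(stored_twice) != raw
# ...two decodes are required, matching the reproduction above.
roundtrip = np.frombuffer(
    base64.b64decode(base64.b64decode(stored_twice)), dtype=np.float16
)
assert (roundtrip == embedding).all()
```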
What is the expected behavior?
I expect to be able to write data to opensearch without it base64 encoding my already base64 encoded binary data.
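One possible interim workaround (a sketch of the idea, not something confirmed against the connector): strip one layer of encoding from the field before handing the data back to the writer, so that the connector's unconditional re-encode restores a single layer and a single decode suffices again.

```python
import base64

payload = b"\x00\x01\x02\x03"                # hypothetical raw binary value
stored = base64.b64encode(payload)           # what the read path returns

# Workaround idea: pre-decode once before writing back...
pre_decoded = base64.b64decode(stored)       # == payload

# ...so the connector's re-encode produces only one layer in the index.
written = base64.b64encode(pre_decoded)
assert base64.b64decode(written) == payload  # single decode suffices again
```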
What is your host/environment?
Glue 4.0 Notebook
opensearch-spark-30_2.12-1.0.1.jar
Do you have any screenshots?