Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binary fields are getting double base64 encoded when round tripping through opensearch-hadoop #404

Open
AlJohri opened this issue Feb 7, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@AlJohri
Copy link

AlJohri commented Feb 7, 2024

What is the bug?

When grabbing a binary value out of opensearch via opensearch-hadoop and writing the data back to opensearch, it gets base64 encoded again.

How can one reproduce the bug?

Simply write some binary data to opensearch, read it into spark using opensearch-hadoop and write it back out again. The binary data will now be base64 encoded twice.

import base64
import numpy as np

es_options = {
    "pushdown": "true",
    "opensearch.nodes": CLUSTER_ENDPOINT,    
    "opensearch.port": "443",
    "opensearch.nodes.resolve.hostname": "false",
    "opensearch.nodes.wan.only": "true",
    "opensearch.net.ssl" : "true",
    "opensearch.aws.sigv4.enabled": "true",
    "opensearch.aws.sigv4.region": REGION,
    "opensearch.batch.size.entries": "0",
    "opensearch.batch.size.bytes": "2m",
    "opensearch.batch.write.retry.count": "5",
    "opensearch.http.timeout": "2m",
    "opensearch.http.retries": "5",
    "opensearch.read.field.as.array.include": "approved_for,speakables,sent_bound", 
}

# read data from ES
source = spark.read.format("opensearch").options(**es_options).load(SOURCE_INDEX)
smalldf = source.limit(1)

# base64 decode binary data only once to read it
embedding = np.frombuffer(base64.b64decode(smalldf.head(1)[0].embedding), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))

# write row back to ES as is
(
    smalldf
    .write.format("opensearch")
    .options(**es_options)
    .option("es.write.operation", "index")
    .save(DESTINATION_INDEX)
)

# read new index back into spark
dest = spark.read.format("opensearch").options(**es_options).load(DESTINATION_INDEX)

# now we need to run `base64.b64decode` twice to get the original embedding out!
embedding = np.frombuffer(base64.b64decode(base64.b64decode(dest.head(1)[0].embedding)), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))

What is the expected behavior?

I expect to be able to write data to opensearch without it base64 encoding my already base64 encoded binary data.

What is your host/environment?

  • Glue 4.0 Notebook
  • opensearch-spark-30_2.12-1.0.1.jar

Do you have any screenshots?

image

@AlJohri AlJohri added bug Something isn't working untriaged labels Feb 7, 2024
@dblock
Copy link
Member

dblock commented Jun 17, 2024

Catch All Triage - 1 2 3 4 5

@dblock dblock removed the untriaged label Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants