What is the bug?
When grabbing a binary value out of OpenSearch via opensearch-hadoop and writing the data back to OpenSearch, it gets base64 encoded again.
How can one reproduce the bug?
Simply write some binary data to OpenSearch, read it into Spark using opensearch-hadoop, and write it back out again. The binary data will now be base64 encoded twice.
import base64
import numpy as np

es_options = {
"pushdown": "true",
"opensearch.nodes": CLUSTER_ENDPOINT,
"opensearch.port": "443",
"opensearch.nodes.resolve.hostname": "false",
"opensearch.nodes.wan.only": "true",
"opensearch.net.ssl" : "true",
"opensearch.aws.sigv4.enabled": "true",
"opensearch.aws.sigv4.region": REGION,
"opensearch.batch.size.entries": "0",
"opensearch.batch.size.bytes": "2m",
"opensearch.batch.write.retry.count": "5",
"opensearch.http.timeout": "2m",
"opensearch.http.retries": "5",
"opensearch.read.field.as.array.include": "approved_for,speakables,sent_bound",
}
# read data from ES
source = spark.read.format("opensearch").options(**es_options).load(SOURCE_INDEX)
smalldf = source.limit(1)

# base64 decode binary data only once to read it
embedding = np.frombuffer(base64.b64decode(smalldf.head(1)[0].embedding), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))
# write row back to ES as is
(
    smalldf
    .write.format("opensearch")
    .options(**es_options)
    .option("es.write.operation", "index")
    .save(DESTINATION_INDEX)
)
# read new index back into spark
dest = spark.read.format("opensearch").options(**es_options).load(DESTINATION_INDEX)

# now we need to run `base64.b64decode` twice to get the original embedding out!
embedding = np.frombuffer(base64.b64decode(base64.b64decode(dest.head(1)[0].embedding)), dtype=np.float16)
print(embedding[0:5])
print(len(embedding))
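The symptom can be reproduced without a cluster. This is a minimal sketch (the `embedding` array is a hypothetical stand-in for the real document field): OpenSearch stores a binary field as a base64 string, so a read returns one layer of encoding; on write-back the connector appears to encode that string again instead of passing it through.

```python
import base64
import numpy as np

# Hypothetical embedding standing in for the real document field.
embedding = np.arange(10, dtype=np.float16)
raw = embedding.tobytes()

# What the read path returns: one layer of base64 encoding.
stored_once = base64.b64encode(raw)

# The bug: write-back encodes the already-encoded string a second time.
stored_twice = base64.b64encode(stored_once)

# A single decode no longer recovers the original bytes...
assert base64.b64decode(stored_twice) != raw
# ...two decodes are required, matching the reproduction above.
roundtrip = np.frombuffer(
    base64.b64decode(base64.b64decode(stored_twice)), dtype=np.float16
)
assert (roundtrip == embedding).all()
```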
What is the expected behavior?
I expect to be able to write data to opensearch without it base64 encoding my already base64 encoded binary data.
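One possible interim workaround (a sketch of the idea, not something confirmed against the connector): strip one layer of encoding from the field before handing the data back to the writer, so that the connector's unconditional re-encode restores a single layer and a single decode suffices again.

```python
import base64

payload = b"\x00\x01\x02\x03"                # hypothetical raw binary value
stored = base64.b64encode(payload)           # what the read path returns

# Workaround idea: pre-decode once before writing back...
pre_decoded = base64.b64decode(stored)       # == payload

# ...so the connector's re-encode produces only one layer in the index.
written = base64.b64encode(pre_decoded)
assert base64.b64decode(written) == payload  # single decode suffices again
```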
What is your host/environment?
Glue 4.0 Notebook
opensearch-spark-30_2.12-1.0.1.jar
Do you have any screenshots?