This issue was moved to a discussion.

[QST]List<Struct<String, String>> is same as List<Map<String, String>> ? #1554

Closed
chenrui17 opened this issue Jan 20, 2021 · 4 comments
Labels
question Further information is requested

Comments

@chenrui17

What is your question?
I am trying to implement my UDF jsonToArray, and my CPU UDF function is:

public List<Map<String, String>> evaluate(String jsonArrayStr) {
    List<Map<String, String>> result = new ArrayList<Map<String, String>>();
    try {
        Object json = new JSONTokener(jsonArrayStr).nextValue();
        if (json instanceof JSONObject) {
            Map<String, String> newMap = new HashMap<String, String>();
            JSONObject newItem = new JSONObject(jsonArrayStr);
            Iterator keys = newItem.keys();
            while (keys != null && keys.hasNext()) {
                String key = String.valueOf(keys.next());
                String value = ((JSONObject) json).get(key).toString();
                if (StringUtils.isNotBlank(key))
                    newMap.put(key, value);
            }
            result.add(newMap);
        } else if (json instanceof JSONArray) {
            JSONArray jsonArray = new JSONArray(jsonArrayStr);
            for (int j = 0; j < jsonArray.length(); j++) {
                if (jsonArray.get(j) instanceof JSONObject) {
                    JSONObject item = jsonArray.getJSONObject(j);
                    if (item != null) {
                        Map<String, String> newItem = new HashMap<String, String>();
                        Iterator keys = item.keys();
                        while (keys != null && keys.hasNext()) {
                            String key = String.valueOf(keys.next());
                            String value = item.get(key).toString();
                            if (StringUtils.isNotBlank(key))
                                newItem.put(key, value);
                        }
                        result.add(newItem);
                    }
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    return result;
}

I plan to implement this UDF in udf-example, using StringWordCount as a reference, but I am confused about the return type. The CPU UDF returns List<Map<String, String>>, while cudf can only make a column of type List<Struct<String, String>>. I think they are the same for my purposes, right? If I am wrong, please correct me.

In addition, for the jsonToArray UDF, my planned steps are:

  1. Read the JSON string object with cudf_io::read_json.
  2. Build the column by calling cudf::make_structs_column and cudf::make_lists_column in cudf.
  3. Return a std::unique_ptr<cudf::column>.

Please give me some advice if this is wrong. Thanks a lot! @jlowe
@chenrui17 chenrui17 added ? - Needs Triage Need team to review and classify question Further information is requested labels Jan 20, 2021
@chenrui17 chenrui17 changed the title [QST]<Struct<String, String>> is same as <Hash<String, String> ? [QST]List<Struct<String, String>> is same as List<Hash<String, String>> ? Jan 20, 2021
@chenrui17 chenrui17 changed the title [QST]List<Struct<String, String>> is same as List<Hash<String, String>> ? [QST]List<Struct<String, String>> is same as List<Map<String, String>> ? Jan 20, 2021
@chenrui17
Author

The jsonToArray UDF likely relies on rapidsai/cudf#6985, right?

@jlowe
Member

jlowe commented Jan 20, 2021

The CPU UDF returns List<Map<String, String>>, while cudf can only make a column of type List<Struct<String, String>>. I think they are the same for my purposes, right?

Yes, these are the same types. libcudf does not have a map type, so it's implemented as a list of struct of key,value pairs. This shouldn't be a problem in practice for the UDF because Spark knows the expected type being returned from the UDF (this type is derived from the CPU implementation), and therefore the RAPIDS Accelerator knows the expected Spark result type as well. The plugin checks that the returned ColumnVector can be converted to the expected Spark type, and Spark's MapType is implemented in libcudf as a LIST of STRUCT of two STRING columns.
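As a rough illustration of the "list of struct of key/value pairs" idea, the sketch below flattens one map per row into an offsets list plus parallel key and value lists. This is plain Java, not libcudf or RAPIDS Accelerator code: the class name MapColumnLayout and its members are hypothetical, and ordinary Java lists stand in for device buffers. A List<Map<String, String>> result would presumably add one more level of list offsets on top of this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (hypothetical class): Java lists stand in for the
// buffers of a LIST column whose child is a STRUCT of two STRING columns.
final class MapColumnLayout {
    // Row i owns the entries in the half-open range [offsets[i], offsets[i + 1]).
    final List<Integer> offsets = new ArrayList<>();
    // The two STRUCT children: keys.get(j) pairs with values.get(j).
    final List<String> keys = new ArrayList<>();
    final List<String> values = new ArrayList<>();

    // Flatten one map per row into offsets + keys + values.
    static MapColumnLayout fromRows(List<Map<String, String>> rows) {
        MapColumnLayout col = new MapColumnLayout();
        col.offsets.add(0);
        for (Map<String, String> row : rows) {
            for (Map.Entry<String, String> entry : row.entrySet()) {
                col.keys.add(entry.getKey());
                col.values.add(entry.getValue());
            }
            col.offsets.add(col.keys.size());
        }
        return col;
    }

    // Rebuild the map stored in row i from the flattened representation.
    Map<String, String> row(int i) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int j = offsets.get(i); j < offsets.get(i + 1); j++) {
            result.put(keys.get(j), values.get(j));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> r0 = new LinkedHashMap<>();
        r0.put("a", "1");
        r0.put("b", "2");
        Map<String, String> r1 = new LinkedHashMap<>();
        r1.put("c", "3");

        MapColumnLayout col = fromRows(Arrays.asList(r0, r1));
        System.out.println(col.offsets); // [0, 2, 3]
        System.out.println(col.keys);    // [a, b, c]
        System.out.println(col.values);  // [1, 2, 3]
        System.out.println(col.row(1));  // {c=3}
    }
}
```

The point of the sketch is that no dedicated map container is needed: the key/value pairs of every row live in two flat child columns, and the offsets say which pairs belong to which row.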

read json string object by cudf_io::read_json

I don't think this will work. If I understand the intent of the jsonToArray UDF properly, for each row it needs to parse a JSON object within a string column. cudf::io::read_json is designed to read from a file or memory buffer, not row-by-row in a string column. It does not take a column as input. You may be able to leverage some of the implementation of that method, but I don't see how you could use it directly.

After constructing the std::unique_ptr<cudf::column> that contains your output, you'll need to release the pointer and convert the cudf::column * into a jlong to be returned to Java. Then that long needs to be passed to the ColumnVector constructor. See the StringWordCount JNI and StringWordCount Java implementations for what needs to be done to get the resulting libcudf column back into Java.

@jlowe
Member

jlowe commented Jan 20, 2021

The jsonToArray UDF likely relies on rapidsai/cudf#6985, right?

Probably. Implementing get_json_object will require implementing a row-by-row JSON object parser which is closer to what jsonToArray needs than what cudf::io::read_json does.

@jlowe jlowe removed the ? - Needs Triage Need team to review and classify label Jan 21, 2021
@jlowe
Member

jlowe commented Jan 21, 2021

Closing as answered. Please reopen if there are further questions on this.

@jlowe jlowe closed this as completed Jan 21, 2021
@NVIDIA NVIDIA locked and limited conversation to collaborators Apr 28, 2022
@sameerz sameerz converted this issue into discussion #5368 Apr 28, 2022
