Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

not_exist_ok for dict_set only allows to add a non existing key to a dictionary, or non existing entry to a list. Not a whole nested structure implied by the query #1402

Closed
wants to merge 11 commits into from

Conversation

dafnapension
Copy link
Collaborator

closes: #1400

Can still set a whole nested structure, but as the value being set, not through processing the query by dict_set.

@dafnapension
Copy link
Collaborator Author

dafnapension commented Nov 29, 2024

Hi @yoavkatz , I think the needed patches are rather ugly... and I hardly managed to cover all (with axes and pry-bars..; translating from Shai K Ophir, min 4:32 here). Users will have to employ similar measures in the future, or perhaps change their existing code right now, if we pull the rag under their legs (the rag that allowed to dict_set a whole construct through the query; although that rag looked odd to you).
I recommend to return back home..

@@ -47,6 +47,7 @@
),
preprocess_steps=[
RenameSplits(mapper={"dev": "train", "validation": "test"}),
SetEmptyDictIfDoesNotExist(field="media"),
Copy link
Member

@yoavkatz yoavkatz Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 'media' field is used by generic code, so it should be added by the code and not the cards.

See for example:

"media": instance.get("media", {}),

Hence, this field should be added

https://github.com/IBM/unitxt/blob/reduce-error-message-clutter/src/unitxt/standard.py#L297

@@ -277,6 +278,20 @@ def process(
return instance


class SetEmptyDictIfDoesNotExist(InstanceOperator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of this is method is more than expected by the name. It not only creates an empty dict if it does not exist, but also transform the field from string to dict if string is a json. I think we should avoid such complex behavior, which is more indicative of a bug somewhere else.

For example, on Friday I found a bug in post processing pipeline:

image

(Still not in main)

Copy link
Collaborator Author

@dafnapension dafnapension Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yoavkatz , If this helps, I found what triggered my need to add this json.loads to SetEmptyDictIfDoesNotExist . Again, just for patching with גרזנים ולומיק the need to manually generate container elements (lists and dicts, thus far I wrote a לומיק just for dicts, but will need to add lists too), to not surprise the user of dict_set:

This method, shamelessly makes task_data into a string:

    def serialize_instance_fields(self, instance, task_data):
        if settings.task_data_as_text:
            instance["task_data"] = json.dumps(task_data)

        if not isinstance(instance["source"], str):
            instance["source"] = json.dumps(instance["source"])
        return instance

@dafnapension dafnapension force-pushed the simpler_dict_utils branch 3 times, most recently from 5019948 to 3c1ab49 Compare December 2, 2024 18:53
@dafnapension
Copy link
Collaborator Author

dafnapension commented Dec 2, 2024

Hi @yoavkatz , I suspect that the new SetEmptyDictIfDoesNotExist that I added, and the about to be added SetEmptyListIfDoesNotExist and the enforcement to use them before each and every existing Set and more over: each and every existing Operator that writes anything to any to_field, will drive our users crazy. Let alone the runtime it will consume.
I suggest to consider any of the following solutions:
(1) extend the possible values of not_exist_ok to : No, at_most_last_component, any_missing_component,
and throw exception accordingly (my recommendation)
(2) only allow at_most_last_component (that you feel is the decent, expected behavior), and then once an operator and an input instance like those that started the current saga will meet:
applying operator=Copy(field="source", to_field="task_data/source") on input instance
{"source": "hello", "task_data" :3} will throw a Warning, saying something like:
operator Copy employed dict_set in order to set value "hello" to an element (subfield) "task_data/source" that does not exist in the input instance, nor its implied containing subfield of type dict "task_data". (which exists, but is not a dictionary). dict_set is not allowed to generate into the instance elements implied by the query, other than the element implied by the last component.
dict_set was invoked instead with the following query and value: dict_set("task_data", {"source" : "hello"}) which sets a value, of type dict, to an existing element, "task_data", replacing its old value. This seems to achieve the requested result, and still comply with the rule that dict_set changes at most the element identified by the last component of the query.

Other than these explanation and apologies, the computed results will remain the same, users will not have to change their existing code, and/or write tedious future invocations of dict_set s.

@yoavkatz
Copy link
Member

yoavkatz commented Dec 3, 2024

Hi @yoavkatz , I suspect that the new SetEmptyDictIfDoesNotExist that I added, and the about to be added SetEmptyListIfDoesNotExist and the enforcement to use them before each and every existing Set and more over: each and every existing Operator that writes anything to any to_field, will drive our users crazy. Let alone the runtime it will consume. I suggest to consider any of the following solutions: (1) extend the possible values of not_exist_ok to : No, at_most_last_component, any_missing_component, and throw exception accordingly (my recommendation) (2) only allow at_most_last_component (that you feel is the decent, expected behavior), and then once an operator and an input instance like those that started the current saga will meet: applying operator=Copy(field="source", to_field="task_data/source") on input instance {"source": "hello", "task_data" :3} will throw a Warning, saying something like: operator Copy employed dict_set in order to set value "hello" to an element (subfield) "task_data/source" that does not exist in the input instance, nor its implied containing subfield of type dict "task_data". (which exists, but is not a dictionary). dict_set is not allowed to generate into the instance elements implied by the query, other than the element implied by the last component. dict_set was invoked instead with the following query and value: dict_set("task_data", {"source" : "hello"}) which sets a value, of type dict, to an existing element, "task_data", replacing its old value. This seems to achieve the requested result, and still comply with the rule that dict_set changes at most the element identified by the last component of the query.

Other than these explanation and apologies, the computed results will remain the same, users will not have to change their existing code, and/or write tedious future invocations of dict_set s.

Hi. I guess what I'm missing is how prevalent is this problem and wherher its actually a problem caused by another issue.. I means media and task data are generic fields that should always be there and added by generic cide. If not, we need to understand why.

In general, we need to strongly avoid complex flags, that increase the learning curve of the library.

@dafnapension
Copy link
Collaborator Author

dafnapension commented Dec 4, 2024

I agree (although I like flags..). Can this issue hint to what you are looking for? I bumped into it while needing to push this LoadJson into the patch.
You forgot to mention recipe_metadata that is assumed to exist, at least by Operator ApplyTempate.

…dictionary, or non existing entry to a list. Not a nested structure as implied by the query

Signed-off-by: dafnapension <[email protected]>
…w fails if dict_set only adds to an existing dictionary

Signed-off-by: dafnapension <[email protected]>
…t ensure task_data exists, because default value is used

Signed-off-by: dafnapension <[email protected]>
…dexing into a list element

Signed-off-by: dafnapension <[email protected]>
…dictionary, or non existing entry to a list. Not a nested structure as implied by the query

Signed-off-by: dafnapension <[email protected]>
Signed-off-by: dafnapension <[email protected]>
Signed-off-by: dafnapension <[email protected]>
@dafnapension dafnapension marked this pull request as draft December 13, 2024 14:20
@dafnapension
Copy link
Collaborator Author

Too much existing code trust dict_set to fill in all missing component, and not just the last component

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange Behavior with Copy
2 participants