#269: fixed token class span output #270

MarleneKress79789 · 2024-10-31T13:50:37Z

All Submissions:

Is the title of the Pull Request correct?
Is the title of the corresponding issue correct?
Have you updated the changelog?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?
Are you mentioning the issue which this Pull Request fixes ("Fixes...")?
Before you merge don't forget to run tests in AWS CodeBuild, by adding [CodeBuild] to the commit message

fixes #269
fixes #272
fixes #273

tests/unit_tests/udfs/test_token_classification.py

tkilias · 2024-11-01T13:10:57Z

tests/unit_tests/udfs/test_token_classification.py

-    result_output = Output(result[0].rows)
+    mock_base_model_factory: Union[ModelFactoryProtocol, MagicMock] = create_autospec(ModelFactoryProtocol,
+                                                                                      _name="mock_base_model_factory")
+    number_of_intendet_used_models = params.expected_model_counter# todo is this always same?


This depends on the test case, so having the number in the parameters is correct.

Only the name expected_model_counter is a bit odd. Maybe expected_model_calls or something like that.
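A minimal sketch of the idea discussed here, not the PR's actual test code: each parameter class carries its own expected call count, and the test compares it against an autospecced factory mock. `ModelFactory`, `load_models`, the two parameter classes, and the name `expected_model_calls` are hypothetical stand-ins; only `create_autospec` and `call_count` are real `unittest.mock` features.

```python
from typing import Sequence
from unittest.mock import create_autospec


class ModelFactory:
    """Hypothetical stand-in for the project's model factory protocol."""

    def from_pretrained(self, model_name: str) -> object:
        raise NotImplementedError


def load_models(factory: ModelFactory, model_names: Sequence[str]) -> int:
    """Hypothetical code under test: loads each distinct model only once."""
    loaded = {}
    for name in model_names:
        if name not in loaded:
            loaded[name] = factory.from_pretrained(name)
    return len(loaded)


class SingleModelParams:
    # Hypothetical test case: two batches that use the same model.
    model_names = ["model_a", "model_a"]
    expected_model_calls = 1


class MultipleModelParams:
    # Hypothetical test case: two batches that use different models.
    model_names = ["model_a", "model_b"]
    expected_model_calls = 2


def check_model_calls(params) -> int:
    """Runs the code under test and checks the per-case expected call count."""
    mock_factory = create_autospec(ModelFactory, instance=True)
    load_models(mock_factory, params.model_names)
    actual_calls = mock_factory.from_pretrained.call_count
    assert actual_calls == params.expected_model_calls
    return actual_calls
```

Keeping the expected count next to the test data means the assertion in the test body stays the same for every case.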

doc/changes/changes_2.2.0.md

...df_wrapper_params/sequence_classification/error_on_prediction_single_model_multiple_batch.py

tkilias · 2024-11-16T21:44:57Z

...ts/udf_wrapper_params/token_classification/error_not_cached_multiple_model_multiple_batch.py

 class ErrorNotCachedMultipleModelMultipleBatch:
    """
    not cached error, multiple model, multiple batch
    """
-    expected_model_counter = 0
+    expected_model_counter = 1


Same as https://github.com/exasol/transformers-extension/pull/270/files#r1845243573, I guess, but shouldn't it be called twice?

Improve docstring

improve docstrings

new issue for docstring improvements #276

tkilias · 2024-11-16T21:45:22Z

...ests/udf_wrapper_params/token_classification/error_not_cached_single_model_multiple_batch.py

 class ErrorNotCachedSingleModelMultipleBatch:
    """
    not cached error, single model, multiple batch
    """
-    expected_model_counter = 1
+    expected_model_counter = 0


Same as https://github.com/exasol/transformers-extension/pull/270/files#r1845243573 I guess

Improve doc string

Improve docstring

tkilias · 2024-11-16T21:49:36Z

tests/unit_tests/udf_wrapper_params/token_classification/make_data_row_functions.py

+             entity_covered_text, entity_type, score, entity_docid, entity_char_begin, entity_char_end,
+             error_msg)]
+
+#todo if use in all tests


If it is a permanent todo, please create a ticket

tkilias · 2024-11-16T21:51:29Z

tests/unit_tests/udf_wrapper_params/token_classification/make_data_row_functions.py

+score=0.1
+error_msg = None
+
+#todo comment explain entity/token naming mess


What is this todo about?

tkilias · 2024-11-18T11:05:01Z

...udf_wrapper_params/token_classification/error_on_prediction_multiple_model_multiple_batch.py

-
-    def run(ctx: UDFContext):
-        udf.run(ctx)
-
 class ErrorOnPredictionMultipleModelMultipleBatch:
    """
    not cached error, multiple model, multiple batch


improve docstring

tkilias · 2024-11-18T22:37:24Z

...s/udf_wrapper_params/token_classification/error_on_prediction_single_model_multiple_batch.py

-
-    def run(ctx: UDFContext):
-        udf.run(ctx)
-
 class ErrorOnPredictionSingleModelMultipleBatch:
    """
    error on prediction, single model, multiple batch,


Improve docstring

tkilias · 2024-11-18T22:43:13Z

tests/unit_tests/udf_wrapper_params/token_classification/make_data_row_functions.py

+entity_type="ENTITY_TYPE"
+score=0.1
+error_msg = None
+


Add type hints to functions

also added to new ticket

tkilias · 2024-11-18T22:44:28Z

tests/unit_tests/udf_wrapper_params/token_classification/make_data_row_functions.py

+    while the type/class of the found token is called "entity_group".
+    unless aggregation_strategy == "none", then the type/class of the found
+    token is called "entity" in the model output.
+    returns a list of number_entities times the model output row.


Use proper docstring formatting for return values, such that we use a unified style, see other projects. See https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

also added to new ticket
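For illustration, a Google-style docstring with separate `Args:` and `Returns:` sections could look like the sketch below. The function name, signature, and row contents are assumptions made for this example, not the PR's code. The key names follow the behaviour described in the quoted docstring: `entity_group` normally, `entity` when `aggregation_strategy == "none"`.

```python
from typing import Any, Dict, List


def make_model_output_rows(
    number_entities: int, aggregation_strategy: str
) -> List[Dict[str, Any]]:
    """Builds identical mocked model output rows for one input text.

    Args:
        number_entities: How many copies of the output row to return.
        aggregation_strategy: Aggregation strategy passed to the pipeline.
            With "none", the type of the found token is stored under the
            key "entity", otherwise under "entity_group".

    Returns:
        A list containing number_entities copies of the model output row.
    """
    label_key = "entity" if aggregation_strategy == "none" else "entity_group"
    # Placeholder values, chosen only for this sketch.
    row = {label_key: "ENTITY_TYPE", "score": 0.1, "word": "text", "start": 0, "end": 20}
    return [dict(row) for _ in range(number_entities)]
```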

tkilias · 2024-11-18T22:53:42Z

tests/unit_tests/udf_wrapper_params/token_classification/make_data_row_functions.py

+
+def make_number_of_strings(input_str: str, desired_number: int):
+    """
+    returns desired number of "input_strX", where X is counting up to desired_number.


Docstring style
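A possible shape for this helper with a return type hint and a Google-style docstring, as a sketch only. The PR's implementation is not shown in this thread, so the list return type and the counter starting at 1 are assumptions (the latter inferred from names like `bfs_conn1, bfs_conn2` elsewhere in the PR).

```python
from typing import List


def make_number_of_strings(input_str: str, desired_number: int) -> List[str]:
    """Creates numbered variants of a string.

    Args:
        input_str: Base string that each result starts with.
        desired_number: Number of strings to create.

    Returns:
        A list of desired_number strings of the form "<input_str><X>",
        where X counts from 1 up to desired_number (assumed start value).
    """
    return [f"{input_str}{index}" for index in range(1, desired_number + 1)]
```

With this sketch, `make_number_of_strings("bfs_conn", 2)` can be unpacked into two names, matching how the helper is called in the parameter files.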

tkilias · 2024-11-19T13:16:59Z

tests/unit_tests/udfs/test_token_classification.py

@@ -110,6 +119,66 @@ def create_mock_metadata(udf_wrapper):
    )
    return meta

+# todo these functions should be reusable for the other unit tests. should we move them to a utils file or something?


If you need them, then yes, it is probably a good idea to move them. However, I suggest you create a ticket and do it in another PR.

added to #274

tkilias · 2024-11-19T13:33:16Z

...it_tests/udf_wrapper_params/token_classification/multiple_model_multiple_batch_incomplete.py

-    token_docid = 1
-    start = 0
-    end = 20
+    bfs_conn1, bfs_conn2 = make_number_of_strings(bucketfs_conn, 2) # todo why two in this test case? multiple model could still be same bfs con right?


Probably no specific reason; it might make sense to use the same bfs_conn.

tkilias · 2024-11-21T11:19:46Z

...unit_tests/udf_wrapper_params/token_classification/multiple_model_multiple_batch_complete.py

@@ -13,34 +13,33 @@ class MultipleModelMultipleBatchComplete:
    data_size = 2
    n_entities = 3

-    bfs_conn1, bfs_conn2 = make_number_of_strings(bucketfs_conn, 2) # todo why two in this test case? multiple model could still be same bfs con right?
    sub_dir1, sub_dir2 = make_number_of_strings(sub_dir, 2)


Maybe for another PR: multiple sub_dirs might not be necessary either.

MarleneKress79789 added 2 commits October 31, 2024 14:48

token_classification_udf with spans now returns input span

28b95c2

started changing unit tests to use stadaloneudfmack, and be easier ma…

c9fa1c1

…intainable

tkilias reviewed Nov 1, 2024

tests/unit_tests/udfs/test_token_classification.py (outdated)

tkilias reviewed Nov 1, 2024

MarleneKress79789 added 14 commits November 1, 2024 15:58

switched more unit tests

6a11269

switched more unit tests

2aa4434

switched more unit tests

3029ba3

fixed error in unit tests asserts

4323a76

fix failing unit tests (mock pipeline, mock tokenizer)

5daea9d

fix failing unit tests wrong model_name (many files, sorry)

c31b57a

fix failing unit tests wrong assumptions

34ef3db

more unit test conversion, fixed output creat function not matching i…

5ff1b94

…f agg strategy is "none"

converted error unit tests

f8e72b0

converted error unit tests

a459d0a

fixed expectations of error unit tests, way of throwing exceptions in…

fd06cd8

… mocks

cleanup and comments

2593dfa

version update

a18e178

changes file

e4c85a6

tkilias reviewed Nov 16, 2024

doc/changes/changes_2.2.0.md (outdated)

tkilias reviewed Nov 16, 2024

...df_wrapper_params/sequence_classification/error_on_prediction_single_model_multiple_batch.py

tkilias reviewed Nov 16, 2024
