Scatter table refactoring #1319

nkanazawa1989 · 2023-11-14T20:51:57Z

Summary

This PR modifies ScatterTable, which was introduced in #1253.

This change resolves some code issues in #1315 and #1243.

Details and comments

In the original design, ScatterTable is tied to the fit models, and its columns contain model_name (str) and model_id (int). The fit module also allows only three data categories: "processed", "formatted", and "fitted". However, #1243 breaks this assumption: the StarkRamseyXYAmpScanAnalysis fitter defines two fit models that are not directly mapped to the results data. The data fed into the models is synthesized from the input results data. The fitter needs to manage four data categories: "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In the above example, "raw" and "ramsey" category data can fill the new fields name (formerly model_name) and class_id (formerly model_id) without indicating a particular fit model. Usually, raw category data is just classified according to the data_subfit_map definition, and the map doesn't need to match the fit models. The connection to the fit models is only introduced in a particular category defined by the new option fit_category. This option defaults to "formatted", but the StarkRamseyXYAmpScanAnalysis fitter would set "phase" instead. Thus, fit model assignment is effectively delayed until the formatter function.
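The classification step described above can be pictured with a small sketch. This is not the Qiskit Experiments implementation: the helper `classify`, the metadata key `series`, and the names `ramsey_x`, `ramsey_y`, and `model_a` are hypothetical, and plain dictionaries stand in for the table. Only the option names `data_subfit_map` and `fit_category` come from the description.

```python
from typing import Optional

# Hypothetical classifier map: data name -> metadata filter.
data_subfit_map = {
    "ramsey_x": {"series": "X"},
    "ramsey_y": {"series": "Y"},
}


def classify(metadata: dict, subfit_map: dict) -> Optional[str]:
    """Return the first data name whose filter matches the metadata."""
    for name, filt in subfit_map.items():
        if all(metadata.get(key) == value for key, value in filt.items()):
            return name
    return None


# Raw results are named from data_subfit_map alone, with no fit model involved.
raw_records = [
    {"xval": 0.1, "yval": 0.52, "metadata": {"series": "X"}},
    {"xval": 0.1, "yval": 0.48, "metadata": {"series": "Y"}},
]
table = [
    {
        "category": "raw",
        "name": classify(rec["metadata"], data_subfit_map),
        "xval": rec["xval"],
        "yval": rec["yval"],
    }
    for rec in raw_records
]

# A formatter could append synthesized rows under another category.
table.append({"category": "phase", "name": "model_a", "xval": 0.1, "yval": 1.57})

# Only rows in the category named by fit_category are handed to the fit models.
fit_category = "phase"
fit_rows = [row for row in table if row["category"] == fit_category]
```

In this sketch the raw rows never refer to a fit model; the only place a model name appears is in the category selected by `fit_category`.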

Also, the original scatter table is designed to store all circuit metadata, which causes some problems in data formatting, especially when it tries to average the data over the same x value in a group. Non-numeric data is averaged with the builtin set operation, but this assumes that the metadata values are hashable objects, which is not generally true. This PR therefore drops all metadata from the scatter table. Note that the only metadata fields that matter for the curve analysis are the ones used for model classification (classifier fields); the other fields just decorate the table and add unnecessary memory footprint. The classifier fields and name (class_id) carry largely duplicated information. This implies that the name and class_id fields are enough for end users to reuse the table data for further analysis once it is saved as an artifact.
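The hashability problem is easy to reproduce in plain Python. The sketch below is not the original averaging code; `merge_metadata` is a hypothetical stand-in for collecting the distinct values of a non-numeric column with a set, and the `qubits` field is made up for illustration.

```python
def merge_metadata(values):
    """Collapse a group of non-numeric values into their distinct members."""
    return set(values)


# Hashable metadata (strings, ints, tuples) works as expected.
merged = merge_metadata(["X", "X", "Y"])

# A dict- or list-valued metadata field is unhashable and breaks the set.
try:
    merge_metadata([{"qubits": [0, 1]}, {"qubits": [0, 1]}])
    failed = False
except TypeError:
    failed = True
```

Since circuit metadata is free-form, a field like this can appear in any experiment, which is one way to read the motivation for dropping metadata from the table.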

nkanazawa1989 · 2023-11-16T09:33:32Z

Sorry @wshanks if you already started reviewing. I added a new commit, 72a5e6a. The context of the commit is the following: in the first commit of the PR, _run_data_processing() still classifies the data based on the model names, which introduces a hidden coupling to the fit model, although "raw" category data should be unaware of it. In the second commit, I changed the function so that it classifies the data based on the data_subfit_map, and the mapping from data index to model index is then done in the _format_data() function.

I understand that the point of the discussion is the (implicit) index matching between self._models and the data_subfit_map keys. In 72a5e6a, the index matching is handled explicitly through data-model name consistency.
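A minimal sketch of what name-based index matching could look like, assuming the formatter simply looks up each data name among the model names. The helper `model_index` and the names `cos`, `sin`, and `calibration` are hypothetical and not taken from the actual `_format_data()` code.

```python
from typing import List, Optional


def model_index(data_name: Optional[str], model_names: List[str]) -> Optional[int]:
    """Map a data name to the index of the fit model with the same name."""
    if data_name in model_names:
        return model_names.index(data_name)
    return None


model_names = ["cos", "sin"]  # hypothetical names of the fit models
data_names = ["sin", "cos", "calibration"]  # hypothetical data_subfit_map keys

# The order of data_subfit_map no longer has to follow the order of the models.
class_ids = [model_index(name, model_names) for name in data_names]
```

With an explicit lookup like this, a data name that has no model of the same name is simply left without a model index instead of silently picking up a neighbouring one.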

wshanks

In my in-line comments, I had one question that might require action and one typo fix. Besides that, I think all the changes here are good. My one hesitation with merging this PR as is though is documentation. For one thing, ScatterTable is not in the built documentation at all. From a comment in the ScatterTable code, I wonder if we could add the pandas .inv url to docs/conf.py? Beyond that, the column names are not documented. In particular, I think the curve analysis documentation (maybe curve-analysis-workflow?) should describe category, name, and class_id. The doc-strings for _run_data_processing and _format_data could also be clearer about what they do. It is hard to understand how the workflow manipulates the data in order to know how to make a customization like #1243 does with the current documentation.

wshanks · 2023-11-16T15:24:29Z

qiskit_experiments/curve_analysis/curve_analysis.py

+            source[idx]["shots"] = datum.get("shots", -1)
+
+            # Assign entry name and class id
+            # Enumerate starts at 1 so that unclassified data becomes class_id = 0.


Should unclassified data be allowed at this point? Previously, an exception was raised for unclassified data. Maybe the comment should give an example of why data could remain unclassified.

It is not a big deal but one side effect of allowing unclassified data as class 0 is that the formatted data classes are offset by 1 from the raw data classes in the default case, which might be confusing.

nkanazawa1989 · 2023-11-17

Added a release note about this behavior change and replaced the class id with a null value in 4aa8505. The scatter table can keep all user data from the experiment (at least in the "raw" category, as this doesn't need to be part of the fitting). So I believe this behavior is more convenient for users, especially when they rerun the analysis from a scatter table stored in the artifact.
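The two numbering schemes discussed in this thread can be compared side by side. This is only an illustration with made-up names (`ramsey_x`, `ramsey_y`), using `None` as the null value; it does not reproduce the code in the commit.

```python
subfit_keys = ["ramsey_x", "ramsey_y"]  # hypothetical data_subfit_map keys
names = [None, "ramsey_x", "ramsey_y"]  # None marks a row that matched no filter

# Scheme A: enumerate from 1 and reserve class_id 0 for unclassified rows.
ids_offset = {name: i for i, name in enumerate(subfit_keys, start=1)}
scheme_a = [ids_offset.get(name, 0) for name in names]

# Scheme B: enumerate from 0 and leave unclassified rows with a null class_id.
ids_plain = {name: i for i, name in enumerate(subfit_keys)}
scheme_b = [ids_plain.get(name) for name in names]
```

Scheme A shifts every classified id by one, which is the offset the reviewer found confusing; scheme B keeps the ids aligned with the map order and still retains the unclassified row.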

qiskit_experiments/curve_analysis/curve_analysis.py

wshanks · 2023-11-16T15:54:58Z

qiskit_experiments/curve_analysis/curve_analysis.py

            averaged["xval"] = xv
            averaged["yval"] = avg_yval
            averaged["yerr"] = avg_yerr
-            averaged["model_id"] = mid
+            averaged["name"] = g_dict["name"][0]


I looked at this part for a while trying to understand. It seemed like there was a split here with the name preserving the class name but the id changing from the class to the model, until I realized that the model id only gets set if the class name matches the model name. I am not sure much can help with that.

nkanazawa1989 · 2023-11-18

Do you suggest removing the name field if nothing matches? The current behavior could be a bit confusing, as you say, but removing the field would require the Stark analysis to override the entire method just to keep the data name for visualization of the formatted data. Another solution would be re-adding separate model_name and model_id fields, but for almost all analyses this information is duplicated and may confuse users.

releasenotes/notes/add-dataframe-curve-data-a8905c450748b281.yaml

nkanazawa1989 · 2023-11-19T00:02:09Z

Thanks Will. I updated the tutorial in 61ebcd9. In #1253 @coruscating also tried to find an approach for class doc rendering, and we also consulted with Qiskit core members, but no one could find a valid solution. Subclassing a class from an external package is likely not a good direction in terms of documentation (we should contribute to the package instead).

coruscating · 2023-11-19T02:14:15Z

@wshanks The docs issue was that by default, all attributes inherited from pandas were rendered and trying to parse the pandas docstrings in our build fails with a lot of warnings. We already render externally inherited attributes like in https://qiskit.org/ecosystem/experiments/dev/stubs/qiskit_experiments.framework.ExperimentEncoder.html, but ideally we shouldn't. "inherited-members": None is already set in the autodoc options, which should work in theory but doesn't.

itoko · 2023-11-27T08:20:19Z

Sorry for my late comment. I think it's worth considering separating the implementation (pandas DataFrame) from the interface (ScatterTable), not only to fix the doc generation issue but also to make future implementation changes easier (we have little control over third-party software, even though I know pandas is a very stable and reliable library). Also, for example (data: ScatterTable), I as a developer would like to write data.raw_xval instead of data[data.category == "raw"].xval.to_numpy(). For pandas users, we could have a data.to_dataframe() function. I think forwarding is more suitable than inheritance for the implementation of ScatterTable. I expect the overhead of the wrapping to be acceptable (sufficiently small). What do you think? @nkanazawa1989 @wshanks @coruscating
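A rough sketch of the forwarding idea, assuming a wrapper that keeps the dataframe private and exposes a few accessors. The class name `ScatterTableSketch` and the method `xvals` are hypothetical; `raw_xval` and `to_dataframe` follow the names floated in the comment above, and nothing here reflects the interface that was eventually implemented.

```python
import numpy as np
import pandas as pd


class ScatterTableSketch:
    """Toy container that wraps a DataFrame instead of subclassing it."""

    def __init__(self, rows):
        self._data = pd.DataFrame(rows, columns=["xval", "yval", "category"])

    def xvals(self, category: str) -> np.ndarray:
        """Return the x values of one category as a NumPy array."""
        mask = self._data["category"] == category
        return self._data.loc[mask, "xval"].to_numpy()

    @property
    def raw_xval(self) -> np.ndarray:
        """Convenience accessor in the spirit of the suggested data.raw_xval."""
        return self.xvals("raw")

    def to_dataframe(self) -> pd.DataFrame:
        """Hand pandas users a copy so the internal frame stays private."""
        return self._data.copy()


data = ScatterTableSketch(
    [
        {"xval": 0.1, "yval": 0.5, "category": "raw"},
        {"xval": 0.2, "yval": 0.6, "category": "raw"},
        {"xval": 0.1, "yval": 0.55, "category": "formatted"},
    ]
)
```

Because the wrapper is not a `DataFrame` subclass, only the methods defined on it would show up in generated documentation, and the storage backend could change without touching callers that use the accessors.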

coruscating

I'm good with merging this now and @nkanazawa1989 following up with another PR to alter the ScatterTable interface as suggested by @itoko.

wshanks

Sorry for the delay. I read through the latest changes and I think everything looks good code-wise. I had final suggestions for the wording.

I like the documentation additions.

docs/tutorials/images/curve_analysis_structure.png

docs/tutorials/curve_analysis.rst

releasenotes/notes/add-dataframe-curve-data-a8905c450748b281.yaml

wshanks · 2023-12-13T16:23:18Z

Should we make a new issue for @itoko's point? Or will you just start with a PR @nkanazawa1989? One thing similar to @itoko's point that I wanted to raise is that I turned on all warnings while trying qiskit 1.0 with qiskit-experiments, and I noticed that our tests generate a lot of PendingDeprecationWarnings for the old CurveData methods like x, y, and y_err. Maybe adding convenience methods like data.raw_xval would allow us to update the analysis classes to stop generating deprecation warnings internally.

qiskit_experiments/curve_analysis/curve_analysis.py

nkanazawa1989 · 2023-12-22T01:36:54Z

I plan to start with a PR, since we already have some consensus and no further discussion is needed. Regarding the warnings, I think we just need to remove these pending warnings. With @itoko's suggestion of composition (forwarding), we need to wrap the data frame with Qiskit Experiments classes instead of inheriting from the pandas object. This means .x doesn't automatically refer to the x column of the data frame, so we want to keep the method as a convenient accessor.

### Summary

This PR updates the implementation of `ScatterTable` and `AnalysisResultTable` based on the [comment](#1319 (comment)) from @itoko.

### Details and comments

The current pattern relies heavily on inheritance, `Table(DataFrame, MixIn)`, but this causes several problems. The Qiskit Experiments classes depend directly on the third-party library, resulting in Sphinx directive mismatches and poor robustness of the API. Instead of using inheritance, these classes are refactored with composition and delegation, namely

```python
class Table:
    def __init__(self):
        self._data = DataFrame(...)
```

This pattern is also common in other software libraries that use dataframes. Since this PR removes unreleased public classes, it should be merged before the release. Although this updates many files, the changes just delegate the data handling logic to the class itself, which simplifies the implementation of classes that operate on the container objects. The new pattern also allows stricter dtype management with the dataframe.

---------

Co-authored-by: Will Shanks <[email protected]>

nkanazawa1989 requested a review from wshanks November 14, 2023 20:52

Refactoring

8db9df9

- Remove context of fit model from data name and index; model_name -> name, model_id -> class_id
- Remove extra metadata from the scatter table

nkanazawa1989 force-pushed the update-scatter-table branch from 7ecee38 to 8db9df9 Compare November 15, 2023 02:05

wshanks added this to the Release 0.6 milestone Nov 15, 2023

Update name assignment in data processing

72a5e6a

data_subfit_map becomes the single source of the data name. This means data is classified without any notion of the fit model. Index mapping is done in the formatter function through the data-model name identity.

update test

3804c7e

nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Nov 16, 2023

Rewrite formatter based on qiskit-community#1319

bdd83d1

nkanazawa1989 mentioned this pull request Nov 16, 2023

Replace the fitter of StarkRamseyXYAmpScan experiment #1243

Merged

wshanks reviewed Nov 16, 2023

nkanazawa1989 and others added 5 commits November 17, 2023 10:23

Typo fix

da331c5

Co-authored-by: Will Shanks <[email protected]>

Add upgrade doc about behavior change and replace class ID for unassigned data with null value.

674ebd4

Minor fix for variable name

c1b9a13

Update tutorial

61ebcd9

fix handling of nan value in averaging

1f10cc2

nkanazawa1989 force-pushed the update-scatter-table branch from 9dca9fc to 1f10cc2 Compare November 18, 2023 23:50

itoko mentioned this pull request Nov 21, 2023

Add layer fidelity experiment #1322

Merged

1 task

nkanazawa1989 mentioned this pull request Nov 22, 2023

Epic - Implementation of RFC 0007: Dataframe for Qiskit Experiments Qiskit/RFCs#62

Closed

5 tasks

coruscating approved these changes Dec 12, 2023

wshanks approved these changes Dec 13, 2023

wshanks reviewed Dec 19, 2023

nkanazawa1989 and others added 2 commits December 22, 2023 10:18

Fix typo in image file

ee6e3df

review suggestions from Will

b86186a

Co-authored-by: Will Shanks <[email protected]>

nkanazawa1989 enabled auto-merge December 22, 2023 01:32

nkanazawa1989 added this pull request to the merge queue Dec 22, 2023

Merged via the queue into qiskit-community:main with commit 5bb1fb4 Dec 22, 2023
11 checks passed

nkanazawa1989 deleted the update-scatter-table branch December 22, 2023 04:04

nkanazawa1989 mentioned this pull request Jan 18, 2024

Cleanup dataframes #1360

Merged

wshanks mentioned this pull request Feb 6, 2024

Remove pandas version bound prior to release #1367

Closed
