-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup dataframes #1360
Cleanup dataframes #1360
Conversation
63ccb54
to
7736f19
Compare
7373897
to
0cae116
Compare
@@ -181,6 +181,7 @@ def __eq__(self, other): | |||
self.neg_coef_o1 == other.neg_coef_o1, | |||
self.neg_coef_o2 == other.neg_coef_o2, | |||
self.neg_coef_o3 == other.neg_coef_o3, | |||
self.offset == other.offset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a followup of #1236 . PR can be separated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the nice things about dataclasses 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto_save
is not working right now. I get errors like {"errors":["experiment_uuid is required","type must be a string","chisq is not a decimal"]}
after this:
expdata=exp.run(backend)
expdata.auto_save=True
Then if I run save()
again on its own, everything seems to save correctly.
return instance | ||
|
||
@property | ||
def dataframe(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add __repr__()
so that the dataframe is shown when users try to print the table directly? I think that's a bit more intuitive than accessing table.dataframe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern for returning the html representation is that a user might believe the object is a bare pandas dataframe and blame missing methods that pandas provides. Of course html representation is very helpful, and I'm not against adding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, maybe there can be a header before the dataframe content to distinguish the table from a dataframe.
Thanks @coruscating for catching the bug.
This was not correct. I just forgotten to specify |
901bb28
to
144127a
Compare
8090dae
to
7c0662c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nkanazawa1989 , thank you for your hard work on this cleanup. I've mainly reviewed public interface of ScatterTable
class newly introduced during this series of refactoring. Overall, looks good to me in that point.
Just one major comment from me. To tell the following points clearly to users, I think we need more introductory explanation is necessary at the beginning of ScatterTable
class (and/or in docs/tutorials/curve_analysis.rst
). For example, we may need to explain the important tuple (kind
, category
, analysis
) as a unique key; We may need to mention to what values are allowed for category
and what each of them represents; so on.
I understand ScatterTable
object now to be a single source of the truth as the data used for all plots. I think that's good for users who rework with some complex Analysis classes. Instead, it may contain multiple data sets. That requires users to fully understand the three concepts, kind
, category
and analysis
, because this triplet must be a unique key to retrieve one sequence of data points -- (x, y, y_err) arrays -- from multiple data sets. I think that's why all of the versatile filter
method and get_?
methods have those in their options. I personally like get_?
methods since most of experiments would be simple and ScatterTable
object is likely to have a single key (kind
, category
and analysis
) for all rows. They would prevent from calling filter
functions everywhere.
I'll left minor comments inline, each of which should not prevent merging this PR.
Thanks @itoko , I added more documentation in 346d23a. I added the documentation directly to the class doc of the ScatterTable for visibility and added the cross-ref link to the tutorial.
This is not true because even the model and analysis are trivial most of analysis subclasses produce two categories of raw and formatted, e.g. in RB, raw corresponds to P1 of each randomization seed and formatted is P1 of the sample average of them. I hesitate to add dedicated methods like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nkanazawa1989, I like the lazy add rows approach and the overall the code looks fine to me. Since it's difficult to test the ScatterTable output from experiments in this PR, perhaps we can merge this first and handle any potential followups from adding unit tests in #1342 in another PR, if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through all the changes and left comments. There are many but I think they are easy to address and most are just related to documentation.
The thing I am most uncertain about with ScatterTable is how much it is a user facing class versus an internal class. For a user just running experiments, I think it is not exposed, but for a user writing CurveAnalysis subclasses it is? So it only makes sense to document it in the context of curve analysis? There is a little documentation now but not too much (the table is shown in the the curve analysis workflow section and the columns are described in the ScatterTable doc string). One concern I have to make the roles of name
, data_uid
, category
, and model
(not a column but a term used in the same context) clear.
if key in self._data.index: | ||
return [key] | ||
# This key is name of entry | ||
loc = self._data["name"] == key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the key can match an ID or a name. That feels a little overloaded though they shouldn't collide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah unfortunately this is existing behavior :(
yerr_arr_fit = unp.std_devs(uval_arr_fit) | ||
else: | ||
yerr_arr_fit = np.zeros_like(xval_arr_fit) | ||
for xval, yval, yerr in zip(xval_arr_fit, yval_arr_fit, yerr_arr_fit): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still surprises me that it is better to iterate over numpy arrays point by point and add them to them to lists to add to a new dataframe rather than just adding the numpy arrays to a new dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling of the empty column is expensive because it requires careful handling of missing values. Without doing this shots column may be accidentally typecasted to float because numpy doesn't support nullable integer. This means we first need to create a 2D object-dtype ndarray and populate values, then convert it into dataframe. Since current _lazy_add_rows
buffer assumes row-wise data list, arrays needs to be converted into this form internally.
def x(self) -> np.ndarray: | ||
"""X values.""" | ||
return self.xval.to_numpy() | ||
# For backward compatibility with CurveData.x | ||
return self._data.xval.to_numpy(dtype=float, na_value=np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this pretty efficient? We might look at if it is worth caching these properties where it would be tempting to use them multiple times in the same function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC using the lru_cache for the class method may cause memory leak and I guess I need to implement custom logic (i.e. cache that clears the data when _lazy_add_rows
is changed).
According to the performance analysis of T1Analysis, the most time-consuming logic after 03aac67 is still iter_groups
, and the most expensive part is the data extraction from the dataframe. This means implementation of the cache doesn't drastically improve the performance.
Probably we may want to replace pandas with polars in the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was going to comment on iter_groups
that we should look at using polars there. Even just casting to polars for that one method and back to pandas might help (I mean, not refactoring the rest of the code to use polars).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to stop over-engineering. We currently allow users to run experiment on existing experiment data to get more statistically confident result. But I don't think this is heavy used feature and we can remove. In principle we run very expensive operation for the sake of something useless; if we have 100 xvals, we need to iterate over them and create new table with size=1 to average... Even if we switch to polars there still be overhead of object creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that that feature has never seemed that useful to me. I think we could implement some support for post-processing instead, like a combine_results
function or something that could take multiple experiment results, warn if the options are not all the same, and then run the analysis with the combined job data.
We didn't have this ability to extend the data of a completed experient in the code we used before qiskit-experiments and I think it was a feature request of something to be included when starting over from scratch. I like the experiment being less mutable and extending the experiment being like "copy-on-write" where a new experiment ID is generated for the combination of multiple experiments if you need that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think polars could possibly help here though because I expect its groupby to be more efficient than iterating through values in Python. With the pyarrow backend (I think not yet the default in pandas), it should be possible for pandas and polars to work on the same in-memory representation of the data so it should be efficient to go back and forth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your suggestion. IIRC when we started this project we had QV experiment in our mind. Namely we need to repeat until standard error gets converged within 1 sigma. The rest of experiments don't require such mechanism. Combine results is something useful :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this feature was already removed by #463. This means we can now just drop this formatter and raise a warning if we encounter duplicated xvals -- bur likely for next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I opened #1394 to try to capture what we said here but maybe its description is out of date.
Also, I read all the code changes but one thing that is hard with a PR like this is to review all the code that didn't change. I tried checking the documentation and release notes and I think they still make sense with the changes that were made. |
Co-authored-by: Will Shanks <[email protected]>
and data points can also take metadata triplet (`data_uid`, `category`, `analysis`) | ||
to distinguish the subset. | ||
|
||
* The `data_uid` is an integer key representing the class of the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this info in the doc, @nkanazawa1989 . It helps me a lot to understand the concepts to know as a user of ScatterTable class. I've now understood data_uid
identifies a "series" of data to fit a model. And name
is another human-friendly "identifier" that indicates a series of data to fit a model as well. I've also found that the term series
has already been used in the Curve Analysis tutorial. Given that, series_id
and series_name
sound better to me. What do you think? @nkanazawa1989 @wshanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
series_index
instead of series_id
? Looking at the example in the tutorial, data_uid
can take a common value for different category. In that sense, it's not "ID" (to me, ID sounds like a solo unique identifier), it's more like "index" defined for each (category
, analysis
) data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this suggestion. Perhaps series_idx
to be shorter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One case to consider regarding the naming is StarkRamseyXYAmpScanAnalysis
. For that experiment, X and Y vs amp data are recorded. Then the X and Y data are converted into phase vs amp data which is fit. So there are two raw series that become the formatted series that is fit. Maybe series_
still makes sense since one series is passed to the model. It is the fit models though that have a one to one mapping the data_uid
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, I'm converging to Will's comment:
One concern I have to make the roles of name, data_uid, category, and model (not a column but a term used in the same context) clear.
Maybe, all we need is model_
with fixed categories (or "stage"s?) (as for StarkRamseyXYAmpScanAnalysis
, we could distinguish "ramsey_xy" and "freq" data sets using model_name
instead of category
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot define fit model for ramsey_xy because its P1 envelope is unpredictable. In this sense, "ramsey_xy" is another the intermediate data to produce the frequency curves which have fit models (Ramsey data is used for visualization and we cannot discard -- when fit doesn't go well experimentalists can get some insights from the raw Ramsey signal). This is why I cannot introduce fixed categories -- this requires the Ramsey analysis to define ScatterTable subclass with its own categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyways I also like series. This was originally used in the curve analysis to indicate the data belongs to a particular fit model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in ac5bdd8. Thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think the broadest way to think about the columns are that series
labels a single curve of some kind (ScatterTable
is under qiskit_experiments.curve_analysis
so it is safe to assume all the data belong to curves). category
is a label for a group of curves/series that all belong to the same stage of processing (raw/formatted/fitted). analysis
is a namespace for separating these groups of curves from different subexperiments.
@coruscating Is it important to get the release out today? I think I would like to clarify these terms a bit in the docs, but that could be done in a follow up if it is important to release today. I don't have any objections to the code (as long as further discussion doesn't make us want to change any names again).
f9d13bd
to
70867e8
Compare
70867e8
to
ac5bdd8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good. I will follow up about the documentation for ScatterTable
.
Summary
This PR updates the implementation of
ScatterTable
andAnalysisResultTable
based on the comment from @itoko .Details and comments
Current pattern heavily uses inheritance;
Table(DataFrame, MixIn)
, but this causes several problems. Qiskit Experiments class directly depends on the third party library, resulting in Sphinx directive mismatch and poor robustness of the API. Instead of using inheritance, these classes are refactored with composition and delegation, namelythis pattern is also common in other software libraries using dataframe. Since this PR removes unreleased public classes, this should be merged before the release.
Although this updates many files, these are just delegation of data handling logic to the class itself, which simplifies the implantation of classes that operate the container objects. Also new pattern allows more strict dtype management with dataframe.