New DataFrame feature: listify() and unlistify() #10511
Comments
How do you create these data frames with nested lists? My sense is that there is almost always a better way. I'm reluctant to expand the dataframe API for new methods unless they are broadly useful -- this is a large part of why we added the |
My use case is using a dataframe through the various parts of an experiment: planning, execution, data collection, and analysis. At the planning stage, when building the initial data frame, I still don't know what one of the columns will contain; it holds "perpendicular" data (data independent of all the other columns). Once I know it, I want to insert the new data. I.e. I want to go from:
to:
With the unlistify() function this is trivial:
df['B'] = [range(5),range(3)]
df = df.unlistify('B')
But perhaps there is a different simple way that I have missed? I don't see how the |
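For comparison, the step above can be sketched with `DataFrame.explode`, which is available in pandas 0.25 and later. This is only an illustration, not the proposed `unlistify()` method; the starting frame and the column names `A` and `B` are assumptions.

```python
import pandas as pd

# Hypothetical starting frame; the column names are illustrative only.
df = pd.DataFrame({"A": ["a", "b"]})
df["B"] = [list(range(5)), list(range(3))]

# One row per list element; the values of the other columns are repeated.
flat = df.explode("B").reset_index(drop=True)
print(flat)
```

Note that `explode` leaves the exploded column with object dtype, so a cast such as `flat["B"].astype(int)` may be wanted afterwards.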
Maybe you could show an example of how you make the dataframe? Presumably you're not reading it in from CSV. I guess my thought is that you might be able to easily create the "unlisted" dataframe in the first place. Pipe is indeed a side point here. Mostly I mentioned it to point out that we are trying to put other libraries on an equal footing with what we put in pandas proper. |
For example, suppose the number of items in the lists depends on the other columns, e.g., suppose |
Thanks for the In any case, I think it is a legitimate use case to start off with a few dimensions and then unlist new dimensions to add additional complexity. I don't think it is justified to force the user to declare all dimensions in advance. Initially I thought of adding a function to |
@dov you are just using a much less-efficient form of multi-index. You lose all performance and indexing with the list/unlistify. Mainly the mixed-list structures force you to be in Compare to a regular multi-index
Not sure why one would prefer a non-native structure that would not have any advantages (and several key disadvantages) over the multi-index structure. Maybe you could shed some light on why you are not using a multi-index structure. |
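To make the comparison concrete, here is a minimal sketch of the two layouts. It is not the code from the comment above; the data, the column names, and the `pos` index level are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: two outer keys holding 5 and 3 inner values.
lists = {"a": list(range(5)), "b": list(range(3))}

# List-in-cell form: the column holding the lists can only be object dtype.
nested = pd.DataFrame({"A": list(lists), "B": list(lists.values())})

# MultiIndex form: one row per (outer key, position); values keep a numeric dtype.
index = pd.MultiIndex.from_tuples(
    [(key, pos) for key, values in lists.items() for pos in range(len(values))],
    names=["A", "pos"],
)
flat = pd.Series(
    [v for values in lists.values() for v in values], index=index, name="B"
)

print(nested["B"].dtype)  # object
print(flat.loc["b"])      # label-based selection of one group
```

With the MultiIndex form, selection by outer key, vectorised arithmetic, and groupby on the `A` level all work on the numeric values directly.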
I think the difference is that I see the DataFrame as something evolving in which you don't see the whole picture at the time of its construction. I will try to give an example. Let's say that you have an xy-table with a camera taking pictures of a plate filled with microorganisms. The field of view is much smaller than the plate. The goal is to do image processing and classification of the microorganisms in the images. Here are the steps that need to be carried out:
Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property. In this sense the dataframe is like a logbook for the experiment. It grows and possibly contracts as the experiment progresses. This is my goal. The idea of the unlistify() function was just a means of doing this. shoyer showed me the same functionality can be achieved through |
I think the idiomatic way to do such an operation in pandas would be to use Database-style DataFrame joining/merging. |
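As a rough sketch of what such a join could look like for the imaging example, assuming one table with a row per image and a second table with a row per detected organism (the table names, column names, and values are all hypothetical):

```python
import pandas as pd

# Hypothetical table with one row per image position on the xy-table.
images = pd.DataFrame({"image_id": [0, 1], "x": [0.0, 1.5], "y": [0.0, 0.0]})

# Hypothetical table with one row per detected organism, referring back to its image.
detections = pd.DataFrame(
    {"image_id": [0, 0, 1], "organism": ["rod", "coccus", "rod"]}
)

# Database-style join: the image columns are repeated for each detection.
combined = images.merge(detections, on="image_id", how="left")
print(combined)
```

Using `how="left"` keeps images that have no detections, with missing values in the detection columns.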
I'll save it here so that this issue is easier to find by search. This SO answer explains how to do this really easily. |
as said above, this is not good data management, nor likely to be supported by pandas. this leads to really inefficient representations of data. |
@jreback Yeah, I agree with you. But when I'm getting data from MongoDB with a really nested structure, and I want to make some columns for data processing, this is the only way to preprocess the data and keep the shape references between the original data and the preprocessed dataset |
@libbkmz then you need a proper layer in between. you can certainly pre-process using pandas, but de-listifying (ala |
Hello @jreback , just wanted to ask if you want to close this issue? Do you want to tag it as |
I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue. I fully understand that it is less efficient and there are likely better dataframe architectures. However, I (and looking at the number of stackoverflow questions on this - many, many others) have run into this an astounding number of times. Therefore, I would think it highly useful to have at least an |
not sure what to make of that statement. pandas has almost 2200 issues, and no full time folks working on. People prioritize what they will. Do you want to submit a pull-request to fix this? if so great. |
Pandas might have 2200 issues (like any other software or app nowadays), but I believe this one is very important when dealing with unstructured data. I work with unstructured data, and listify() and unlistify() can be very handy here and save lots of valuable time. |
@dmarinav and you are welcome to submit a fix. what folks choose to work on is pretty much up to them. |
FWIW, I can see someone building a JSONArray on top of #19268 (probably not within pandas). I think that + a custom |
I end up having to do this kind of thing all the time.. and it's a complete PITA. Having a feature that allows us to "explode" a column containing lists into multiple rows would be wonderful. |
+1 I have stumbled upon a need for this many times (and made a package with just this function). I think that |
While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:
I.e. I propose two new DataFrame methods, listify() and unlistify().
listify(df, column): Takes as input a dataframe and the name of a column. It will do a groupby of the df on all columns except column and generate a single row per group, where the value in the column cell is a list of the column values.
unlistify(df, column): Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column for each row and generate a new row for each value.
The functions may be expanded to support multiple columns. listify() may e.g. support a post-processing function that will be applied to the list.
The following python code illustrates these two functions. But obviously the functionality may be implemented more efficiently at the C level.
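As a stand-in, here is a minimal sketch of how the two functions described above might be written with current pandas. This is not the author's original code; the function bodies and the example frame are assumptions, and `listify` assumes the frame has at least one column besides `column`.

```python
from itertools import chain

import numpy as np
import pandas as pd


def listify(df, column):
    """Group on all other columns and gather the values of `column` into lists."""
    others = [c for c in df.columns if c != column]
    return df.groupby(others, sort=False)[column].agg(list).reset_index()


def unlistify(df, column):
    """Emit one row per element of the list-like values in `column`."""
    lengths = df[column].map(len).to_numpy()
    # Repeat every row once per element of its list.
    out = df.drop(columns=column).iloc[np.repeat(np.arange(len(df)), lengths)]
    out = out.reset_index(drop=True)
    out[column] = list(chain.from_iterable(df[column]))
    # Restore the original column order.
    return out[df.columns]


# Hypothetical example frame.
df = pd.DataFrame({"A": ["a", "b"], "B": [[0, 1, 2], [3, 4]]})
flat = unlistify(df, "B")
roundtrip = listify(flat, "B")
print(flat)
print(roundtrip)
```

In this sketch `listify(unlistify(df, "B"), "B")` recovers the original frame only when the other columns uniquely identify each original row.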
The corresponding output: