BUG: concat unwantedly sorts DataFrame column names if they differ #4588
Comments
Looking at this briefly I think this stems from Index.intersection, whose docstring states:
Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result... |
@smcinerney what order would you expect instead? |
I found the auto sort a bit annoying too (though that depends on your purpose), because I was trying to concat a frame to an empty frame in a loop (like appending an element to a list). Then I realized my column order had changed. The same applies to the index if you are concatenating along axis=1. In a case similar to that of @smcinerney, I expect the final order CBDAE: E shows up last because C, B, D, A appear first when concatenating. Therefore I wrote a "hack" (kinda silly though):
sorted = pd.concat(frameList, axis=axis, join=join, join_axes=join_axes, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)
The function is included in my personal module called kungfu! Anyone can adopt the above algorithm, or have a look at my module at https://github.com/jerryzhujian9/kungfu Finally, I greatly appreciate the work of the development team for this great module! |
This behavior is indeed quite unexpected and I also stumbled over it.
Naively one would expect that the order of columns is preserved. Instead the columns are sorted:
This can be corrected by reindexing with the original columns as follows:
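A minimal sketch of the reindexing workaround, since the original snippets did not survive in this copy of the thread. The frames and column letters are made up for illustration; only `pd.concat` and `DataFrame.reindex` are pandas calls.

```python
import pandas as pd

# Illustrative frames: df2 lacks D and A and adds E.
df1 = pd.DataFrame({"C": [1], "B": [2], "D": [3], "A": [4]})
df2 = pd.DataFrame({"C": [5], "B": [6], "E": [7]})

# Depending on the pandas version, the columns of this result may come back sorted.
combined = pd.concat([df1, df2], ignore_index=True)

# Desired order: df1's columns first, then any columns that only appear in df2.
wanted = list(df1.columns) + [c for c in df2.columns if c not in df1.columns]

# Reindexing on the column axis restores the order regardless of what concat did.
restored = combined.reindex(columns=wanted)
print(list(restored.columns))
```

Because the order is imposed after the fact, this should behave the same whether or not the installed pandas sorts mismatched columns.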
Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know. |
I've just come across this too.
Produced a DataFrame:
It would seem more natural for the numerical DataFrame to be concatenated without being sorted first!! |
well, this is a bit of work to fix. but pull requests accepted! |
Just stumbled upon this same issue when I was concatenating say
|
I believe |
It's the default behaviour across the board. For example, if you apply a function, f, to a groupby() that returns a varying number of columns, the concatenation taking place behind the scene also auto-sorts the columns.
Likely because the known order of the columns is open to interpretation. However, this also happens for MultiIndices and all hierarchies in MultiIndices. So you can concat dataframes that agree on level0 columns and all bar one level1 columns, and all levels of the MultiIndices will be autosorted because of one mismatch within one level0 column. I don't imagine that is desirable. I'd love to help, but unfortunately fixing this issue is beyond my ability. Thanks for the hard work all. |
+1 for this feature |
Agreed, +1. Unexpected sorting happens all the time for me. |
+1, this was an unpleasant surprise! |
+1, I hate having the columns sorted after every |
+1 from me, as well. Because even if I did want to manually re-order after a concat, when I try to print out the 60 + column names and positions in my dataframe:
All 60+ columns are output in alphabetical order, not their actual position in the data frame. So that means that after every concat, I have to manually type out a list of 60 columns to reorder. Ouch. While I'm here, does anyone have a way to print out column name and position that I'm missing? |
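One possible answer to the question above, as a small sketch: iterating over `df.columns` yields the names in their actual position, so `enumerate` pairs each name with its index. The frame here is a made-up example.

```python
import pandas as pd

# Made-up frame whose columns are deliberately not in alphabetical order.
df = pd.DataFrame({"C": [1], "B": [2], "A": [3]})

# Positions follow the frame's actual column order, not alphabetical order.
positions = list(enumerate(df.columns))
for pos, name in positions:
    print(pos, name)

# Look up the position of a single column by name (assumes unique column names).
b_pos = df.columns.get_loc("B")
```

The saved `list(df.columns)` from before a concat can also be reused later to reindex, which avoids typing the names out by hand.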
+1 for this feature, just ran across the same deal myself. @summerela Get the column index and then re-index your new dataframe using the original column index
Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe, but you aren't sure how many columns may remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you could pass a relative index position to
Edit: Usage would look like the following:
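The commenter's function and usage example are missing from this copy of the thread. The following is a hypothetical stand-in, not the original: `move_column` and its `position` argument are names invented here, and negative positions are assumed to count from the end, as with Python list indices.

```python
import pandas as pd

def move_column(df, name, position):
    """Return a copy of df with column `name` moved to `position`.

    Hypothetical helper: a negative position counts from the end,
    so -1 means "last column" however many columns remain.
    """
    cols = [c for c in df.columns if c != name]
    if position < 0:
        # Convert the relative position into an absolute insert index.
        position = len(cols) + 1 + position
    cols.insert(position, name)
    return df[cols]

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
moved_last = move_column(df, "b", -1)   # b becomes the last column
moved_first = move_column(df, "d", 0)   # d becomes the first column
```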
|
I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn't work:
dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]
The columns still get sorted. Edit: I spoke too soon.. Follow-up: fwiw, column order can be preserved with chained joins:
df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained |
I’m pretty sure that’s exactly what people in this thread have been discussing. I see lots of good solutions above that should work. |
@MikeTam1021 It's not about turning pandas into SQL (heaven forbid!), but I couldn't agree more with:
Concatenating DataFrames should have the same effect as "writing them next to each other", and that implicit sort definitely violates the principle of least astonishment. |
I agree. It shouldn’t. It also assumes an order to the columns, which is SQLish, and not pure computer science. You should really know where your data is. I hardly use pandas anymore after discovering this and many other issues. It has made me a better programmer. |
+1 on this |
Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to concat and DataFrame.append, mismatch_sort, which is by default disabled. Close pandas-dev#4588
This works for me:
|
…0613) * Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to concat and DataFrame.append, mismatch_sort, which is by default disabled. Closes #4588
I have to bitch about how this patch is rolled out. You have simultaneously changed the function signature of concat AND introduced a warning about the usage. All within the same commit. The problem with that is that we use pandas on multiple servers and cannot guarantee that all servers have the exact same version of pandas at all times. So now we have less technical users seeing warnings from programs they have never seen before, and are uncertain if the warning is a sign of a problem. I can readily identify WHERE the warning is coming from, but I can't add either of the suggested options because that would break the program on any server running an older version of pandas. It would have been preferable if you put the sorting capability into 0.23, and added the warning to some later version. I know it's a pain, but it's rather obnoxious to assume that the users can immediately update all deployments to the latest code. |
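One way to write a call site that runs on both sides of the signature change, sketched under the assumption that an older pandas rejects the unknown `sort` keyword with a `TypeError`. The helper name `concat_compat` is hypothetical and not part of pandas.

```python
import pandas as pd

def concat_compat(frames, **kwargs):
    """Hypothetical wrapper: pass sort=False where the keyword is supported."""
    try:
        return pd.concat(frames, sort=False, **kwargs)
    except TypeError:
        # Assumed to mean this pandas predates the `sort` keyword;
        # fall back to the old call, which may sort mismatched columns.
        return pd.concat(frames, **kwargs)

df1 = pd.DataFrame({"C": [1], "B": [2], "D": [3], "A": [4]})
df2 = pd.DataFrame({"C": [5], "B": [6], "E": [7]})
out = concat_compat([df1, df2], ignore_index=True)
```

Note that this only avoids the warning and the crash. On a version without `sort`, the fallback still returns whatever column order that version produces, so the results can differ between servers unless the columns are reindexed afterwards.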
It sounds like you can just set a global filter for this warning and then drop that when everyone is upgraded. Functionally, that's the same right?
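A sketch of such a global filter using the standard-library `warnings` module. The message prefix below is an assumption about the wording of the pandas warning and should be checked against what your installed version actually prints; the function name is hypothetical.

```python
import warnings

# Assumption: the concat warning is a FutureWarning whose text starts like this.
CONCAT_SORT_MESSAGE = r"Sorting because non-concatenation axis is not aligned"

def silence_concat_sort_warning():
    """Install an 'ignore' filter for the concat sort warning only."""
    warnings.filterwarnings(
        "ignore",
        message=CONCAT_SORT_MESSAGE,  # regex matched against the start of the message
        category=FutureWarning,
    )

# Call once at program start-up; remove when every deployment is upgraded.
silence_concat_sort_warning()
```

Filtering on both message and category keeps other `FutureWarning`s visible, and the filter is harmless on servers where the warning is never raised.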
…On Thu, Oct 4, 2018 at 9:18 AM DavidEscott ***@***.***> wrote:
I have to bitch about how this patch is rolled out. You have
simultaneously changed the function signature of concat AND introduced a
warning about the usage. All within the same commit.
The problem with that is that we use pandas on multiple servers and cannot
guarantee that all servers have the exact same version of pandas at all
times. So now we have less technical users seeing warnings from programs
they have never seen before, and are uncertain if the warning is a sign of
a problem.
I can readily identify WHERE the warning is coming from, but I can't add
either of the suggested options because that would break the program on any
server running an older version of pandas.
It would have been preferable if you put the sorting capability in to
0.23, and added the warning to some later version. I know its a pain, but
it's rather obnoxious to assume that the users can immediately update all
deployments to the latest code.
|
@TomAugspurger There are a multitude of ways that we on our side can deal with this. Certainly filtering warnings is one. It's not great because the mechanics of warnings filters are a bit ugly...
In any case the deficiencies in the Nor is it your fault that we have an older server we can't easily upgrade, so that would be the other thing I can do (just upgrade all the damn deployments). Ultimately, I recognize that I have to do that and that it is my responsibility to try and keep our deployments close together. It just seems a bit bizarre to me that you were so concerned about a possible change in user visible end behavior that you added this sort option to what was previously an underspecified API, and yet have simultaneously thrown a warning at the programmer... both the warning and the proposed change in sort behavior constitute "user visible behavior" in my book, just of different severities. |
I've answered a related question on SO. |
When concat'ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they're identical across DataFrames, they don't get sorted.
This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order as in SQL would be: columns from df1 (same order as in df1), columns (uniquely) from df2 (less the common columns) (same order as in df2). Example:
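The original example is missing from this copy. As a stand-in, the ordering proposed above can be sketched in plain Python, independent of pandas: keep each name at the position where it is first seen. The function name and the column letters are illustrative only.

```python
def sql_style_column_order(*column_lists):
    """Columns of the first frame in order, then unseen columns of each later frame."""
    seen = set()
    ordered = []
    for columns in column_lists:
        for name in columns:
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered

# df1 has columns C, B, D, A; df2 has B, E, C.
order = sql_style_column_order(["C", "B", "D", "A"], ["B", "E", "C"])
print(order)
```

The resulting list could then be passed to `reindex(columns=order)` on the concatenated frame.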