BUG: concat unwantedly sorts DataFrame column names if they differ #4588

smcinerney · 2013-08-16T23:30:10Z

When concat'ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they're identical across DataFrames, they don't get sorted.
This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order as in SQL would be: columns from df1 (same order as in df1), columns (uniquely) from df2 (less the common columns) (same order as in df2). Example:

import numpy as np
from pandas import DataFrame, concat

df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))

print("Cols unsorted:", concat([df4a,df4b]))
# Cols unsorted:           C         B         D         A

print("Cols sorted", concat([df4a,df5]))
# Cols sorted           A         B         C         D         E

hayd · 2013-08-18T00:12:03Z

Looking at this briefly I think this stems from Index.intersection, whose docstring states:

Form the intersection of two Index objects. Sortedness of the result is not guaranteed

Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result...
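
For the default outer join, the column labels of the inputs are combined as a union rather than an intersection. The following is a minimal sketch of how such a union orders labels, assuming a pandas version recent enough that `Index.union` accepts a `sort` argument; the names `a` and `b` are only illustrative.

```python
import pandas as pd

a = pd.Index(["C", "B", "D", "A"])
b = pd.Index(["C", "B", "E", "D", "A"])

# Default behaviour: labels are sorted when the two indexes differ
default_union = a.union(b)

# sort=False keeps the labels of `a` first, then the new labels from `b`
unsorted_union = a.union(b, sort=False)

# Identical indexes are special-cased: the original order is returned
same_union = a.union(a)

print(list(default_union))
print(list(unsorted_union))
print(list(same_union))
```

This mirrors the report: identical column sets keep their order, while any mismatch triggers the sort unless it is switched off.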

jtratner · 2013-08-18T12:22:46Z

@smcinerney what order would you expect instead?

superkeyor · 2013-11-25T04:51:21Z

I found the auto sort was a bit annoying too (well, I should say depends on your purpose), because I was trying to concat a frame to an empty frame in a loop (like append an element to a list). Then I realized my column order changed. This change also applies to index, if you are concatenating along axis=1.

In a case similar to that of @smcinerney , I expect the final order of CBDAE. E shows up last because the order CBDA shows up first when concatenating.

Therefore I wrote a "hack" (kinda silly though)

sorted = pd.concat(frameList, axis=axis, join=join, join_axes=join_axes, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)

if join_axes:
    return sorted
elif sort:
    return sorted
else:
    # expand all original orders in each frame
    sourceOrder = []
    for frame in frameList:
        sourceOrder.extend(frame.Columns()) if axis == 0 else sourceOrder.extend(frame.Indices())
    sortedOrder = sorted.Columns() if axis == 0 else sorted.Indices()

    positions = []
    positionsSorted = []
    for i in sortedOrder:
        positions.append(sourceOrder.index(i))
        positionsSorted.append(sourceOrder.index(i))
    positionsSorted.sort()

    unsortedOrder = []
    for i in positionsSorted:
        unsortedOrder.append(sortedOrder[positions.index(i)])

    return sorted.ReorderCols(unsortedOrder) if axis == 0 else sorted.ReorderRows(unsortedOrder)

The function is included in my personal module called kungfu! Anyone can adopt the above algorithm, or have a look at my module at https://github.com/jerryzhujian9/kungfu

Finally, I greatly appreciate the work of the development team for this great module!
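
The reordering idea in the hack above (put each label where it first appears in the inputs) can also be sketched with plain pandas calls. This is not the kungfu implementation: `concat_keep_order` is a hypothetical helper name, it assumes the labels on the non-concatenation axis are unique, and it ignores the `join_axes` and `sort` branches of the original.

```python
import pandas as pd

def concat_keep_order(frames, axis=0, join="outer"):
    """Concatenate frames, ordering the other axis by first appearance."""
    result = pd.concat(frames, axis=axis, join=join)

    # Collect labels of the non-concatenation axis in first-seen order
    seen = []
    for frame in frames:
        labels = frame.columns if axis == 0 else frame.index
        for label in labels:
            if label not in seen:
                seen.append(label)

    # With join="inner" only shared labels survive, so filter on the result
    surviving = result.columns if axis == 0 else result.index
    ordered = [label for label in seen if label in surviving]

    # Reorder the axis that concat may have sorted
    return result.reindex(ordered, axis=1 if axis == 0 else 0)

df4a = pd.DataFrame([[1, 2, 3, 4]], columns=list("CBDA"))
df5 = pd.DataFrame([[5, 6, 7, 8, 9]], columns=list("CBEDA"))
combined = concat_keep_order([df4a, df5])
print(list(combined.columns))
```

For the frames from the original report this yields the CBDAE order described above.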

asteppke · 2014-05-28T15:27:39Z

This behavior is indeed quite unexpected and I also stumbled over it.

>>> df = pd.DataFrame()
>>> df['b'] = [1,2,3]
>>> df['c'] = [1,2,3]
>>> df['a'] = [1,2,3]
>>> print(df)
   b  c  a
0  1  1  1
1  2  2  2
2  3  3  3

[3 rows x 3 columns]
>>> df2 = pd.DataFrame({'a':[4,5]})
>>> df3 = pd.concat([df, df2])

Naively one would expect that the order of columns is preserved. Instead the columns are sorted:

>>> print(df3)
   a   b   c
0  1   1   1
1  2   2   2
2  3   3   3
0  4 NaN NaN
1  5 NaN NaN

[5 rows x 3 columns]

This can be corrected by reindexing with the original columns as follows:

>>> df4 = df3.reindex_axis(df.columns, axis=1)
>>> print(df4)
    b   c  a
0   1   1  1
1   2   2  2
2   3   3  3
0 NaN NaN  4
1 NaN NaN  5

[5 rows x 3 columns]

Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know.
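
`reindex_axis` has since been removed from pandas. A sketch of the same correction with `reindex`, assuming a pandas version where `DataFrame.reindex` accepts `columns=`:

```python
import pandas as pd

df = pd.DataFrame()
df["b"] = [1, 2, 3]
df["c"] = [1, 2, 3]
df["a"] = [1, 2, 3]
df2 = pd.DataFrame({"a": [4, 5]})

df3 = pd.concat([df, df2])

# Restore the column order of the first frame
df4 = df3.reindex(columns=df.columns)
print(df4)
```

This only restores columns that exist in `df`; columns found solely in later frames would be dropped by the reindex.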

zadacka · 2014-11-19T21:46:19Z

I've just come across this too.

new_data = pd.concat([churn_data, numerical_data])

Produced a DataFrame:

     churn  Var1  Var10  Var100  Var101 
0      -1   NaN    NaN     NaN     NaN     
1      -1   NaN    NaN     NaN     NaN

It would seem more natural for the numerical DataFrame to be concatenated without being sorted first!!

jreback · 2014-11-19T22:08:03Z

well, this is a bit of work to fix. but pull requests accepted!

rasbt · 2015-01-13T03:40:41Z

Just stumbled upon this same issue when I was concatenating DataFrames. It's a little bit annoying if you don't know about this issue, but actually there is a quick remedy:

Say dfs is a list of DataFrames you want to concatenate; you can just take the original column order and feed it back in:

df = pd.concat(dfs, axis=0)
df = df[dfs[0].columns]

max-sixty · 2015-02-05T20:06:22Z

I believe append causes the same behavior, FYI

inbredtom · 2015-03-26T09:57:24Z

It's the default behaviour across the board. For example, if you apply a function, f, to a groupby() that returns a varying number of columns, the concatenation taking place behind the scene also auto-sorts the columns.

df.groupby(some_ts).apply(f)

Likely because the known order of the columns is open to interpretation.

However, this also happens for MultiIndices and all hierarchies in MultiIndices. So you can concat dataframes that agree on level0 columns and all bar one level1 columns, and all levels of the MultiIndices will be autosorted because of one mismatch within one level0 column. I don't imagine that is desirable.

I'd love to help, but unfortunately fixing this issue is beyond my ability. Thanks for the hard work all.

vitalyisaev2 · 2015-04-25T14:56:19Z

+1 for this feature

ashishsingal1 · 2015-05-30T16:01:31Z

Agreed, +1. Unexpected sorting happens all the time for me.

scyllagist · 2015-08-01T10:00:24Z

+1, this was an unpleasant surprise!

Zenadix · 2015-08-27T18:40:31Z

+1, I hate having the columns sorted after every append.

summerela · 2015-09-02T23:29:39Z

+1 from me, as well.

Because even if I did want to manually re-order after a concat, when I try to print out the 60+ column names and positions in my dataframe:

for position, name in enumerate(df.columns):
    print(position, name)

All 60+ columns are output in alphabetical order, not their actual position in the data frame.

So that means that after every concat, I have to manually type out a list of 60 columns to reorder. Ouch.

While I'm here, does anyone have a way to print out column name and position that I'm missing?

jmwoloso · 2016-01-20T00:01:02Z

+1 for this feature, just ran across the same deal myself.

@summerela Get the column index and then re-index your new dataframe using the original column index

# assuming you have two dataframes, `df_train` & `df_test` (with the same columns) 
# that you want to concatenate

# get the columns from one of them
all_columns = df_train.columns

# concatenate them
df_concat = pd.concat([df_train,
                       df_test])

# finally, re-index the new dataframe using the original column index
df_concat = df_concat.loc[:, all_columns]

Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe, but you aren't sure how many columns may remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you could pass a relative index position to new_indices --> new_indices = [-1] and it will take care of the rest.

def reindex_columns(dframe=None, columns=None, new_indices=None):
    """
    Reorders the columns of a dataframe as specified by
    `new_indices`. Values of `columns` should align with their
    respective values in `new_indices`.

    `dframe`: pandas dataframe.

    `columns`: list, pandas.Index, or numpy array; columns to
    reindex.

    `new_indices`: list of integers; indices
    corresponding to where each column should be inserted during
    re-indexing.
    """
    print("Re-indexing columns.")
    try:
        df = dframe.copy()

        # ensure parameters are of correct type and length
        assert isinstance(columns, (pd.Index,
                                    list,
                                    np.ndarray)),\
        "`columns` must be of type `pandas.Index`, `list`, or `numpy.ndarray`"

        assert isinstance(new_indices,
                          list),\
        "`new_indices` must be of type `list`"

        assert len(columns) == len(new_indices),\
        "Length of `columns` and `new_indices` must be equal"

        # check for negative values in `new_indices`
        if any(idx < 0 for idx in new_indices):

            # get a list of the negative values
            negatives = [value for value
                         in new_indices
                         if value < 0]

            # find the index location for each negative value in
            # `new_indices`
            negative_idx_locations = [new_indices.index(negative)
                                      for negative in negatives]

            # zip the lists
            negative_zipped = list(zip(negative_idx_locations,
                                       negatives))

            # replace the negatives in `new_indices` with their
            # absolute position in the index
            for idx, negative in negative_zipped:
                new_indices[idx] = df.columns.get_loc(df.columns[
                                                          negative])

        # re-order the index now
        # get all columns
        all_columns = df.columns

        # drop the columns that need to be re-indexed
        all_columns = all_columns.drop(columns)

        # now re-insert them at the specified locations
        zipped_columns = list(zip(new_indices,
                                  columns))

        for idx, column in zipped_columns:
            all_columns = all_columns.insert(idx,
                                             column)
        # re-index the dataframe
        df = df.loc[:, all_columns]

        print("Successfully re-indexed dataframe.")

    except Exception as e:
        print(e)
        print("Could not re-index columns. Something went wrong.")

    return df

Edit: Usage would look like the following:

# move 'Column_1' to the end, move 'Column_2' to the beginning
df = reindex_columns(dframe=df,
                     columns=['Column_1', 'Column_2'],
                     new_indices=[-1, 0])

patricktokeeffe · 2016-03-23T17:58:25Z

I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn't work:

dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]

The columns still get sorted. ~~Using lists intermediately resolved things, though:~~

Edit: I spoke too soon..

Follow-up: fwiw, column order can be preserved with chained .join calls on singular objects:

df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained

MikeTam1021 · 2018-03-31T12:09:03Z

I’m pretty sure that’s exactly what people in this thread have been discussing. I see lots of good solutions above that should work.

h-vetinari · 2018-04-01T14:43:21Z

@MikeTam1021 It's not about turning pandas into SQL (heaven forbid!), but I couldn't agree more with:

There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

Concatenating DataFrames should have the same effect as "writing them next to each other", and that implicit sort definitely violates the principle of least astonishment.

MikeTam1021 · 2018-04-01T18:40:28Z

I agree. It shouldn’t. It also assumes an order to the columns, which is SQLish, and not pure computer science. You should really know where your data is.

I hardly use pandas anymore after discovering this and many other issues. It has made me a better programmer.

armant · 2018-04-04T00:49:56Z

+1 on this

Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to concat and DataFrame.append, mismatch_sort, which is by default disabled. Close pandas-dev#4588

bcucek · 2018-04-18T21:12:09Z

This works for me:

cols = list(df1)+list(df2)
df1 = pd.concat([df1, df2])
df1 = df1.loc[:, cols]
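
One caveat with the snippet above: if df1 and df2 share a column name, `list(df1)+list(df2)` repeats it, and `.loc` then selects that column twice. A small sketch that removes the repeats while keeping first-seen order, using made-up frames for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"C": [1], "B": [2], "A": [3]})
df2 = pd.DataFrame({"B": [4], "E": [5]})

# dict.fromkeys keeps only the first occurrence of each name, in order
cols = list(dict.fromkeys(list(df1) + list(df2)))

combined = pd.concat([df1, df2]).loc[:, cols]
print(list(combined.columns))
```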

…0613) * Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to concat and DataFrame.append, mismatch_sort, which is by default disabled. Closes #4588
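
In the released API the switch appears as a `sort` keyword on `concat` (the "sort option" referred to later in this thread) rather than `mismatch_sort`. A short sketch of both settings, assuming pandas 0.23 or later and reusing the column layout from the original report:

```python
import numpy as np
import pandas as pd

df4a = pd.DataFrame(np.random.randn(3, 4), columns=["C", "B", "D", "A"])
df5 = pd.DataFrame(np.random.randn(3, 5), columns=["C", "B", "E", "D", "A"])

# Keep columns in order of first appearance
kept = pd.concat([df4a, df5], sort=False)

# Explicitly request the old sorted behaviour
alphabetical = pd.concat([df4a, df5], sort=True)

print(list(kept.columns))
print(list(alphabetical.columns))
```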

DavidEscott · 2018-10-04T14:17:56Z

I have to bitch about how this patch is rolled out. You have simultaneously changed the function signature of concat AND introduced a warning about the usage. All within the same commit.

The problem with that is that we use pandas on multiple servers and cannot guarantee that all servers have the exact same version of pandas at all times. So now we have less technical users seeing warnings from programs they have never seen before, and are uncertain if the warning is a sign of a problem.

I can readily identify WHERE the warning is coming from, but I can't add either of the suggested options because that would break the program on any server running an older version of pandas.

It would have been preferable if you put the sorting capability in to 0.23, and added the warning to some later version. I know it's a pain, but it's rather obnoxious to assume that the users can immediately update all deployments to the latest code.

TomAugspurger · 2018-10-04T14:24:57Z

It sounds like you can just set a global filter for this warning and then drop that when everyone is upgraded. Functionally, that's the same right?


DavidEscott · 2018-10-04T15:06:33Z

@TomAugspurger There are a multitude of ways that we on our side can deal with this. Certainly filtering warnings is one. It's not great because the mechanics of warnings filters are a bit ugly...

- I would have to add the filter to multiple programs
- Not a great way to specify a specific warning to filter:
  - I can filter by module and lineno, but that isn't a stable reference,
  - I can filter by module and FutureWarning but then I wouldn't get any warnings at all from pandas and would be surprised by other changes,
  - or I can filter by your long multi-line message
- And then remember to take that filter out when everything is upgraded and it no longer matters.

In any case the deficiencies in the warnings module are certainly not something I can put at the foot of the pandas team.

Nor is it your fault that we have an older server we can't easily upgrade, so that would be the other thing I can do (just upgrade all the damn deployments). Ultimately, I recognize that I have to do that and that it is my responsibility to try and keep our deployments close together.

It just seems a bit bizarre to me that you were so concerned about a possible change in user visible end behavior that you added this sort option to what was previously an underspecified API, and yet have simultaneously thrown a warning at the programmer... both the warning and the proposed change in sort behavior constitute "user visible behavior" in my book, just of different severities.
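
The "filter by message" option discussed above could look roughly like the following sketch. The message prefix is an assumption about the warning text and should be checked against the warning actually emitted; `install_concat_sort_filter` is a hypothetical helper name.

```python
import warnings

# Assumed start of the warning text; verify against the real message.
ASSUMED_PREFIX = "Sorting because non-concatenation axis is not aligned"

def install_concat_sort_filter():
    """Ignore only FutureWarnings whose message starts with the prefix."""
    warnings.filterwarnings(
        "ignore",
        message=ASSUMED_PREFIX,  # regex matched at the start of the message
        category=FutureWarning,
    )

install_concat_sort_filter()
```

Because `message` is matched only against the beginning of the text, the full multi-line message does not have to be reproduced, and other FutureWarnings from pandas still get through.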

SHi-ON · 2019-06-24T13:39:14Z

I've answered a related question on SO.

jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

jreback mentioned this issue Dec 4, 2014

concat sorts columns without preserving order #9001

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback added Prio-high labels Jan 29, 2016

jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016

jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

brycepg mentioned this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default #20613

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.23.0 Apr 5, 2018

cpbl mentioned this issue Apr 8, 2018

interleave_se_columns_as_rows changes order of columns for MultiIndex columns cpbl/cpblUtilities-MOVED-TO-GITLAB#15

Closed

TomAugspurger closed this as completed in #20613 May 1, 2018

hdoupe mentioned this issue Jul 6, 2018

Revise cps_stage4/extrapolation.py script PSLmodels/taxdata#242

Merged

raoulcollenteur mentioned this issue Jul 16, 2018

Order of columns of parameters dataframe changes pastas/pastas#62

Closed

martinholmer mentioned this issue Jul 25, 2018

Different test results on pr-261-MH branch PSLmodels/taxdata#265

Closed

scottcha mentioned this issue Jan 24, 2020

Add defaults during concat 508 pydata/xarray#3545

Closed

4 tasks

mathrick mentioned this issue Sep 3, 2021

BUG: concat still sorts columns if they differ #43375

Closed

3 tasks

Wikilicious mentioned this issue Mar 6, 2024

DOC: Incorrect Description For pd.concat sort Argument #57753

Closed

1 task
