BUG: erroneous initialization of a DataFrame with Series objects #42818

raphaelquast · 2021-07-30T13:03:07Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

x = pd.Series(["a", "b", "c"])
y = pd.Series([1, 2, 3])

pd.DataFrame(y, x)
>>>     0
>>> a NaN
>>> b NaN
>>> c NaN

pd.DataFrame(x, y)
>>>      0
>>> 1    b
>>> 2    c
>>> 3  NaN

pd.DataFrame(x.values, y.values)
>>>    0
>>> 1  a
>>> 2  b
>>> 3  c

Problem description

I would expect pd.Series objects to be valid inputs for the DataFrame constructor.
If this is not the case a warning (or even raising an error) would be nice...

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : German_Austria.1252

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


jreback · 2021-07-30T13:11:53Z

Use a list of the Series.

What you are passing is actually valid, but the 2nd arg is the index.

raphaelquast · 2021-07-30T14:45:42Z

@jreback thanks for responding!

I don't fully get what you mean by "it is actually valid"... the result is really not what I expected
(and I know that the 2nd arg is supposed to be the index...)

Let me clarify this a bit more:
I was just expecting that passing a Series object would be similar to passing a list or a 1D numpy array.
To summarize:

x = [1,2,3,4,5]
pd.DataFrame(x, index=x)                         # OK
pd.DataFrame(np.array(x), index=x)               # OK
pd.Series(x).to_frame().set_index(pd.Series(x))  # OK

pd.DataFrame(pd.Series(x), index=x)              # NOT OK
pd.DataFrame(dict(a=pd.Series(x)), index=x)      # NOT OK
pd.DataFrame([pd.Series(x)], index=x)            # NOT OK  (generates a 2D grid of values)
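For reference, a minimal sketch of how the positional behaviour expected above can be obtained from a Series. The variable names (`by_label`, `by_position`, `also_by_position`) are illustrative only, and this is one possible workaround rather than an officially recommended pattern.

```python
import pandas as pd

x = [1, 2, 3, 4, 5]
s = pd.Series(x)  # default index 0..4

# Passing the Series directly: its labels 0..4 are matched against the
# requested index 1..5, so label 5 finds no value and becomes NaN.
by_label = pd.DataFrame(s, index=x)

# Dropping the Series index first (plain NumPy array) gives the
# positional assignment that lists and arrays get.
by_position = pd.DataFrame(s.to_numpy(), index=x)

# Alternatively, relabel the Series and then convert it to a frame.
also_by_position = s.set_axis(x).to_frame()

print(by_label)
print(by_position)
```

The first frame holds the shifted values with a trailing NaN, while the other two keep the values 1 to 5 in their original order.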

phofl · 2021-07-30T16:36:05Z

The index of the Series and the index of the DataFrame are aligned. This causes the NaNs, since the two indexes do not have matching entries.

x = pd.Series([2, 1, 0])
y = pd.Series([1, 2, 3])

pd.DataFrame(y, x)

returns

   0
2  3
1  2
0  1

which should make clear what happens here
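The same alignment can be spelled out explicitly. The sketch below assumes that, for a single Series, the constructor call behaves like a `reindex` followed by `to_frame`. That is an observation about the result, not a statement about how pandas implements it internally.

```python
import pandas as pd

x = pd.Series([2, 1, 0])
y = pd.Series([1, 2, 3])  # labels 0, 1, 2

# The Series labels are looked up in the requested index (2, 1, 0),
# so each row receives the value stored under that label.
aligned = pd.DataFrame(y, index=x)

# Explicit form of the same lookup: reindex by label, then make a frame.
explicit = y.reindex(x).to_frame()

print(aligned)
print(explicit)
```

Both frames contain the values 3, 2, 1 under the index 2, 1, 0.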

raphaelquast · 2021-08-03T16:52:30Z

@phofl
aaaah, OK i got it 😄 sorry for the confusion!

so basically if the data (e.g. y) already has an index defined, the index gets aligned with respect to the new index (instead of being replaced with the new index)

I consider myself as a long-term user of pandas but this was so far not clear to me!
(also the doc of pd.DataFrame does not mention this special treatment of Series and DataFrame compared to other iterables)

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.

not sure, but how about a comment in the doc of pd.DataFrame to clarify this?
...or maybe just an Example that uses Series or mixed objects to construct a DataFrame... something like:

>>> pd.DataFrame(dict(a=[1,2,3,4,5],
...                   b=pd.Series([3,4,5], index=[3,4,5])),
...              index=[1,2,3,4,5])
   a    b
1  1  NaN
2  2  NaN
3  3  3.0
4  4  4.0
5  5  5.0

simonjayhawkins · 2021-08-04T09:48:35Z

PRs always welcome to clarify/enhance the documentation.

tyuyoshi · 2021-08-05T23:51:42Z

Hi, can I pick this issue as my first OSS contribution?

simonjayhawkins · 2021-08-06T09:49:02Z

@tyuyoshi sure. go for it!

dhivyadharshin · 2021-08-17T17:21:04Z

Can I fix the issues?

Stark-developer01 · 2021-08-28T16:27:05Z

@tyuyoshi sure. go for it!

I have proposed a PR at #43271. Can the awaiting workflows for the PR be approved?

rajaryanbanka · 2021-08-29T06:57:16Z

Hi, I would like to take up this problem as my first open source contribution.

ankitasankars · 2021-08-29T13:33:54Z

Hello, Can I use this issue as my first contribution?

goyaldhara · 2021-08-29T16:12:22Z

take

Tarunssd · 2021-08-29T17:08:55Z

take

amanj7820 · 2021-08-29T17:41:21Z

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
df = prcpSeries.to_frame(name='prcp')
An alternative here is to create each as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
A B
0 1 1
1 2 NaN
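A self-contained sketch of the outer-join approach, applied directly to Series with differing indexes. The data and the names `prcp` and `tmax` are made up for illustration and stand in for the undefined `prcpSeries` and `tmaxSeries` above.

```python
import pandas as pd

# Hypothetical sample data: the two Series share only the label "tue".
prcp = pd.Series([0.1, 0.0, 2.3], index=["mon", "tue", "wed"], name="prcp")
tmax = pd.Series([21.0, 19.5], index=["tue", "thu"], name="tmax")

# Outer join on the index: every label from either Series is kept,
# and missing combinations are filled with NaN.
df = pd.concat([prcp, tmax], axis=1, join="outer")

print(df)
```

Each Series name becomes a column name, so no separate `columns=` argument is needed here.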

tyuyoshi · 2021-08-29T20:46:22Z

Sorry, but I have already opened PR #42960.
I'm going to wait for review, so please take other issues.

@goyaldhara @Tarunssd @simonjayhawkins

… (#42960) * BUG: erroneous initialization of a DataFrame with Series objects #42818 * Modify: NaN is float object * Modify: decrease objects * Modify: format : * Modify: start index from 0 * Modify: grammar error * Modify: singular -> plural

raphaelquast added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jul 30, 2021

simonjayhawkins added the Constructors (Series/DataFrame/Index/pd.array Constructors), Indexing (Related to indexing on series/frames, not to indexes themselves) and Usage Question labels and removed the Bug and Needs Triage labels Jul 30, 2021

simonjayhawkins added the Docs and good first issue labels and removed the Usage Question label Aug 4, 2021

simonjayhawkins modified the milestones: 1.3.2, Contributions Welcome Aug 4, 2021

tyuyoshi pushed a commit to tyuyoshi/pandas that referenced this issue Aug 9, 2021

BUG: erroneous initialization of a DataFrame with Series objects pand…

a57a02c

…as-dev#42818

tyuyoshi mentioned this issue Aug 9, 2021

DOC: erroneous initialization of a DataFrame with Series objects #42818 #42960

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Aug 10, 2021

Stark-developer01 mentioned this issue Aug 28, 2021

DOC: erroneous initialization of a DataFrame with Series objects GH42818 #43271

Closed

4 tasks

github-actions bot assigned goyaldhara Aug 29, 2021

github-actions bot assigned Tarunssd Aug 29, 2021

github-actions bot assigned tyuyoshi Aug 29, 2021

phofl closed this as completed in #42960 Sep 8, 2021

rhshadrach mentioned this issue Sep 26, 2022

BUG: Wrong index constructor if new DF/Series is created from DF/Series #48674

Closed

3 tasks
