BUG: erroneous initialization of a DataFrame with Series objects #42818

Closed
2 of 3 tasks
raphaelquast opened this issue Jul 30, 2021 · 15 comments · Fixed by #42960
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Docs good first issue Indexing Related to indexing on series/frames, not to indexes themselves

@raphaelquast

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

x = pd.Series(["a", "b", "c"])
y = pd.Series([1, 2, 3])

pd.DataFrame(y, x)
#     0
# a NaN
# b NaN
# c NaN
pd.DataFrame(x, y)
#      0
# 1    b
# 2    c
# 3  NaN
pd.DataFrame(x.values, y.values)
#    0
# 1  a
# 2  b
# 3  c

Problem description

I would expect pd.Series objects to be valid inputs for the DataFrame constructor.
If this is not the case, a warning (or even an error) would be nice...

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : German_Austria.1252

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@raphaelquast raphaelquast added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 30, 2021
@jreback
Contributor

jreback commented Jul 30, 2021

Use a list of the Series.

What you are passing is actually valid, but the 2nd arg is the index.
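
A minimal sketch of that point, reusing the `x` and `y` from the report: the second positional argument of `pd.DataFrame` is `index`, a list of Series is read row by row, and a dict is one way to get one column per Series. The column names `"x"` and `"y"` are chosen here only for illustration.

```python
import pandas as pd

x = pd.Series(["a", "b", "c"])
y = pd.Series([1, 2, 3])

# The second positional argument is `index`, so these two calls are the same.
by_position = pd.DataFrame(y, x)
by_keyword = pd.DataFrame(data=y, index=x)
assert by_position.equals(by_keyword)

# A list of Series makes each Series one row of the frame.
rows = pd.DataFrame([x, y])
print(rows.shape)  # (2, 3)

# A dict of Series makes each Series one column.
cols = pd.DataFrame({"x": x, "y": y})
print(cols.shape)  # (3, 2)
```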

@simonjayhawkins simonjayhawkins added Constructors Series/DataFrame/Index/pd.array Constructors Indexing Related to indexing on series/frames, not to indexes themselves Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 30, 2021
@raphaelquast
Author

@jreback thanks for responding!

I don't fully get what you mean by "it is actually valid"... the result is really not what I expected
(and I know that the 2nd arg is supposed to be the index...)

Let me clarify this a bit more:
I was just expecting that passing a Series object would be similar to passing a list or a 1D numpy array.
To summarize:

import numpy as np
import pandas as pd

x = [1,2,3,4,5]
pd.DataFrame(x, index=x)                         # OK
pd.DataFrame(np.array(x), index=x)               # OK
pd.Series(x).to_frame().set_index(pd.Series(x))  # OK

pd.DataFrame(pd.Series(x), index=x)              # NOT OK
pd.DataFrame(dict(a=pd.Series(x)), index=x)      # NOT OK
pd.DataFrame([pd.Series(x)], index=x)            # NOT OK  (generates a 2D grid of values)
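
For reference, a small sketch of what the first two "NOT OK" calls produce. The Series keeps its default index 0..4, so when index 1..5 is requested only labels 1..4 find a match. The list-of-Series case is left out here, since its behavior is not something this sketch tries to pin down.

```python
import pandas as pd

x = [1, 2, 3, 4, 5]

# The Series carries its own default index 0..4, which is aligned to the
# requested index 1..5: labels 1..4 match, label 5 does not.
from_series = pd.DataFrame(pd.Series(x), index=x)
from_dict = pd.DataFrame(dict(a=pd.Series(x)), index=x)

print(from_series[0].tolist())  # [2.0, 3.0, 4.0, 5.0, nan]
print(from_dict["a"].tolist())  # [2.0, 3.0, 4.0, 5.0, nan]
```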

@phofl
Member

phofl commented Jul 30, 2021

The index of the Series is aligned with the index passed to the DataFrame. This causes the NaNs, since the two have no matching entries.

x = pd.Series([2, 1, 0])
y = pd.Series([1, 2, 3])

pd.DataFrame(y, x)

returns

   0
2  3
1  2
0  1

which should make clear what happens here.
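
One way to read this example, as a sketch rather than a description of the constructor's internals: for a single Series, the result holds the same values as reindexing the Series to the requested labels and then converting it to a frame.

```python
import pandas as pd

x = pd.Series([2, 1, 0])
y = pd.Series([1, 2, 3])

# Constructor with a Series as data and explicit index labels.
constructed = pd.DataFrame(y, x)

# Same values via an explicit label-based lookup on y's own index.
reindexed = y.reindex(x).to_frame()

print(constructed[0].tolist())  # [3, 2, 1]
print(reindexed[0].tolist())    # [3, 2, 1]
```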

@raphaelquast
Author

raphaelquast commented Aug 3, 2021

@phofl
aaaah, OK i got it 😄 sorry for the confusion!

So basically, if the data (e.g. y) already has an index defined, that index gets aligned to the new index (instead of being replaced by it).
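
If the intent is to replace the index rather than align to it, two possible approaches are sketched below. The names `labels` and `values` are illustrative; both variants discard or overwrite the Series' own index before the frame is built.

```python
import pandas as pd

labels = pd.Series(["a", "b", "c"])
values = pd.Series([1, 2, 3])

# Option 1: pass the underlying array, which has no index to align on.
from_array = pd.DataFrame(values.to_numpy(), index=labels)

# Option 2: relabel the Series first, then convert it to a frame.
relabelled = values.set_axis(labels).to_frame()

print(from_array[0].tolist())  # [1, 2, 3]
print(relabelled[0].tolist())  # [1, 2, 3]
```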

I consider myself a long-term user of pandas, but this was not clear to me until now!
(Also, the doc of pd.DataFrame does not mention this special treatment of Series and DataFrame compared to other iterables.)

data ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.

not sure, but how about a comment in the doc of pd.DataFrame to clarify this?
...or maybe just an Example that uses Series or mixed objects to construct a DataFrame... something like:

>>> pd.DataFrame(dict(a=[1, 2, 3, 4, 5],
...                   b=pd.Series([3, 4, 5], index=[3, 4, 5])),
...              index=[1, 2, 3, 4, 5])
   a    b
1  1  NaN
2  2  NaN
3  3  3.0
4  4  4.0
5  5  5.0

@simonjayhawkins
Member

PRs always welcome to clarify/enhance the documentation.

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, Contributions Welcome Aug 4, 2021
@tyuyoshi
Contributor

tyuyoshi commented Aug 5, 2021

Hi, can I pick this issue as my first OSS contribution?

@simonjayhawkins
Member

@tyuyoshi sure. go for it!

@dhivyadharshin

Can I fix the issues?

@Stark-developer01

@tyuyoshi sure. go for it!

I have proposed a PR at #43271. Can its pending workflows be approved?

@rajaryanbanka

Hi, I would like to take up this problem as my first open source contribution.

@ankitasankars

Hello, Can I use this issue as my first contribution?

@goyaldhara

take

@Tarunssd

take

@amanj7820

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
df = prcpSeries.to_frame(name='prcp')
An alternative here is to create each as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A    B
0  1    1
1  2  NaN
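
A self-contained sketch of the outer-join idea, passing the Series to `pd.concat` directly. The data is made up: `prcp_series` and `tmax_series` are hypothetical stand-ins for the `prcpSeries` and `tmaxSeries` above, with deliberately different indexes so the alignment is visible.

```python
import pandas as pd

# Hypothetical data; the values and day labels are invented for illustration.
prcp_series = pd.Series([0.0, 1.2, 3.4], index=["mon", "tue", "wed"], name="prcp")
tmax_series = pd.Series([21.0, 19.5], index=["tue", "wed"], name="tmax")

# Outer join on the index: labels missing from one Series become NaN.
df = pd.concat([prcp_series, tmax_series], axis=1, join="outer")
print(df)
```

Here the row for `"mon"` has no `tmax` value, so that cell is NaN, while the Series names become the column names.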

@tyuyoshi
Contributor

Sorry, but I have already opened PR #42960.
I'm going to wait for review, so please take other issues.

@goyaldhara @Tarunssd @simonjayhawkins

phofl pushed a commit that referenced this issue Sep 8, 2021
… (#42960)

* BUG: erroneous initialization of a DataFrame with Series objects #42818

* Modify: NaN is float object

* Modify: decrease objects

* Modify: format :

* Modify: start index from 0

* Modify: grammar error

* Modify: singular -> plural