
read_csv epoch parsing with pd.to_datetime() #3757

Closed
nipunbatra opened this issue Jun 5, 2013 · 16 comments · Fixed by #3771

@nipunbatra
Contributor

References #3764 , #3540

!head 001EC00CC49D_processed        

timestamp,frequency,voltage,active_power,energy,cost,current,reactive_power,apparent_power,power_factor,phase_angle
1369210417,49.99,234.187,1.138,0.131,0.000,0.014,3.036,3.242,0.351,69.458
1369210418,49.98,234.276,1.043,0.131,0.000,0.014,3.183,3.350,0.311,71.855
1369210419,49.97,234.306,1.043,0.132,0.000,0.014,3.100,3.271,0.319,71.411
1369210420,49.97,234.288,1.045,0.132,0.000,0.014,3.155,3.324,0.315,71.668
1369210421,49.98,234.330,1.047,0.133,0.000,0.014,3.099,3.271,0.320,71.332
1369210422,49.97,234.228,1.004,0.134,0.000,0.014,3.140,3.296,0.304,72.275
1369210423,49.99,234.260,1.036,0.134,0.000,0.013,2.986,3.161,0.328,70.861
1369210424,49.97,234.292,1.073,0.135,0.000,0.014,3.089,3.270,0.328,70.842
1369210425,49.98,234.259,1.073,0.135,0.000,0.014,3.089,3.270,0.328,70.839

import datetime
import pandas as pd

def epoch_to_date(timestamp):
    # Per-row conversion via the stdlib (interprets the epoch in local time).
    return datetime.datetime.fromtimestamp(int(timestamp))

def epoch_to_date_2(timestamp):
    # Per-row conversion via pandas, scaling seconds to nanoseconds.
    return pd.to_datetime(float(timestamp) * int(1e9))

df = pd.read_csv('001EC00CC49D_processed', parse_dates=[0],
                 date_parser=epoch_to_date, index_col=0, error_bad_lines=False)
df2 = pd.read_csv('001EC00CC49D_processed', parse_dates=[0],
                  date_parser=epoch_to_date_2, index_col=0, error_bad_lines=False)

df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 327818 entries, 2013-05-22 13:43:37 to 2013-06-05 12:22:16
Data columns (total 10 columns):
frequency         327810  non-null values
...
dtypes: float64(9), object(1)

df2
<class 'pandas.core.frame.DataFrame'>
Index: 327818 entries, 1.369210417e+18 to 1.370415136e+18
Data columns (total 10 columns):
frequency         327810  non-null values
....
dtypes: float64(9), object(1)

Using pd.to_datetime() this way took more time, and the resulting index is not a DatetimeIndex.
Moreover, since a lot of people record Unix timestamps to avoid dealing with different date formats, it might be handy to have this functionality built in.
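For reference, a minimal sketch of what built-in epoch handling could look like, assuming a pandas version where `pd.to_datetime` accepts a `unit` argument: read the epoch column as plain integers (no per-row `date_parser`), then convert the whole index in one vectorized call. The inline CSV here is a trimmed stand-in for the file above.

```python
import io

import pandas as pd

# Two rows of the data above, trimmed to two columns for illustration.
csv = io.StringIO(
    "timestamp,frequency,voltage\n"
    "1369210417,49.99,234.187\n"
    "1369210418,49.98,234.276\n"
)

# Read the epoch column as ordinary integers (no per-row parser)...
df = pd.read_csv(csv, index_col=0)

# ...then convert the whole index at once, treating values as epoch seconds.
df.index = pd.to_datetime(df.index, unit='s')

print(df.index[0])  # 2013-05-22 08:13:37 (naive, UTC wall time)
```

Note that the result is naive but represents UTC instants, which is why it differs from the `datetime.fromtimestamp` output above (that call uses the machine's local time zone).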

@jreback
Contributor

jreback commented Jun 5, 2013

This is a MUCH faster way of doing this

In [17]: df=pd.read_csv('proc.txt',index_col=0,error_bad_lines=False)

In [18]: df
Out[18]: 
            frequency  voltage  active_power  energy  cost  current  reactive_power  apparent_power  power_factor  phase_angle
timestamp                                                                                                                     
1369210417      49.99  234.187         1.138   0.131     0    0.014           3.036           3.242         0.351       69.458
1369210418      49.98  234.276         1.043   0.131     0    0.014           3.183           3.350         0.311       71.855
1369210419      49.97  234.306         1.043   0.132     0    0.014           3.100           3.271         0.319       71.411
1369210420      49.97  234.288         1.045   0.132     0    0.014           3.155           3.324         0.315       71.668
1369210421      49.98  234.330         1.047   0.133     0    0.014           3.099           3.271         0.320       71.332
1369210422      49.97  234.228         1.004   0.134     0    0.014           3.140           3.296         0.304       72.275
1369210423      49.99  234.260         1.036   0.134     0    0.013           2.986           3.161         0.328       70.861
1369210424      49.97  234.292         1.073   0.135     0    0.014           3.089           3.270         0.328       70.842
1369210425      49.98  234.259         1.073   0.135     0    0.014           3.089           3.270         0.328       70.839

In [19]: df.index = pd.to_datetime((df.index.values*1e9).astype(int))

In [20]: df
Out[20]: 
                     frequency  voltage  active_power  energy  cost  current  reactive_power  apparent_power  power_factor  phase_angle
2013-05-22 08:13:37      49.99  234.187         1.138   0.131     0    0.014           3.036           3.242         0.351       69.458
2013-05-22 08:13:38      49.98  234.276         1.043   0.131     0    0.014           3.183           3.350         0.311       71.855
2013-05-22 08:13:39      49.97  234.306         1.043   0.132     0    0.014           3.100           3.271         0.319       71.411
2013-05-22 08:13:40      49.97  234.288         1.045   0.132     0    0.014           3.155           3.324         0.315       71.668
2013-05-22 08:13:41      49.98  234.330         1.047   0.133     0    0.014           3.099           3.271         0.320       71.332
2013-05-22 08:13:42      49.97  234.228         1.004   0.134     0    0.014           3.140           3.296         0.304       72.275
2013-05-22 08:13:43      49.99  234.260         1.036   0.134     0    0.013           2.986           3.161         0.328       70.861
2013-05-22 08:13:44      49.97  234.292         1.073   0.135     0    0.014           3.089           3.270         0.328       70.842
2013-05-22 08:13:45      49.98  234.259         1.073   0.135     0    0.014           3.089           3.270         0.328       70.839

@jreback
Contributor

jreback commented Jun 5, 2013

@nipunreddevil also: your conversion using pd.to_datetime needs an integer (not a float) in order to create a DatetimeIndex, and it is slower because it tries several formats per value; converting all at once (as I did above) is much faster in any event.
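A minimal sketch of the all-at-once pattern, assuming the index already holds epoch seconds as integers: multiplying by 1e9 promotes the values to float64, so an explicit cast back to int64 is needed before pd.to_datetime treats them as nanosecond timestamps.

```python
import numpy as np
import pandas as pd

# Epoch seconds, as they come out of read_csv for this file.
secs = np.array([1369210417, 1369210418])

# secs * 1e9 is float64; cast back to int64 so the values are
# interpreted as integer nanoseconds since the epoch.
ns = (secs * 1e9).astype('int64')

idx = pd.to_datetime(ns)
print(idx[0])  # 2013-05-22 08:13:37
```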

@nipunbatra
Contributor Author

Great!
Can there be a dedicated section for such stuff in the official documentation, lest I keep bugging you guys :)
It would be great if this use of pd.to_datetime for such CSVs were abstracted away from the user.
Also, how do you post code here on issues? I struggle with copying stuff from the notebook and then indenting 4 spaces per line.
Should this be closed?

@jreback
Contributor

jreback commented Jun 5, 2013

Click on the 'github flavored markdown' link to the right of the comment box; it will tell you how.

This might make a nice little doc for the cookbook or main docs, want to put a short example together?

It is really hard for pandas to 'figure out' what the user wants sometimes; that's why we have these options!
That said, it should be possible for the user to indicate that a date column is an epoch date and have pandas do the right thing.

If you could post a small example I'll update the docs, then close

thanks

@nipunbatra
Contributor Author

Do you have a recommended dataset, or should I put up my own (I can't put up the whole thing since it is not published yet)?
Would surely like to contribute!


@jreback
Contributor

jreback commented Jun 5, 2013

Just make up a simple example, e.g. take the first 10 rows and first couple of columns from your dataset (replace with random data), and rename the columns to A, B, C, D...

@nipunbatra
Contributor Author

Here is a quick attempt from my side: http://nbviewer.ipython.org/5714493
I haven't talked about time zones, as I am not very sure about them myself as of now.
The gist is located here: https://gist.github.com/nipunreddevil/5714493
Let me know what can or should be added.


@jreback
Contributor

jreback commented Jun 5, 2013

I would just do it using read_csv; after all, this is pandas!
Make it a bit shorter if you can. Can we link directly to this viewer (e.g. is it 'permanent')?
I am just going to put a link (and short description) here:
http://pandas.pydata.org/pandas-docs/dev/cookbook.html

@nipunbatra
Contributor Author

I think this should be permanent. nbviewer doesn't host the content itself; it only pulls it from the source (in my case, a gist). It's pretty much the standard way, so it should be up if GitHub is up!
OK, I will make it shorter.
My pandas version (0.11.0) doesn't allow me to specify a time zone in .to_datetime() (reference #3746), so I haven't added that bit yet.


@jreback
Contributor

jreback commented Jun 5, 2013

OK... do you want to do a PR to add it?
Since you are providing a UTC date (by definition, epoch seconds are UTC), you can do:
index.tz_localize('your_time_zone')
to localize (you can show that too).
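A sketch of that localization step, assuming the epoch-derived index is naive but represents UTC instants; 'Asia/Kolkata' stands in for 'your_time_zone' (it matches the local-time output seen in the original df above):

```python
import pandas as pd

# Epoch seconds -> naive DatetimeIndex; the instants are UTC by definition.
idx = pd.to_datetime([1369210417], unit='s')

# Mark the naive index as UTC, then convert for local display.
local = idx.tz_localize('UTC').tz_convert('Asia/Kolkata')
print(local[0])  # 2013-05-22 13:43:37+05:30
```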

@nipunbatra
Contributor Author

I haven't done a Pull Request before. It would be great if you could help with that.


@jreback
Contributor

jreback commented Jun 5, 2013

Read this; it pretty much explains how to add to the docs: http://pandas.pydata.org/developers.html

@nipunbatra
Contributor Author

Will get back pretty soon after reviewing it.


@jreback
Contributor

jreback commented Jun 6, 2013

thanks for the PR!

@nipunbatra
Contributor Author

Thanks for merging my first PR in a big project!
Is there any other similar stuff where I can do a PR?


@jreback
Contributor

jreback commented Jun 6, 2013

in issues, filter on 'Good as first PR' / DOCS

nipunbatra referenced this issue in nilmtk/nilmtk Dec 7, 2013
* Can now load arbitrary channel.
* Loads mains data into memory.
* added a `load_redd.py` example.

Also changed a small detail in the comments for Electricity:
* REDD channels are indexed from 1 so I figured that we should
  probably index channels and meters from 1 too.