-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for openpyxl when reading XLSX (Excel 2010) files #11499
Comments
@Themanwithoutaplan : Sorry that this went unnoticed for AWHILE ! 😮 Is this something you would still be interested in pursuing? |
@gfyoung Sure. We've got a sprint coming up this weekend where I'll be working on this. I'd basically like to see the packages work together using explicit APIs rathe than the current, largely unavoidable, spaghetti code. Seems to me that a telco would make sense. |
That makes a lot of sense to me. I don't think any of us would object to making the Excel reading process a lot nicer to use and understand. |
I suspect that the XLS code will have to stay as it is because xlrd is now effectively abandonware. Personally, I'd think about deprecating it as a format… but anyway. 2.6 will add the From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier. If you're around later today (I'm +0100) then we could perhaps discuss this over Hangouts or some such. |
I think I see what you're saying, but I suggest opening an issue for this separately so as to explain more what you mean. |
FWIW I've now released an alpha of 2.6 and backported the I'm not sure whether being able to run in parallel is useful for Pandas, nor whether this would be best using threads or processes: there is a lot of I/O but also quite a lot of CPU work. I'll also create a separate ticket for what I'd like to see in the Pandas API. |
@Themanwithoutaplan I've been working to decouple xlrd in this code and have implemented a base class for excel reading which you can see below: Line 379 in 3fd47fe
Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out |
Looks a lot better (especially the writer code that doesn't have to worry about the different ways we've handled styles). The reader code is definitely less tightly coupled to xlrd than it was. You will probably want to use I notice you've got your own code for converting Excel coords to numerical indices but I think all the libraries their own robust functions you could probably use (ok, openpyxl uses 1-indexing) and your own evil code for handling headers with multiple layers. I'd really like to see this in the dataframe or index API, or anything that openpyl can use directly. |
Well if no one else has started this I'll see if I have some time this weekend to make an openpyxl reader. Hopefully I can reuse all the same tests, just swap out the engine. |
Congratulations! |
@selik can you open a PR or new issue for that? Otherwise it's likely to not get done. |
@TomAugspurger Looks like someone took care of it, but it's not reflected in v0.25.3 |
I took a brief look at the existing code for importing files with xlrd but it looks tightly coupled with that library and not particularly straightforward to adapt.
I think that the read-only mode of openpyxl would be a good fit for Pandas and would like to work with you to provide the necessary API to go from rows to a dataframe. Data is stored by row so this is the only sensible approach. We will be adding read-support for NumPy types, ie. when passed them, but probably not when reading Excel files.
The text was updated successfully, but these errors were encountered: